Foresite Labs

Staff Engineer, CI/CD & Cloud Infrastructure

San Diego $175K–$185K Posted 2026-05-01
Salary: $175K–$185K
Type: Full-time
Experience: 7+ yr
Source: Ashby
Location: San Diego, CA

Job Type: Full-Time

Salary Range: $175,000 - $185,000

Position Overview

We are looking for a Staff CI/CD & Cloud Infrastructure Engineer to own and evolve our build pipelines, deployment workflows, and cloud infrastructure. You will be responsible for ensuring that software — spanning Python, C/C++, and CUDA on Linux — is built, tested, versioned, and deployed reliably across both AWS cloud environments and a fleet of complex embedded instruments operated in our central lab facility.

This is a senior hands-on role for an engineer who thrives at the intersection of DevOps automation, cloud infrastructure management, and release engineering. You will design and maintain CI/CD pipelines, manage complex AWS infrastructure as code, and ensure full traceability from source commits through builds, tests, artifacts, and deployments. You will work cross-functionally with firmware, application, and HPC engineers to keep the entire delivery pipeline fast, reliable, and observable.

Key Responsibilities

CI/CD & Build Engineering

- Design, build, and maintain CI/CD pipelines using GitHub Actions or similar platforms

- Manage build systems for Python, C/C++, and CUDA codebases on Linux

- Integrate build tools (CMake, Make, pip, setuptools) into automated pipelines

- Implement robust versioning, tagging, and artifact management strategies

- Ensure full traceability of builds, test results, and artifacts from commit to deployment

- Manage Docker-based build environments including base images, caching, and reproducibility

- Maintain and optimize build performance, parallelism, and reliability

Cloud Infrastructure (AWS)

- Architect and manage complex AWS infrastructure including:

  - IAM roles, policies, and access management

  - Storage services (S3, EBS, EFS) with tiered lifecycle policies

  - Databases (RDS, DynamoDB, or similar) with backup and failover strategies

  - Data workflow and pipeline engines (Step Functions, Airflow, or similar)

  - Compute services (EC2, ECS, EKS, Lambda) scaled to workload requirements

- Implement infrastructure as code using Terraform

- Manage Kubernetes clusters and Helm charts for containerized workloads

- Design for scalability, high availability, and disaster recovery

- Manage cost optimization, resource tagging, and infrastructure governance

- Support multi-account and multi-region strategies as needed

- Familiarity with Azure and GCP for secondary or hybrid requirements

On-Premises HPC & Hybrid Infrastructure

- Provision, configure, and manage on-premises Linux HPC nodes used for secondary and tertiary data processing

- Define infrastructure-as-code (Terraform, Ansible, or similar) for reproducible HPC node provisioning and configuration

- Manage high-speed networking infrastructure between instruments, HPC nodes, and storage (configuration, monitoring, troubleshooting)

- Implement and manage shared storage systems (NFS, parallel filesystems, or similar) accessible to both local HPC and cloud compute

- Design and operate hybrid burst-to-cloud infrastructure — provision and manage AWS compute resources that extend local HPC capacity on demand

- Collaborate with the data pipeline team to ensure infrastructure meets throughput, latency, and reliability requirements

- Manage OS patching, driver updates, and GPU runtime environments across HPC nodes

- Monitor HPC cluster health, utilization, and capacity to inform scaling decisions

Experiment Data Management & Pipelines

- Design and operate data ingestion pipelines for high-volume experiment data from lab instruments

- Implement tiered storage strategies (hot/warm/cold) to balance accessibility, performance, and cost

- Deploy and manage search infrastructure (Elasticsearch/OpenSearch) to make experiment data universally discoverable and queryable

- Build data cataloging and metadata tagging systems so datasets are well-organized and self-describing

- Integrate visualization tools (Grafana, Kibana, or similar) to enable engineers and scientists to explore and analyze experiment data

- Design data lifecycle policies including retention, archival, and compliance requirements

- Ensure data pipelines are reliable, idempotent, and observable with clear error handling and retry logic

- Work with engineering and science teams to define data schemas, access patterns, and query requirements

Deployment & Release Engineering

- Own deployment workflows for software delivered to embedded instruments in our central lab

- Manage release processes for a small number of complex, high-value lab-operated instruments

- Design deployment strategies that account for rollback, validation, and minimal downtime

- Coordinate versioned releases across multiple software components and dependencies

- Support development, staging, and production environment parity

Logging, Observability & Traceability

- Implement centralized log collection and aggregation across cloud and on-site systems

- Deploy and manage observability tooling (Prometheus, Grafana, Loki, CloudWatch, or similar)

- Ensure structured, searchable logging with clear correlation across services

- Build dashboards and alerting for infrastructure health, pipeline status, and deployment state

- Establish traceability standards linking builds, tests, artifacts, and deployments

- Support diagnostics and post-mortem analysis for production incidents

AI-Augmented DevOps

- Integrate agentic AI tools into CI/CD workflows to automate code review, test generation, and pipeline troubleshooting

- Evaluate and deploy AI-powered assistants for infrastructure management, incident response, and operational tasks

- Design guardrails and human-in-the-loop controls for AI-driven automation in production environments

- Stay current with the rapidly evolving landscape of AI-augmented development and DevOps tooling

- Champion adoption of agentic AI across engineering workflows to accelerate delivery and improve reliability

Qualifications

Education:

BS/MS in Computer Science or Engineering

Required:

Experience & Technical Skills

- 7+ years of experience in DevOps, CI/CD, or cloud infrastructure roles

- Strong, hands-on Linux expertise (administration, debugging, performance tuning)

- Deep experience designing and operating CI/CD pipelines (GitHub Actions preferred)

- Proven experience managing complex AWS infrastructure at scale

- Strong knowledge of Docker including multi-stage builds, registries, and orchestration

- Experience with infrastructure as code using Terraform

- Experience with Kubernetes and Helm for container orchestration

- Solid understanding of versioning strategies, artifact management, and release engineering

- Experience integrating agentic AI into DevOps workflows and CI/CD pipelines



Programming & Build Systems

- Proficiency in Python and shell scripting for automation and tooling

- Ability to read, debug, and build C/C++ and CUDA applications on Linux

- Experience integrating build systems (CMake, Make) into CI pipelines

- Familiarity with package management and dependency resolution across languages



Cloud & Infrastructure

- Deep AWS experience across IAM, networking (VPC, security groups), storage, compute, and database services

- Experience managing on-premises Linux HPC infrastructure alongside cloud resources

- Experience designing for high availability, failover, and disaster recovery

- Experience with data pipeline and workflow orchestration tools (Step Functions, Airflow, or similar)

- Experience with search and indexing platforms (Elasticsearch, OpenSearch, or similar)

- Understanding of tiered storage strategies and data lifecycle management

- Knowledge of cost management, tagging strategies, and infrastructure governance



Observability & Traceability

- Experience with logging and monitoring stacks (Prometheus, Grafana, Loki, ELK, or CloudWatch)

- Understanding of build and artifact traceability practices

- Experience with structured logging and distributed tracing concepts



Preferred:

- Experience deploying software to embedded or lab-operated instruments

- Experience with high-speed networking (InfiniBand, RDMA, or 10/25/100GbE) in HPC environments

- Experience with CUDA build toolchains and GPU-accelerated workloads

- Familiarity with Azure or GCP in addition to AWS

- Experience in regulated or reliability-sensitive environments

- Experience with GitOps workflows and progressive delivery strategies

- Familiarity with secrets management (Vault, AWS Secrets Manager)


We are an equal opportunity employer. We thrive on diversity and collaboration.