Institute of Foundation Models

HPC Engineer

Sunnyvale, CA $150K–$300K Posted 2026-06-01
Type: Full-time
Source: Lever
• Monitor the health, performance, and availability of large-scale GPU clusters.
• Respond to incidents and perform first-level triage.
• Support researchers and troubleshoot job failures.
• Execute operational runbooks and recovery procedures.
• Validate cluster deployments, upgrades, and maintenance activities.
• Track infrastructure utilization and operational metrics.
• Develop automation and monitoring tools.
• Contribute to documentation and reporting.
Bachelor's degree in Computer Science, Computer Engineering, Software Engineering, Information Technology, Electrical Engineering, Mathematics, Physics, or related disciplines.
• 2+ years of experience in Linux systems administration, SRE, DevOps, cloud operations, HPC, or infrastructure operations.
• Strong Linux troubleshooting skills.
• Scripting experience in Python or Bash.
• Slurm.
• GPU infrastructure.
• AWS, Azure, or GCP.
• Grafana, Prometheus, Datadog, or similar tools.
• Containers and Kubernetes.
• Exposure to AI/ML infrastructure.
• Research computing environments.
Python, AWS, Azure, GCP, Kubernetes