Zoox, Incorporated

Senior AI Inference Engineer - Model Optimization & Deployment

Zoox, Incorporated · Foster City, CA
Foster City, CA was $242K–$290K Closed
Applications are closed for this role. It was originally posted 2026-04-11. It’s no longer accepting applicants — see roles Zoox, Incorporated is still hiring for →, or browse the live openings below.
Salary
$242K–$290K
Type
Full-time
Experience
5+ yr

The Perception team is pioneering the development of a multi-modality foundation model to drive the next generation of autonomous system intelligence.

As a Model Optimization & Deployment Engineer, you will focus on bringing highly efficient, production-ready large-scale models to our on-vehicle stack. We are looking for experts with hands-on experience in compressing, accelerating, and deploying complex models (LLMs, VLMs, or FMs) for power- and thermal-constrained vehicle SOCs. You will optimize the ML models, write custom CUDA kernels, and build highly concurrent inference code to ensure real-time, deterministic execution on edge devices.

About Zoox
Zoox is developing the first ground-up, fully autonomous vehicle fleet and the supporting ecosystem required to bring this technology to market. Sitting at the intersection of robotics, machine learning, and design, Zoox aims to provide the next generation of mobility-as-a-service in urban environments. We’re looking for top talent that shares our passion and wants to be part of a fast-moving and highly execution-oriented team.

Follow us on LinkedIn

Accommodations
If you need an accommodation to participate in the application or interview process please reach out to [email protected] or your assigned recruiter.

A Final Note:
You do not need to match every listed expectation to apply for this position. Here at Zoox, we know that diverse perspectives foster the innovation we need to be successful, and we are committed to building a team that encompasses a variety of backgrounds, experiences, and skills.

  • Optimize large-scale models (Multi-Modal Sensor Fusion models, LLMs, VLMs) using advanced quantization (PTQ, QAT), pruning, mixed-precision inference frameworks, and parameter-efficient fine-tuning (LoRA, QLoRA).
  • Architect and implement model conversion and compilation pipelines using TensorRT for edge deployment.
  • Perform rigorous parity checking, accuracy recovery, and latency benchmarking between PyTorch frameworks and compiled edge binaries.
  • Develop and optimize custom ML OPs and TensorRT Plugins with efficient CUDA kernels to minimize latency and maximize memory bandwidth on AI accelerators.
  • Write production-level, low latency, and memory-safe C++ and CUDA code for real-time inference on vehicle systems.
  • Deep expertise in model quantization (PTQ, QAT) and mixed-precision inference frameworks (INT8, FP8, FP4, BF16/FP16).
  • Proven experience optimizing large-scale models (Multi-Modal Sensor Fusion models, LLMs, VLMs/VLAs) utilizing Efficient Attention mechanisms (e.g., FlashAttention, Linear Attention), KV-cache optimization (e.g., PagedAttention) and Speculative Decoding.
  • Extensive experience with model conversion/compilation pipelines (e.g., ONNX, TensorRT, torch.compile) and performing rigorous latency benchmark and model quality parity valuation.
  • Proficiency in low-level programming for AI accelerators, specifically developing and optimizing custom ML OPs and TensorRT Plugins with efficient CUDA kernel implementations.
  • Production-level C++ (14/17/20) and Python programming skills, with experience developing concurrent, memory-safe, real-time inference code for edge devices.
  • Familiarity with SOTA autonomous driving perception algorithms (temporal 3D object detection, BEV, 3D Occupancy Networks) and multi-modal sensor processing (Vision, LiDAR, Radar).
  • Experience with distributed training pipelines and model/tensor parallelism (PyTorch Distributed, Ray, DeepSpeed, Megatron-LM) and runtime efficiency optimization for GPU clusters.
  • Experience with end-to-end autonomous driving paradigms (VLM/VLA models, Foundation models) and edge deployment technologies (e.g., TensorRT-LLM).

Base Salary Range

There are three major components to compensation for this position: salary, Amazon Restricted Stock Units (RSUs), and Zoox Stock Appreciation Rights. A sign-on bonus may be offered as part of the compensation package. The listed range applies only to the base salary. Compensation will vary based on geographic location and level. Leveling, as well as positioning within a level, is determined by a range of factors, including, but not limited to, a candidate's relevant years of experience, domain knowledge, and interview performance. The salary range listed in this posting is representative of the range of levels Zoox is considering for this position.

Zoox also offers a comprehensive package of benefits, including paid time off (e.g. sick leave, vacation, bereavement), unpaid time off, Zoox Stock Appreciation Rights, Amazon RSUs, health insurance, long-term care insurance, long-term and short-term disability insurance, and life insurance.

PyTorchPythonLLMC++
O
Vehicle Line Director
Foster City, CA Hybrid
Operations
$292K–$351K
O
Manager, RL Algorithms & Decoder
Foster City, CA Hybrid
Operations
$277K–$349K
E
Senior/Staff Software Engineer - Dynamic Agent Interactions
Foster City, CA Hybrid
Engineering
$242K–$349K
See all live roles at Zoox, Incorporated →
G
Senior Software Engineer - AI Compute Libraries & Performance
Graphcore Limited Gdańsk, Pomeranian Voivodeship, Poland
Engineering
$260K–$352K
C
Senior Software Engineer - Perf and Benchmarking
CoreWeave Sunnyvale, CA
Engineering
$182K–$242K
T
LLM Inference Frameworks and Optimization Engineer
Together Computer San Francisco, CA
Engineering
$160K–$230K
C
Inference System & Performance Engineer - Member of Technical Staff
Callosum Technologies London, UK
Engineering
$101K–$192K
See all Engineering roles →
Applications closed