Type: Full-time

ABOUT THE ROLE

We’re looking for a Machine Learning Engineer to own and push the limits of model inference performance at scale. You’ll work at the intersection of research and production—turning cutting-edge models into fast, reliable, and cost-efficient systems that serve real users.

This role is ideal for someone who enjoys deep technical work, profiling systems down to the kernel/GPU level, and translating research ideas into production-grade performance gains.

WHAT YOU’LL DO

- Optimize inference latency, throughput, and cost for large-scale ML models in production

- Profile GPU/CPU inference pipelines and remove bottlenecks (memory, kernels, batching, IO)

- Implement and tune techniques such as:

  - Quantization (fp16, bf16, int8, fp8)

  - KV-cache optimization & reuse

  - Speculative decoding, batching, and streaming

  - Model pruning or architectural simplifications for inference

- Collaborate with research engineers to productionize new model architectures

- Build and maintain inference-serving systems (e.g. Triton, custom runtimes, or bespoke stacks)

- Benchmark performance across hardware (NVIDIA / AMD GPUs, CPUs) and cloud setups

- Improve system reliability, observability, and cost efficiency under real workloads

WHAT WE’RE LOOKING FOR

- Strong experience in ML inference optimization or high-performance ML systems

- Solid understanding of deep learning internals (attention, memory layout, compute graphs)

- Hands-on experience with PyTorch (or similar) and model deployment

- Familiarity with GPU performance tuning (CUDA, ROCm, Triton, or kernel-level optimizations)

- Experience scaling inference for real users (not just research benchmarks)

- Comfort with ownership and ambiguity in fast-moving startup environments

NICE TO HAVE

- Experience with LLM or long-context model inference

- Knowledge of inference frameworks (TensorRT, ONNX Runtime, vLLM, Triton)

- Experience optimizing across different hardware vendors

- Open-source contributions in ML systems or inference tooling

- Background in distributed systems or low-latency services

WHY JOIN US

- Real ownership over performance-critical systems

- Direct impact on product reliability and unit economics

- Close collaboration with research, infra, and product

- Competitive compensation + meaningful equity at Series A

- A team that cares about engineering quality, not hype

Tech stack

PyTorch, LLM

About Featherless AI

Featherless AI is hiring for the Machine Learning Engineer, Inference Optimization role.

Machine Learning Engineer — Inference Optimization