Fabrion

ML Ops Engineer — Agentic AI Lab

San Francisco, CA Posted 2025-08-11
Type: Full-time
Source: Ashby
ML Ops Engineer — Agentic AI Lab (Founding Team)

Location: San Francisco Bay Area

Type: Full-Time

Compensation: Competitive salary + meaningful equity (founding tier)

Backed by 8VC, we're building a world-class team to tackle one of the industry’s most critical infrastructure problems.

ABOUT THE ROLE

Our AI Lab is pioneering the future of intelligent infrastructure through open-source LLMs, agent-native pipelines, retrieval-augmented generation (RAG), and knowledge-graph-grounded models.

We’re hiring an ML Ops Engineer to be the glue between ML research and production systems — responsible for automating the model training, deployment, versioning, and observability pipelines that power our agents and AI data fabric.

You’ll work across compute orchestration, GPU infrastructure, fine-tuned model lifecycle management, model governance, and security e

Responsibilities

- Build and maintain secure, scalable, and automated pipelines for:

  - LLM fine-tuning, SFT, LoRA, RLHF, DPO training

  - RAG embedding pipelines with dynamic updates

  - Model conversion, quantization, and inference rollout

- Manage hybrid compute infrastructure (cloud, on-prem, GPU clusters) for training and inference workloads using Kubernetes, Ray, and Terraform

- Containerize models and agents using Docker, with reproducible builds and CI/CD via GitHub Actions or ArgoCD

- Implement and enforce model governance: versioning, metadata, lineage, reproducibility, and evaluation capture

- Create and manage evaluation and benchmarking frameworks (e.g. OpenLLM-Evals, RAGAS, LangSmith)

- Integrate with security and access control layers (OPA, ABAC, Keycloak) to enforce model policies per tenant

- Instrument observability for model latency, token usage, performance metrics, error tracing, and drift detection

- Support deployment of agentic apps with LangGraph, LangChain, and custom inference backends (e.g. vLLM, TGI, Triton)

DESIRED EXPERIENCE

Model Infrastructure:

- 4+ years in MLOps, ML platform engineering, or infra-focused ML roles

- Deep familiarity with model lifecycle management tools: MLflow, Weights & Biases, DVC, HuggingFace Hub

- Experience with large model deployments (open-source LLMs preferred): LLaMA, Mistral, Falcon, Mixtral

- Comfortable with tuning libraries (HuggingFace Trainer, DeepSpeed, FSDP, QLoRA)

- Familiarity with inference serving: vLLM, TGI, Ray Serve, Triton Inference Server



Automation + Infra:

- Proficient with Terraform, Helm, K8s, and container orchestration

- Experience with CI/CD for ML (e.g. GitHub Actions + model checkpoints)

- Managed hybrid workloads across GPU cloud (Lambda, Modal, HuggingFace Inference, SageMaker)

- Familiar with cost optimization (spot instance scaling, batch prioritization, model sharding)



Agent + Data Pipeline Support:

- Familiarity with LangChain, LangGraph, LlamaIndex, or similar RAG/agent orchestration tools

- Built embedding pipelines for multi-source documents (PDF, JSON, CSV, HTML)

- Integrated with vector databases (Weaviate, Qdrant, FAISS, Chroma)

Security & Governance:

- Implemented model-level RBAC, usage tracking, audit trails

- Integrated with API rate limits, tenant billing, and SLA observability

- Experience with policy-as-code systems (OPA, Rego) and access layers

Preferred Stack

- LLM Ops: HuggingFace, DeepSpeed, MLflow, Weights & Biases, DVC

- Infra: Kubernetes (GKE/EKS), Ray, Terraform, Helm, GitHub Actions, ArgoCD

- Serving: vLLM, TGI, Triton, Ray Serve

- Pipelines: Prefect, Airflow, Dagster

- Monitoring: Prometheus, Grafana, OpenTelemetry, LangSmith

- Security: OPA (Rego), Keycloak, Vault

- Languages: Python (primary), Bash, optionally Rust or Go for tooling



Mindset & Culture Fit

- Builder's mindset with startup autonomy: you automate what slows you down

- Obsessive about reproducibility, observability, and traceability

- Comfortable with a hybrid team of AI researchers, DevOps, and backend engineers

- Interested in aligning ML systems to product delivery, not just papers

- Bonus: experience with SOC2, HIPAA, or GovCloud-grade model operations

WHAT WE’RE LOOKING FOR

Experience:

- 5+ years as a full stack or backend engineer

- Experience owning and delivering production systems end-to-end

- Prior experience with modern frontend frameworks (React, Next.js)

- Familiarity with building APIs, databases, cloud infrastructure, or deployment workflows at scale

- Comfortable working in early-stage startups or autonomous roles; prior experience as a founder, as a founding engineer, or at a 0-to-1 pre-seed startup is a big plus

Mindset:

- Comfortable with ambiguity, eager to prototype and iterate quickly

- Strong sense of ownership — prefers to build systems rather than wait for tickets

- Enjoys thinking about architecture, performance, and tradeoffs at every level

- Clear communicator and pragmatic team player

- Values equity and impact over prestige or hierarchy

- Prior startup or founding team experience



WHY THIS ROLE MATTERS

Your work will enable models and agents to be trained, evaluated, deployed, and governed at scale, across many tenants, models, and tasks. This is the backbone of a secure, reliable, and scalable AI-native enterprise system. If you dream about using AI to solve hard real-world problems, we would love to hear from you.