Type: Full-time

Experience: 5+ years

About this role

SENIOR BACKEND ENGINEER

San Francisco · On-site · Full-time

Judgment Labs is building the infrastructure for continual learning in long-horizon AI agents.

The next generation of agents will not improve from prompts alone. They will improve from experience: the tasks they attempt, the tools they use, the mistakes they make, the edge cases they encounter, and the outcomes they produce in production. The hard part is turning that raw experience into high-quality data that can actually improve the system.

Judgment builds the infrastructure to do that. We turn long agent trajectories into clean, structured data for evals, labeling, rubric generation, context engineering, and RL workflows. Instead of only showing teams what happened, Judgment helps them decide what matters, what should be learned from, and how that learning should flow back into the agent.

Databricks built the data infrastructure for analytics. Judgment is building the learning infrastructure for agents.

We’ve raised $30M+ from Lightspeed, SV Angel, Valor Equity Partners, and others.

THE ROLE

We’re looking for a Senior Backend Engineer to own the systems that ingest, structure, evaluate, and serve agent experience data at production scale.

This role spans the full backend and data infrastructure surface area: high-throughput telemetry ingestion, ClickHouse-backed OLAP performance, evaluation pipelines, RabbitMQ/Temporal workflows, multi-tenant scheduling, and product-facing APIs. Some weeks you’ll be deep in distributed systems and query performance. Other weeks you’ll ship a customer-facing feature end to end across backend, frontend, and the data layer.

This is not a narrow API role. The backend is where raw agent trajectories become structured learning data.

INTERESTING TECHNICAL CHALLENGES

High-throughput telemetry ingestion. Parse and persist OpenTelemetry (OTEL) traces in both protobuf and JSON formats at hundreds of thousands of spans per second, writing to ClickHouse while keeping ingest latency low and applying backpressure gracefully as customer traffic spikes.

Petabyte-scale OLAP performance. Design schemas, partitioning, indexes, storage layouts, and query paths so behavioral queries over billions of spans stay fast. Turn real access-pattern analysis into concrete data modeling decisions.

Long-horizon trajectory modeling. Agent workflows are messy: multi-step tasks, tool calls, retries, partial failures, context changes, and unclear outcomes. Build the abstractions that turn those trajectories into structured data for evals, labeling, rubric generation, context engineering, and RL workflows.

Queue- and workflow-driven evaluation. Evaluations fan out across RabbitMQ and Temporal workflows. Getting this right means reasoning about retries, timeouts, idempotent state transitions, and exactly-once-ish semantics, and reconciling runs that fail partway so nothing is silently orphaned.

Multi-tenant fairness at scale. A single large customer should not be able to starve everyone else. Build scheduling and execution systems so latency stays predictable across hundreds of teams sharing the same evaluation pipeline.

Near-real-time scoring. Behavioral scorers and agent judges call LLM APIs at scale, so batching, rate-limit management, retry/backoff, failure handling, and cost control are core backend systems problems.

Learning loops for agents. Build the product and systems layer that helps teams decide what matters, what should be learned from, and how that learning flows back into the agent.

WHAT YOU’LL DO

Design and build backend systems for trace ingestion, trajectory processing, evaluation orchestration, scoring, labeling, rubric generation, and customer-facing analytics.

Own the API surface used by the Judgment platform UI, SDKs, JudgmentHub libraries, MCP server, Slack agent, and customer integrations.

Build and operate the RabbitMQ / Temporal evaluation pipeline, including retry semantics, failure recovery, state reconciliation, and tenant-level scheduling.

Optimize the ClickHouse OLAP layer: schema design, partitioning, skip indexes, full-text-search pruning, query rewrites, deduplication, pagination correctness, and storage growth.

Turn raw spans, conversations, tool calls, scorer outputs, and agent-judge results into clean data models customers can use for evals, labeling, context engineering, and RL workflows.

Ship features end to end, often across Next.js, backend APIs, queues/workflows, and the data layer.

Work directly with customers to understand where their agents fail, what data is useful, and how Judgment should structure that experience for learning.

Roll out safely with feature flags, design docs, code reviews, tests, observability, and production debugging.

Raise the engineering bar through clear interfaces, maintainable systems, thoughtful reviews, and strong ownership.

WHAT WE’RE LOOKING FOR

Strong backend engineering experience building and operating production systems under real load.

Excellent fundamentals in distributed systems, API design, data modeling, reliability, and performance.

Experience working with high-volume event, trace, log, metric, or telemetry data.

Strong intuition for data systems: query patterns, storage layout, indexing, partitioning, latency, correctness, and cost.

Comfort owning systems beyond initial launch: debugging production issues, improving observability, scaling bottlenecks, and cleaning up abstractions as the product evolves.

Ability to work across backend, data, product, and infrastructure boundaries rather than treating them as separate silos.

Product judgment and willingness to ship across the stack when needed.

Clear communication. You can write a design doc, review a diff, explain a tradeoff, and unblock others without turning everything into process.

NICE TO HAVE

Experience with ClickHouse, OLAP systems, distributed query engines, or large-scale analytical databases.

Experience with RabbitMQ, Temporal, Kafka, Spark, Flink, Ray, Airflow, Dagster, Prefect, or similar queue/workflow/data systems.

Experience with OTEL, observability products, tracing, logging, or monitoring infrastructure.

Experience building systems that call LLM APIs at scale, including rate-limit management, retries, batching, and cost control.

Experience with LLM evaluation, labeling systems, rubric generation, context engineering, RL data pipelines, embedding pipelines, vector search, clustering, or anomaly detection.

Experience building developer-facing products, SDK-backed platforms, or customer-facing infrastructure.

WHY JUDGMENT?

We’re building the learning infrastructure for agents. As agents move from demos to production, the bottleneck is no longer just better prompts. It is turning real production experience into high-quality data for evals, labeling, rubric generation, context engineering, and RL workflows.

The technical problems are foundational. Long agent trajectories are messy, high-volume, and hard to reason about. We’re building the systems that ingest them, structure them, evaluate them, surface what matters, and feed that learning back into the agent.

This is a Databricks-scale infrastructure opportunity. Databricks built the data infrastructure for analytics. Judgment is building the learning infrastructure for agents.

You’ll work on problems customers actually feel. Engineers talk directly to teams building production agents, see where their systems fail, and turn those failures into product and infrastructure.

Small team, high ownership. You will not own a narrow slice. You’ll shape core systems early, ship quickly, and work across product, data, backend, infra, and customer environments.