About this role
Benchmarking & evaluation
• Design and run evaluations of agentic capabilities (multi-step reasoning, tool use, long-horizon planning, computer use, and safety properties), turning ambiguous notions of "intelligence" into defensible, reproducible metrics.
• Build and harden evaluation harnesses so benchmarks run reliably at scale against training checkpoints, with clear signal on regressions and model health.
• Run experiments characterizing how prompting, sampling, scaffolding, and environment design affect agentic performance on internal and public benchmarks.
• Diagnose anomalous eval results mid-training-run, determine whether the cause is the model, the data, the harness, or the infrastructure, and communicate the answer clearly.
Agentic data
• Source, generate, and curate high-quality agentic training data: trajectories, tool-use traces, and task datasets for new capabilities.
• Design and scale RL environments and reward signals, and measure their impact on model performance.
• Manage technical relationships with external data vendors and domain experts, evaluating data quality and iterating quickly on feedback.
• Develop QA frameworks that catch reward hacking, label noise, and contamination, keeping data and benchmark quality high.
Across both
• Contribute to technical reports, research publications, and open-source benchmarks and tooling.
• Partner with research and product teams to translate capability goals into measurable data and evaluation artifacts.
Academic qualifications
• BS, MS, or PhD (or equivalent experience) in Computer Science, Machine Learning, or a related field.
Minimum qualifications
• 2+ years of experience with a clear emphasis on evaluations and/or training-data curation for ML systems (related areas: LLM training/fine-tuning, RL, or distributed ML systems).
• Strong Python and PyTorch development experience.
• Demonstrated experience designing and deep-diving into evaluations, or curating and generating training datasets (ideally both).
• Hands-on experience using LLM agents in your personal or professional work.
• A habit of reading through raw data and trajectories to understand them and spot issues, and an instinct to distrust a metric until it's validated.
Preferred qualifications
• Experience with reinforcement learning, reward design, or RL environment construction for LLMs.
• Background in statistics and experimental design: a feel for signal-to-noise, statistical power, and contamination in evaluations.
• Experience with large-scale dataset sourcing, curation, and processing, including working with external vendors or domain experts.
• Strong knowledge of the literature on agent evaluation, RL, LLM reasoning, and tool use.
• Experience building or operating data pipelines and evaluation infrastructure reliable at scale (e.g., PyTorch, Ray).
• Experience evaluating or generating data for software-engineering or computer-use agents.
• Contributions to published research, public benchmarks, and/or open-source ML software.
• Stand up a new agentic benchmark from scratch: define the task, build the dataset and scoring, validate against known signals, and ship a view that makes the result legible to researchers and leadership.
• Build an RL environment for a new high-value capability: design the reward, generate and QA the trajectory data, and measure the lift on model performance.
• Diagnose a mid-training regression: an eval suite returns anomalous numbers and you determine whether it's the model, the harness, the data, or the infrastructure.
• Partner with an external data vendor or domain expert to source high-quality trajectories, then build the QA framework that keeps reward hacking and contamination out.
• Take a flaky distributed eval pipeline and make it reliable: better retries, better observability, faster feedback to researchers.
Tech stack
LLM, Python, PyTorch
About Institute of Foundation Models
Institute of Foundation Models is hiring for the research scientist, agentic data & benchmarking role. NewJob aggregates active openings directly from Institute of Foundation Models's applicant tracking system, so this listing is current.