About this role
About Coupang
We exist to wow our customers. We know we’re doing the right thing when we hear our customers say, “How did we ever live without Coupang?” Born out of an obsession to make shopping, eating, and living easier than ever, we’re collectively disrupting the multi-billion-dollar e-commerce industry from the ground up. We are one of the fastest-growing e-commerce companies and have established an unparalleled reputation as a dominant and reliable force in South Korean commerce.
We are proud to have the best of both worlds: a startup culture with the resources of a large global public company. This fuels us to continue our growth and launch new services at the speed we have maintained since our inception. We are all entrepreneurs, surrounded by opportunities to drive new initiatives and innovations. At our core, we are bold and ambitious people who like to get our hands dirty and make a hands-on impact. At Coupang, you will see yourself, your colleagues, your team, and the company grow every day.
Our mission to build the future of commerce is real. We push the boundaries of what’s possible to solve problems and break traditional tradeoffs. Join Coupang now to create an epic experience in this always-on, high-tech, and hyper-connected world.
Role Overview
We are seeking a Sr. Staff Observability Engineer to lead the design and evolution of our observability platform for a GPU-as-a-Service (GPUaaS) infrastructure. This role will own the end-to-end telemetry strategy—from high-throughput metric ingestion to log pipelines and real-time visualization—powering deep insights into GPU clusters, datacenter systems, and distributed workloads.
You will architect and operate planet-scale telemetry pipelines leveraging Grafana Alloy, Mimir, Loki, and Vector, ensuring high-fidelity observability across GPU workloads, Kubernetes clusters, and datacenter infrastructure.
Key Responsibilities
<Architectural Leadership & Strategy>
• End-to-End Observability Platform Ownership: Design and scale telemetry pipelines using:
• Grafana Alloy for metrics collection (Prometheus-compatible pipelines)
• Datadog Vector for high-throughput log ingestion and transformation
• Grafana Mimir for scalable time-series storage
• Grafana Loki for log aggregation and querying
• Strategic Roadmap: Define the multi-year vision for GPU infrastructure observability, transitioning from reactive monitoring to SLO-driven, predictive, and automated observability.
• High-Cardinality Telemetry Design: Optimize pipelines for GPU workloads characterized by:
• High-cardinality labels (GPU IDs, tenants, workloads)
• Burst-heavy workloads (ML training, inference spikes)
• Multi-tenant isolation requirements
<Telemetry Pipeline Engineering>
• Architect low-latency, high-throughput pipelines capable of ingesting:
• GPU metrics (utilization, memory, thermals, MIG partitions)
• Kubernetes and container telemetry
• Distributed system logs and traces
• Build and optimize metric pipelines (Alloy → Mimir) ensuring:
• Efficient remote_write tuning
• Cost-effective retention strategies
• Horizontal scalability and compaction tuning
• Design log pipelines (Vector → Loki) with:
• Structured logging and enrichment
• Intelligent filtering/sampling
• Stream partitioning for high-ingest environments
<GPU & Infrastructure Observability>
• Establish deep observability into:
• GPU hardware (NVIDIA DCGM, MIG, NVLink, PCIe)
• Kubernetes GPU operators and scheduling behavior
• Network fabric (RDMA, InfiniBand, TCP performance)
• Define GPU-specific SLIs/SLOs such as:
• GPU utilization efficiency
• Job scheduling latency
• Cluster fragmentation
• Thermal and power anomalies
<Visualization & User Experience>
• Build rich Grafana dashboards for:
• Real-time GPU fleet health
• Tenant-level usage and billing insights
• Capacity planning and forecasting
• Standardize dashboard frameworks and reusable panels across teams
• Enable self-service observability for platform and ML engineering teams
<SRE, Automation & Reliability>
• Drive adoption of SRE principles:
• SLIs, SLOs, error budgets tailored to GPU workloads
• Integrate observability into CI/CD and IaC pipelines (Terraform/Kubernetes):
• Automated canary analysis
• Observability-driven rollbacks
• Build automation (Go/Python) for:
• Pipeline health monitoring
• Dynamic routing and scaling of telemetry workloads
<Incident Forensics & Debugging>
• Develop tooling and practices for cross-layer correlation:
• GPU → Node → Kubernetes → Application → Network
• Lead deep RCA efforts for:
• GPU contention issues
• Performance degradation in ML workloads
• Telemetry pipeline backpressure/failures
• Enable “needle-in-a-haystack” debugging using unified logs + metrics
<Technical Leadership & Collaboration>
• Mentor engineers and lead design reviews for observability systems
• Act as a force multiplier across SRE, Infra, and ML platform teams
• Promote Observability-by-Design in all new GPU cluster deployments
<Open Source & Ecosystem Strategy>
• Drive adoption and contribution to:
• Grafana stack (Alloy, Mimir, Loki, Tempo)
• OpenTelemetry ecosystem
• Define build vs. buy decisions (Datadog vs OSS vs hybrid approaches)
• Optimize interoperability between Vector and OTEL pipelines
<Security & Compliance>
• Architect secure telemetry pipelines with:
• Encryption in transit and at rest
• Multi-tenant isolation and RBAC
• Data residency compliance
• Implement Zero Trust observability patterns
Qualifications & Requirements
• BS/MS in Computer Science or equivalent practical experience
• Extensive experience in Observability, SRE, or Distributed Infrastructure
• Proven track record building large-scale telemetry pipelines (metrics/logs)
• Observability Stack:
• Grafana Alloy / Prometheus ecosystem
• Grafana Mimir (or Cortex/Thanos)
• Grafana Loki
• Datadog Vector (or similar log pipelines)
• Programming:
• Strong in Go or Python
• Data Systems:
• TSDBs and log storage at scale
• Infrastructure:
• Kubernetes, Linux internals
• GPU systems (NVIDIA DCGM, CUDA ecosystem)
• High-performance networking (RDMA, InfiniBand preferred)
• Cloud & Hybrid:
• Experience building observability across:
• Bare-metal GPU clusters
• Hybrid cloud environments
Core Impact
Success in this role is measured by:
• A highly reliable, scalable observability platform powering GPU infrastructure
• Ability to diagnose complex GPU and distributed system issues in minutes
• Enabling data-driven optimization of GPU utilization and cost efficiency
• Building systems that proactively detect and mitigate failures before user impact
Recruitment Process and Others
Recruitment Process
• Application Review - 1st Virtual Interview - 2nd Virtual Interview - Offer
• The exact nature of the recruitment process may vary according to the specific job and may be changed due to scheduling or other circumstances.
• Applicants will be informed of interview schedules and results via the e-mail address submitted at the application stage.
Details to Consider
• This job posting may be closed prior to the stated end date for application if all openings are filled.
• Coupang has the right to rescind an offer of employment if a candidate is found to have submitted false information as part of the application process.
• Those eligible for employment protection (recipients of veterans’ benefits, persons with disabilities, etc.) may receive preferential treatment for employment in accordance with applicable laws.
• We are proud to offer equal opportunities for all applicants.
• Hiring may be restricted if the legal qualifications required for hiring and work performance are not met.
• This is a full-time regular position and includes a 12-week probation period; however, the probationary period may be skipped, shortened, or extended if necessary for business purposes.
Privacy Notice
• Your personal information will be collected and managed by Coupang as stated in the Application Privacy Notice located below.
• https://privacy.coupang.com/en/land/jobs/
Document Return Policy
• This notification is given pursuant to Article 11 (6) of the Fair Hiring Procedure Act.
• A job applicant, who has applied but not been finally selected for a position at Coupang (the “Company”), may request the Company to return his/her hiring documents submitted pursuant to the Fair Hiring Procedure Act. However, this will not apply where the hiring documents were submitted via the website of the Company or e-mail, or where the job applicant submitted those documents voluntarily without a request from the Company. In addition, if the hiring documents were destroyed due to a natural disaster or any other reasons not attributable to the Company, such documents will be deemed to have been returned to the job applicant.
• A job applicant who wishes to request the return of his/her hiring documents pursuant to the main sentence of paragraph 2 above should fill out a “Request for Return of Hiring Documents” [Annex Form No. 3 in the Enforcement Rule of the Fair Hiring Procedure Act] and submit it by email ([email protected]). In such a case, within fourteen (14) days from the date of identifying the receipt of the request, the Company will send the hiring documents to the job applicant’s designated address via registered mail. Please be informed that the job applicant is required to pay the postage on the registered mail.
• In preparation for a job applicant’s request for the return of hiring documents pursuant to the main sentence of paragraph 2 above, the Company shall retain the original hiring documents submitted by the job applicant for 180 days from the completion of the recruiting process. If no request is made by the end of this period, all his/her hiring documents will be destroyed immediately in accordance with the Personal Information Protection Act.
• The above paragraphs 1 - 4 shall only apply when the labor-related laws of Korea govern the application. They are otherwise not applicable.
Tech stack
Kubernetes, Terraform, Python