Salary: $200K–$250K
Type: Full-time
Experience: 5+ yr
Source: Ashby

About this role

Agency Notice: We are not currently working with recruiting agencies for this role. Please do not contact Vizcom employees regarding this position. Any resumes submitted without a prior agreement will be considered unsolicited.

About Vizcom

Vizcom is a visual creation platform that combines modern web tooling with AI-powered workflows. Our stack includes a React/TypeScript frontend, Node/Koa + PostGraphile API services, PostgreSQL, Redis, BullMQ queues, and Kubernetes-based production infrastructure.

We’re hiring a senior owner of stability and infrastructure to ensure the platform is reliable, fast, and resilient as we scale.

Role Mission

Own service reliability end-to-end: prevent incidents, reduce blast radius when failures happen, and lead fast, high-quality recovery when production degrades.

This is a hands-on technical leadership role with authority to set reliability standards and enforce production guardrails.

Compensation

$200,000 – $250,000 base salary + meaningful equity

What You’ll Own

- Reliability bar: Set and enforce SLIs/SLOs/error budgets for critical user flows.

- Production architecture resilience: Drive failure isolation across API, workers, queues, and dependencies so one subsystem cannot take down core access.

- Kubernetes runtime reliability: Define probe contracts, rollout/rollback standards, graceful shutdown behavior, scaling/resource policies, and startup safety.

- Queue + job safety (BullMQ/Redis): Own poison pill containment and workload isolation.

- Incident command quality: Lead Sev1/Sev2 response end-to-end (containment, communications, technical direction, RCA, corrective action execution).

- Reliability operating system: Own observability quality (signals over noise), on-call effectiveness, runbooks, and postmortem discipline.

- Release safety authority: Gate risky deploys and enforce reliability guardrails when production health is at risk.

Traits We’re Looking For

- Calm, structured incident commander under pressure.

- Thinks in failure modes and blast radius by default.

- Pragmatic: can stabilize quickly, then implement durable fixes.

- High ownership and strong written communication.

First 90 Days

- Establish baseline reliability metrics and identify top platform risks.

- Tighten incident response mechanics (roles, comms cadence, runbooks, status updates).

- Deliver high-impact hardening fixes across probes/startup paths/queue safety.

- Publish a prioritized 6–12 month reliability roadmap with clear ownership and milestones.

If possible, please write up one incident you personally led and send it to [email protected], covering:

1) what failed,

2) how you contained it,

3) what permanent fixes you shipped and how you measured their impact.

Tech stack

React, TypeScript, PostgreSQL, Redis, Kubernetes

Senior Platform & Reliability Engineer