Type: Full-time

Experience: 8+ years

About this role

At Shakudo, we're building the world's first operating system for data and AI. We use the term "operating system" in the truest sense: just like iOS, Windows, or Linux, Shakudo's end-to-end OS provides ever-evolving, fully automated, best-in-class open-source components tailored to each business's unique needs.

We are seeking an Infrastructure Engineer to join our Business Automation team to own and operate the internal systems, infrastructure, and AI Gateway product that power Shakudo at scale. This is a hands-on role for someone who thrives on keeping production systems reliable, secure, and fast. You will be responsible for everything from physical servers and DGX machines to CI/CD pipelines and customer-facing AI Gateway infrastructure. You will also contribute directly to product hardening, security, and DevOps practices across the platform.

At Shakudo, our culture is proactive, collaborative, and supportive — we succeed together by building strong partnerships and solving complex challenges. We expect high ownership: you will be hands-on, driving outcomes directly rather than delegating or waiting for direction. Individual contribution matters here — your work will have a visible, measurable impact on the company's operations and product.

Responsibilities

Maintain and operate internal services for the rest of Shakudo's employees, including proprietary applications for sales and ETL pipelines

Maintain and operate DGX machines that host LLMs for the team's use

Maintain and operate Shakudo's product for Shakudo's internal use, and contribute to product hardening, security, and DevOps practices

Maintain and operate physical servers for Kubernetes clusters and ensure uptime

Create CI/CD pipelines for internal deployments

Maintain and operate the AI Gateway product for customers, ensure uptime, and contribute to the product roadmap

Qualifications

8+ years of experience across software, data, platform, or AI engineering roles

5+ years of strong experience with Kubernetes cluster operations, DevOps, and bare-metal server operations

Experience operating production infrastructure at scale, including physical servers, GPU clusters, and CI/CD systems

Strong background in security hardening, observability, and reliability engineering

Proficiency in Rust is preferred

Experience with AI/ML infrastructure, including LLM hosting and inference serving, is preferred

Work with cutting-edge technologies in machine learning and high-performance computing. Contribute to a platform that transforms how organizations leverage data and AI. Join a dynamic team that values innovation, efficiency, and diversity.

Shakudo offers a high-impact package: competitive salary, meaningful equity so you share in the upside of transformational technology, and comprehensive health benefits. We also provide a flexible vacation policy, because building that technology requires supporting the people who build it. More importantly, you'll work on technology that matters.

This role is based onsite in Toronto to support the high security requirements of our clients and enable effective collaboration. We have a welcoming office environment with a very focused and passionate team, doing meaningful, impactful work together.

Shakudo is an equal opportunity employer. We foster diversity and inclusivity and welcome applications from candidates of all backgrounds and experiences.

Tech stack

Kubernetes, Rust, LLM
