At Shakudo, we're building the world's first operating system for data and AI. We use the term "operating system" in the truest sense: just like iOS, Windows, or Linux, Shakudo's end-to-end OS provides ever-evolving, fully automated, best-in-class open-source components tailored to each business's unique needs.
We are seeking an Infrastructure Engineer to join our Business Automation team to own and operate the internal systems, infrastructure, and AI Gateway product that power Shakudo at scale. This is a hands-on role for someone who thrives on keeping production systems reliable, secure, and fast. You will be responsible for everything from physical servers and DGX machines to CI/CD pipelines and customer-facing AI Gateway infrastructure. You will also contribute directly to product hardening, security, and DevOps practices across the platform.
At Shakudo, our culture is proactive, collaborative, and supportive — we succeed together by building strong partnerships and solving complex challenges. We expect high ownership: you will be hands-on, driving outcomes directly rather than delegating or waiting for direction. Individual contribution matters here — your work will have a visible, measurable impact on the company's operations and product.
- Maintain and operate internal services used by other Shakudo employees, including proprietary applications for sales and ETL pipelines
- Maintain and operate DGX machines that host LLMs for the team's use
- Maintain and operate Shakudo's product for Shakudo's internal use, and contribute to product hardening, security, and DevOps practices
- Maintain and operate physical servers for Kubernetes clusters and ensure uptime
- Create CI/CD pipelines for internal deployments
- Maintain and operate the AI Gateway product for customers, ensure uptime, and contribute to the product roadmap
- 8+ years of experience across software, data, platform, or AI engineering roles
- 5+ years of strong experience with Kubernetes cluster operations, DevOps, and bare-metal server operations
- Experience operating production infrastructure at scale, including physical servers, GPU clusters, and CI/CD systems
- Strong background in security hardening, observability, and reliability engineering
- Proficiency in Rust is preferred
- Experience with AI/ML infrastructure, including LLM hosting and inference serving, is preferred
Work with cutting-edge technologies in machine learning and high-performance computing. Contribute to a platform that transforms how organizations leverage data and AI. Join a dynamic team that values innovation, efficiency, and diversity.
Shakudo offers a high-impact package: competitive salary, meaningful equity so you share in the upside of what we build, and comprehensive health benefits that have you fully covered. We provide a flexible vacation policy, because building transformational technology requires supporting the people who build it. More importantly, you'll work on technology that matters.
This role is based onsite in Toronto to support the high security requirements of our clients and enable effective collaboration. We have a welcoming office environment with a very focused and passionate team, doing meaningful, impactful work together.
Shakudo is an equal opportunity employer. We foster diversity and inclusivity and welcome applications from candidates of all backgrounds and experiences.