AI Infra Site Reliability Engineer - AI Infrastructure

Hamilton Barnes Associates Limited

Job Description

Ready to architect AI infrastructure that powers next-generation research and cloud platforms?

Join a seed-stage AI infrastructure company building large-scale training and inference platforms previously accessible only to hyperscalers. The business began with a single managed GPU cluster that reached capacity almost immediately and has since expanded into a global platform spanning infrastructure, networking, and orchestration.

Build resilient, scalable AI platforms that empower startups and innovation. Apply today!

Key Responsibilities
  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimise resource scheduling, GPU utilisation, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.
Requirements
  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.
Benefits
  • IPO Equity
Salary
  • $200,000 gross per year

About This Role

Career insights for Software Developers positions

Salary Benchmark
$132,270/year
Source: O*NET (USD)
Job Outlook
This career will grow rapidly in the next few years and will have large numbers of openings.
Common Technologies
Apache Kafka, Apache Maven, Jakarta EE, Airtable, Apache Hive, Blackboard Learn, Apache Spark, jQuery

Job Overview

Date Posted
28 Mar 2026
Location
Not Specified, Singapore

This page incorporates data from O*NET OnLine, courtesy of the U.S. Department of Labor, Employment and Training Administration (USDOL/ETA), under the CC BY 4.0 license. O*NET is a registered trademark of USDOL/ETA. Assessify has adapted and modified the original content. Please note that USDOL/ETA has neither reviewed nor endorsed these changes.