Senior Site Reliability Engineer (SRE)
📍 Singapore
About the Role
We are partnering with a globally recognized digital entertainment and streaming platform that serves millions of users worldwide across web, mobile, and connected devices. The business operates at massive scale across multiple regions and is known for delivering highly available, secure, and performant streaming experiences to a global audience.
As part of their continued growth, they are looking for a Senior Site Reliability Engineer (SRE) to join their Singapore team. This position will play a key role in building scalable infrastructure, improving platform reliability, driving automation, and enabling engineering teams to operate efficiently at scale.
You will work within a high-performing engineering environment focused on cloud-native technologies, distributed systems, reliability engineering, observability, security, and developer productivity.
Key Responsibilities
- Collaborate closely with software engineering teams to design and build scalable platform and infrastructure solutions.
- Drive and champion SRE best practices across engineering teams.
- Lead incident management processes, capacity planning initiatives, and Service Level Objective (SLO) implementations.
- Automate operational and infrastructure workflows using Infrastructure-as-Code principles.
- Build internal tools, automation frameworks, and AI-driven engineering solutions to improve developer productivity and reduce operational toil.
- Continuously improve platform reliability, scalability, performance, and security.
- Define engineering standards, best practices, and operational guidelines across teams.
- Embed security and cost optimization (FinOps) practices into the software development lifecycle.
- Participate in architectural discussions and contribute to long-term platform strategy.
- Mentor junior engineers and provide technical leadership within the organization.
- Participate in on-call rotations to ensure platform reliability and service availability.
Requirements
- Bachelor’s Degree in Computer Science, Engineering, or a related discipline.
- Minimum 5 years of experience in:
  - Site Reliability Engineering (SRE)
  - DevOps Engineering
  - Platform Engineering
- Strong experience in backend engineering and distributed systems.
- Hands-on experience with containerization and orchestration technologies such as:
  - Docker
  - Kubernetes
- Strong cloud infrastructure experience with at least one of the following:
  - Google Cloud Platform (GCP)
  - Amazon Web Services (AWS)
  - Microsoft Azure
- Solid understanding of:
  - Linux / Unix systems
  - Networking fundamentals
  - Scalability and high availability concepts
- Experience managing or supporting large-scale production systems.
- Familiarity with:
  - Infrastructure as Code (IaC)
  - CI/CD pipelines
  - Observability and monitoring tools
  - Performance engineering concepts
- Strong problem-solving skills with the ability to troubleshoot complex technical issues.
- Interest in or exposure to AI/LLM tooling, automation agents, or AI-assisted engineering workflows is highly advantageous.
Preferred Technical Exposure
Exposure to technologies such as:
- Kubernetes (GKE preferred)
- Cloud-native infrastructure
- CI/CD platforms
- PostgreSQL
- Redis
- Datadog
- Spinnaker
- Cloud Build
- Multi-region deployments
- Security engineering / WAF solutions
- FinOps / Cloud cost optimization
What’s on Offer
- Opportunity to work on globally scaled systems serving millions of users.
- Exposure to modern cloud-native and distributed system architectures.
- Strong engineering culture focused on innovation, automation, and reliability.
- Regional and global collaboration opportunities.
- High-impact role with visibility across engineering teams.
- Opportunity to work with emerging AI technologies within engineering operations.
