Senior Site Reliability Engineer
Microsoft
Redmond, Washington, United States
Overview
Halo Studios is building the future of the blockbuster Halo series of video games with Unreal Engine. As part of Xbox Game Studios, the Halo franchise encompasses games, novels, comics, licensed collectibles, apparel, and more with a shared vision of heroism, mystery, and wonder. With multiple projects in development, join our team as we forge the next generation of games and experiences in our award-winning sci-fi universe.
You will be a key member of the Halo Studios IT Engineering team, responsible for the studio's data, IT services, and infrastructure. You will contribute to the architecture and design of new on-prem and cloud infrastructure while continuing to drive optimization, performance, security, and reliability with cutting-edge technologies and automation. You will empower artists, developers, and others in our studio by proactively designing technical solutions to maximize their efficiency. Along the way, you will be a trusted voice who shares your knowledge and expertise within our team and other teams in the studio. You will be joining a fast-paced team that constantly provides new opportunities to learn and grow. Roles at our studio are flexible, and you can work from home up to two days a week in this role.
Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Qualifications
Required qualifications:
- 6+ years technical experience in software engineering, network engineering, or systems administration
- OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
- OR Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration.
- 5+ years of experience managing technical infrastructure, including 3+ years of hands-on Linux system administration involving troubleshooting, performance tuning, security configuration, and automation of core OS services.
- 3+ years of experience building and maintaining infrastructure automation using scripting languages (e.g., PowerShell, Python, Bash), container platforms (e.g., Docker, Kubernetes), and infrastructure-as-code tools (e.g., Terraform, Azure Bicep), with a focus on deploying and managing containerized applications and services.
- 5+ years of experience owning and operating production-grade infrastructure systems at scale, including responsibility for reliability, performance tuning, monitoring/observability, and incident response across hybrid or cloud-native environments.
Other Requirements:
- Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
- Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
Preferred qualifications:
- Experience with troubleshooting and designing solutions with core infrastructure technologies including on-prem and/or Azure networking, cloud technologies, Active Directory / Entra ID
- Experience maintaining high-availability version control systems (e.g., Perforce) and CI/CD build infrastructure
- Experience with infrastructure observability, incident response, and capacity planning for cloud and hybrid systems
- Experience with Entra ID authentication (OAuth 2.0, OIDC, SAML) for Azure Resources and App Registrations
Site Reliability Engineering IC4 - The typical base pay range for this role across the U.S. is USD $119,800 - $234,700 per year. A different range applies to specific work locations within the San Francisco Bay Area and New York City metropolitan area, where the base pay range for this role is USD $158,400 - $258,000 per year.
Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay
Microsoft will accept applications for the role until June 26, Year.
Responsibilities
- Architect, implement, and optimize critical hybrid and cloud-based IT infrastructure utilizing Infrastructure as Code (IaC) and container technologies (e.g., Docker, Terraform, AKS) to ensure high availability, scalability, security, and operational efficiency.
- Design, scale, and maintain Perforce, Swarm, and build farm infrastructure used by game development teams and automated build environments, to ensure robust, high-performance workflows across distributed game development environments.
- Design and implement Azure Networking solutions including Site-to-Site tunnels, App Gateway, Private Endpoints/Private Link, DNS, and network security for Azure resources, ensuring secure and reliable connectivity.
- Architect and deliver automation solutions to improve service health, manageability, reliability, telemetry, and alerting.
- Implement data governance, storage, backup, and disaster recovery solutions for a multi-petabyte Azure-based environment, ensuring data integrity, security, and performance.
- Research, evaluate, and integrate emerging tools and methodologies into the technology roadmap, to continuously optimize efficiency, reliability, and scalability.
- Produce and maintain clear and accurate technical documentation and design specifications that align with best practices.
- Collaborate with software engineers, project management, and operations teams to improve and optimize infrastructure and evolve services, ensuring alignment with organizational goals.
- Participate in on-call rotations, lead incident response, and conduct postmortems to identify root causes and implement preventative infrastructure improvements.