Site Reliability Engineer II, SRE
Microsoft
Redmond, Washington, United States
Overview
What is a Site Reliability Engineer (SRE)? SRE is what you get when you treat operations as if it were a software engineering problem. Our mission is to improve the availability, latency, performance, and security of the Microsoft Teams services. Like traditional operations, we keep important revenue-critical systems up and running, even when natural disasters, bandwidth outages, and configuration problems occur. Unlike traditional operations groups, we identify and address these problems directly through software improvements, innovative technologies, and systems automation.
As a Site Reliability Engineer II in Teams, you will provide leadership, direction, and accountability for networking, infrastructure design, end-to-end implementation, and security for Teams services. Strong collaboration skills are required, as you will work closely with other engineering teams to ensure that services and systems are highly stable and performant and meet the expectations of internal stakeholders and external customers and users. This opportunity will allow you to learn what it takes to deploy and run software as a 24x7 enterprise-grade cloud service, hone your security expertise, and become an expert in web services optimization.
Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Qualifications
Required Qualifications:
- Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) of technical experience in software engineering, network engineering, or systems administration, OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years of technical experience in software engineering, network engineering, or systems administration, OR equivalent experience.
- Fundamental understanding of TCP/IP concepts, load balancing, CDNs, ACLs, routing, and TLS, and the ability to analyze IP networks and troubleshoot performance and application issues using standard tools.
- Fundamental understanding of security practices for native applications, web applications, distributed and database systems.
- Understanding of security issues for large scale cloud services and network infrastructures.
- Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
Preferred Qualifications:
- 2+ years of technical experience running large-scale services on Linux.
- 3+ years of experience in scripting languages such as Bash, Python, and PowerShell, or compiled languages such as C#.
- Demonstrated solid working knowledge of cloud computing / Azure / AAD.
- Experience with Docker and Kubernetes.
Site Reliability Engineering IC3 - The typical base pay range for this role across the U.S. is USD $100,600 - $199,000 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $131,400 - $215,400 per year.
Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay
Microsoft will accept applications for the role until September 5, 2025.
Responsibilities
- Design, write and deliver software to improve the availability, scalability, latency, and efficiency of Microsoft’s Identity services.
- Help define the next generation of Teams services infrastructure and routing design and drive its implementation.
- Troubleshoot complex infrastructure and network issues and proactively implement methods to reduce the recurrence and impact of future incidents.
- Develop code, scripts, systems, or platforms that automate complex operations processes (e.g., monitoring, alerting, routing, debugging) at scale.
- Identify security issues and recommend potential mitigation strategies to address underlying causes.
- Develop security guidance and models to address issues and to contribute to the definition of best practices.
- Suggest and drive appropriate guidance, models, response, and remediation for issues.
- Participate in regular on-call rotations and share details related to incidents and their resolution through post-mortem reports and regular review meetings.