Senior Site Reliability Engineer

Microsoft

Software Engineering

Hyderabad, Telangana, India

Posted on Jan 23, 2026

Overview

The Windows and Devices mission is to create innovative, trusted, and open products focused on people, showcasing Microsoft’s best and empowering everyone to achieve more. Microsoft Devices designs and manufactures premium hardware like Surface and Xbox, innovating throughout the supply and manufacturing process. Within the Windows and Devices group, Microsoft Devices Operations manages supply chains, product engineering, manufacturing, and services to deliver iconic products.

We are looking to hire a Senior Site Reliability Engineer to join our team to develop and operate next-generation, world-class services that put these iconic products into consumers’ hands. You will be instrumental in moving Surface and Xbox devices fresh from the factory floor, through global transit networks, and ultimately onto our customers’ doorsteps, fulfilling their excitement and anticipation.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Responsibilities

Independently designs, creates, tests, and deploys changes through a safe deployment process (SDP) to enhance code quality and improve the observability, security, reliability, and operability of platforms, systems, and products at scale. Reviews the effect of these changes to produce artifacts that result in insights for their team and across the company. Provides design and quality mentorship to less experienced engineers.
Leverages technical expertise in the infrastructure of cloud technologies and specific products, as well as objective insights drawn from analyses of production telemetry data, to advocate for, or directly contribute to, automation that improves the availability, security, quality, observability, reliability, efficiency, and performance of related sets of products developed and supported by teams within an organization.
Leverages end-to-end technical expertise and telemetry analysis alongside advanced artificial intelligence (AI) and machine learning (ML) algorithms to identify patterns and opportunities to implement configuration and data changes for related sets of platforms, systems, or products in production using code, tooling, and automation. Identifies cases where teams lack the tools and/or capability to manage platforms, systems, or products using code and drives efforts within an organization to expand capabilities and/or tooling accordingly, integrating modern solutions to enhance operational efficiency and predictive maintenance.
Shares insights and best practices via documented artifacts that can be applied to improve development and operations across related sets of systems, platforms, and/or products. Continues to develop their understanding of insights and best practices through interactions with more experienced SREs and members of product engineering teams. Mentors and coaches less experienced engineers to help them identify and propose relevant solutions.
Writes code, scripts, systems, and/or artificial intelligence (AI)/machine learning (ML) platforms to automate operations tasks (e.g., monitoring, alerting, deploying products and updates, debugging) at scale. Assesses current automation code and scripts to evaluate reusability, extensibility, and scalability within the organization.
Develops, maintains, and implements capacity planning models and monitoring tools to forecast product capacity, related security risk, and resource demands. Models the predicted effect of changes to capacity plans to optimize code bases to better manage resources in response to dynamic capacity demands and threats. Contributes to the development of automated resource utilization tools or processes, and/or artificial intelligence (AI) and machine learning (ML) algorithms, that can dynamically scale compute resources up or down to adjust to capacity demands or threat profiles.
Handles incidents during on-call shifts by assessing impact, troubleshooting complex problems, taking appropriate action to mitigate impact, and heading investigations to address root cause(s). Notifies product teams, owners, and leadership about issues with major customer/business impact, escalating complex and ambiguous problems to other engineering teams or experts as needed. Drives incident mitigation efforts during a crisis. Provides command and control. Communicates incident details and resolutions through post-mortem reports and review meetings. Conducts post-incident reviews without blame to promote a culture of inclusion, learning, and continuous improvement. Engages with customers post-crisis on confidence calls and drives service improvement plans to restore trust.
Leverages existing tools and automation, including the safe deployment process (SDP), to enable product engineering teams within their organization to increase the velocity with which they can reliably and safely implement changes in production. Monitors the effects of changes across platforms or systems.
Draws insights from performance and resource monitoring across products and services within their organization to identify whether there is a need to optimize algorithms, security, infrastructure, or architecture, or if changes to compute resources are required. Leverages artificial intelligence (AI) and machine learning (ML) algorithms to analyze vast amounts of data quickly and accurately. Uses advanced models, incorporating modern techniques, to forecast and verify the efficacy of changes at scale; considers the evolving security risk landscape, adopting appropriate mitigations to reduce risk; and proposes solutions that are aligned with customer/business needs. Reviews the system as a whole, proposes changes, and leads implementation of solutions to identified performance and resource challenges. Continually monitors systems performance and looks for opportunities for enhancements to drive Cost of Goods Sold (COGS) reductions.
Analyzes data from telemetry pipelines and monitoring tools that detail operations metrics (e.g., availability, reliability, performance, efficiency) of systems, platforms, or products operating at scale. Contributes to the development of new tooling and/or predictive models to identify and test potential improvements in product development and/or operations, and monitors the impact of changes on operations metrics (e.g., Time-to-X) within an organization. Analyzes signals from customer crises and drives product quality improvements.
Develops end-to-end technical expertise in the architecture, code, features, and operations of specific products as required to implement improvements in product availability, security, quality, observability, reliability, efficiency, and/or performance. Drives code/design reviews with the engineering teams that develop and/or manage those products and shares learnings and recommendations across engineering teams working on related products within their organization and other organizations as relevant.
Demonstrates end-to-end expertise in distributed systems design, interactions between cloud technology layers and components, functions of physical network devices, and dependencies at scale. Drives efforts within an organization to identify and recommend optimal configurations of cloud technology solutions and develops or modifies the code base that defines infrastructures to improve the security, quality, reliability, and operability of supported products.
Researches and maintains deep knowledge of industry trends as well as advances in cloud technologies. Identifies opportunities to create, implement, and/or optimally utilize new tools, technologies, artificial intelligence (AI)/machine learning (ML), and/or processes to solve ambiguous problems and improve product availability, security, quality, observability, reliability, efficiency, and/or performance. Drives the adoption of new solutions across engineering teams working with related products within an organization and provides guidance and coaching to others on relevant topics.

Qualifications

Required Qualifications:

Master's Degree in Computer Science, Information Technology, or related field AND 6+ years technical experience in software engineering, network engineering, or systems administration
- OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 8+ years technical experience in software engineering, network engineering, or systems administration
- OR equivalent experience.
3+ years technical experience working with large-scale cloud or distributed systems.

Other Requirements: Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include but are not limited to the following specialized security screenings:

Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Preferred Qualifications:

Experience running and operating online live site services, including DRI rotation and incident management

Experience using AI tools to rapidly analyze large volumes of service telemetry
Doctorate Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
- OR Master's Degree in Computer Science, Information Technology, or related field AND 6+ years technical experience in software engineering, network engineering, or systems administration
- OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 8+ years technical experience in software engineering, network engineering, or systems administration
- OR equivalent experience.

#W+DJOBS

This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.
