Site Reliability Engineer III
JPMorganChase
Software Engineering, IT
Hyderabad, Telangana, India
Posted on Apr 15, 2026
There’s nothing more exciting than being at the center of a rapidly growing field in technology and applying your skillsets to drive innovation and modernize the world's most complex and mission-critical systems.
As a Site Reliability Engineer III at JPMorgan Chase within Employee Platforms, you will solve complex and broad business problems with simple and straightforward solutions.
Job responsibilities
- Own L1/L2 production support, participate in on‑call rotations, and drive rapid triage, containment, and recovery for incidents.
- Lead post‑incident reviews and implement preventative actions to eliminate repeat issues and reduce operational risk.
- Define and maintain SLIs/SLOs and error budgets for critical user journeys, integrating them with change guardrails to balance velocity and reliability.
- Implement and standardize metrics, logs, and traces; build actionable dashboards and alerts that improve signal‑to‑noise.
- Tune alert policies to reduce noise and improve MTTD/MTTR, leveraging APM/AIOps to accelerate root‑cause analysis.
- Build and maintain CI/CD pipelines (e.g., Jenkins, GitHub Actions, GitLab CI), manage artifact/versioning, and orchestrate environment promotions.
- Enable pre/post‑deploy checks, canary/blue‑green strategies where feasible, and automated rollback to reduce change failure rate.
- Develop Python‑based automation for self‑healing, runbook execution, health checks, and operational workflows, with tests and code quality gates.
- Apply practical working experience with high-availability (clusters, failover) and networking (latency, load balancing, firewall) concepts.
- Be responsible for the overall Windows infrastructure and for the software implementation and configuration of third-party solutions.
- Learn and apply system processes, methodologies, and skills for the development of secure, stable code and systems.
- Add to the team culture of diversity, opportunity, inclusion, and respect.
Required qualifications, capabilities, and skills
- Formal training or certification on software engineering concepts and 3+ years applied experience
- Hands-on practical experience in system design, application development, testing, and operational stability
- Formal training or certification in site reliability engineering (SRE) concepts, with at least 3 years of applied SRE experience.
- Expertise in observability practices, including white-box and black-box monitoring, SLO alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, and Splunk.
- Strong understanding of site reliability culture and principles, with practical experience implementing SRE within applications or platforms.
- Experience with continuous integration and continuous delivery (CI/CD) tools, including Jenkins, GitLab, and Terraform.
- Familiarity with troubleshooting common networking technologies and issues.
- Demonstrated ability to work collaboratively in large teams, communicate effectively, address roadblocks proactively, and implement innovative solutions while staying current with emerging technologies.
- Ability to run high‑quality incident management: take on‑call, drive fast recovery, conduct post‑incident reviews (PIRs), and prevent repeat issues.
- Experience in developing, debugging, and maintaining code in a large corporate environment with one or more modern programming languages and database querying languages
- Exposure to agile methodologies such as CI/CD, Application Resiliency, and Security
Preferred qualifications, capabilities, and skills
- Excellent debugging and troubleshooting skills
- Hands-on experience with Genetec Security Desk
- Emerging knowledge of software applications and technical processes within a technical discipline (e.g., cloud, artificial intelligence, machine learning, mobile, etc.)
Apply your skillsets to drive innovation and modernize the world's most complex and mission-critical systems