Manager of Site Reliability Engineer

JPMorganChase

JPMorganChase

Software Engineering

Hyderabad, Telangana, India

Posted on May 15, 2026

Guide and shape the future of technology at a globally recognized firm, driven by pride in ownership.

As a SRE Manager at JPMorgan Chase within the Consumer & Community Banking, you are the non-functional requirement owner and champion for the applications in your remit. You are a key influencer in your team’s strategic planning, driving continual improvement in customer experience, resiliency, security, scalability, monitoring, instrumentation, and automation of the software in your area. You act in a blameless, data-driven manner and navigate difficult situations with composure and tact.


Job Responsibilities:

  • Define and enforce quality gates across requirements, design, secure coding, testing, release, and post-production monitoring, translate business objectives into clear, testable requirements that include reliability, availability, performance, security, and observability.
  • Establish and manage SLOs/SLIs and error budgets; ensure they are integrated into product roadmaps and delivery plans, challenge Product Owners and teams to meet a rigorous, objective Definition of Done before release.
  • Sample DoD checklist: SLOs defined and monitored; alerts tuned; runbooks and escalation paths in place; automated tests (unit, integration, security) passing; performance and capacity validated; resilience and failover tested; rollback verified; vulnerability findings remediated; compliance controls and audit artifacts complete; documentation and support readiness confirmed.
  • Lead operational readiness reviews and triage risks; ensure timely remediation and prevention of recurrence through root-cause analysis and auto-remediation.
  • Maintain logging, alerting, and monitoring platforms; ensure dashboards provide health and performance visibility. Govern CI/CD pipeline controls for security, reliability, and change management; promote automation to eliminate toil.
  • Lead and participate in critical incident response (including outside business hours when needed); drive post-incident reviews and resilience improvements. Monitor delivery health and operational KPIs; lead continuous improvement across teams and products
  • Oversee capacity planning and resilience management for large-scale, distributed systems, Partner with engineering on public cloud best practices (AWS or equivalent) for compute, storage, networking, messaging, automation (CloudFormation, Terraform), and data services.
  • Build a culture of collaboration, reliability, and continuous improvement; coach teams to adopt DevOps and SRE principles. Partner with regional engineering leaders to drive operational best practices and consistent execution. Provide concise, outcome-focused updates to management and stakeholders; influence decisions across Product, Engineering, SRE, and Security.

Required Qualifications, Capabilities, and Skills

  • Formal training or certification with 5+ years supporting critical finance-focused applications in large-scale environments and managing and mentoring teams.
  • Solid understanding of AI-assisted solutions to accelerate root cause analysis and reduce overall TTX with appropriate validation and human judgment
  • Experience with monitoring/logging tools (e.g., Splunk, AppDynamics) and dashboard technologies;
  • Strong grasp of SDLC, secure development, DevOps/CI/CD tooling; capable of implementing top-tier continuous improvement with root-cause analysis and auto-remediation.
  • Effective under pressure; accountable, with excellent stakeholder management and communication skills.
  • This position may require HSA system access. Enhanced screening (criminal and credit background checks, and/or other screening) is required prior to employment and annually thereafter.
  • Global team collaboration with flexibility to engage during critical incidents outside standard business hours
  • Experience implementing and managing SLOs/SLIs, error budgets, and operational readiness reviews for distributed systems, including leading post-incident analysis and resilience improvements.
  • Deep expertise in public cloud platforms (AWS or equivalent), infrastructure automation tools (CloudFormation, Terraform), and capacity planning for large-scale environments, with a track record of driving DevOps and SRE adoption across teams.

Preferred Qualifications

  • Splunk Administrator certification desired.


Lead your team’s daily activities and align the firm’s site reliability priorities to the goals of your team