Vice President, Site Reliability Engineer Lead, Application Production Services & Engineering

Bank of America

Software Engineering

United States · Singapore · Remote

Posted on May 21, 2026

Job Description:

At Bank of America, we are guided by a common purpose to help make financial lives better through the power of every connection. We do this by driving Responsible Growth and delivering for our clients, teammates, communities and shareholders every day.

Being a Great Place to Work and providing a culture of caring is core to how we drive Responsible Growth. We are intentional about fostering an inclusive workplace where every teammate has the opportunity to succeed, build a career and contribute to our shared success. This includes attracting and developing exceptional talent, recognizing and rewarding performance, and supporting our teammates’ physical, emotional, and financial wellness through affordable, competitive and flexible benefits.

We value the unique perspectives individuals bring from all backgrounds and career paths - whether shaped by military service, community college education, or a wide range of work and life experiences. These journeys foster resilience, leadership and innovation, strengthening our workforce and positively impacting the communities we serve.

Bank of America is committed to an in-office culture that supports collaboration, engagement, and career development. Our approach includes clear in-office expectations, while providing an appropriate level of flexibility based on role-specific responsibilities and business needs.

At Bank of America, you can build a successful career with opportunities to learn, grow, and make an impact. Join us!

This job is responsible for partnering with engineering and technology teams to implement measures prescribed by the Site Reliability Engineer teams it leads. Key responsibilities include ensuring appropriate instrumentation, tooling, ticketing, alerting, and on-call routines are in place for key services, demonstrating technical expertise within domains, and decomposing objectives into work units. Job expectations include advancing efficient solution delivery practices and promoting exceptional design, engineering, and organizational practices.

Responsibilities:

Collaborate with Development and Infrastructure teams to understand technical solutions and implement the monitoring capabilities outlined in the application and system monitoring designs put forward by the Senior Site Reliability Engineer (SRE)
Develop and maintain reliability scripts, tools, and libraries, and use them for common instrumentation, automation, and operational needs and when mentoring SRE resources on reliability practices and established tools and capabilities
Partner to implement code changes that make use of common reliability libraries and tools, and help Application Production Services and Application Development teammates understand how to use them
Participate regularly in architecture community of practice meetings and in communication via other channels
Identify vulnerabilities and opportunities for reliability improvement, such as investigating low-level error rates and 'noise' in monitoring, and define solutions to reduce manual support effort and/or improve system reliability
Engage as a subject matter expert in major incident triage efforts and failure scenario modelling, and work with the Problem Manager to diagnose root causes for major incident/problem management investigations
Define and maintain a multi-year stability roadmap aligned with business objectives and technology strategy
Identify critical dependencies, risks, and mitigation strategies across infrastructure, applications, and services
Work with the architects to develop and adhere to the enterprise architectural patterns and frameworks that enhance system reliability and fault tolerance
Ensure designs adhere to best practices for high availability, disaster recovery, and performance optimization
Establish stability metrics, KPIs, and compliance standards for technology teams
Drive adoption of reliability engineering principles across development and operations
Partner with engineering, operations, and product teams to embed stability into the software development lifecycle
Act as a trusted advisor to senior leadership on stability-related initiatives and investments
Monitor emerging technologies and industry trends to enhance stability strategies
Lead post-incident reviews and ensure lessons learned are incorporated into future designs
Develop and maintain a catalog of extensible reliability scripts, tools, and libraries that can be leveraged for common instrumentation, automation and operational needs
Partner with infrastructure engineers and application teams to implement the necessary code changes to make use of common reliability libraries and tools, and help Application Production Services (APS) and Application Development teammates understand how to use them

Required Skills:

8+ years in technology architecture, reliability engineering, or infrastructure strategy roles
Proven track record of delivering stability-focused initiatives in large-scale environments
Strong knowledge of distributed systems, cloud architecture (AWS, Azure, GCP), and microservices
Experience with reliability engineering, chaos testing, and observability tools
Ability to influence cross-functional teams and communicate complex concepts to non-technical stakeholders

Desired Skills:

SRE Certification
Automation
Collaboration
Influence
Production Support
Result Orientation
Analytical Thinking
Application Development
Architecture
Solution Design
Stakeholder Management
Adaptability
DevOps Practices
Project Management
Risk Management
Solution Delivery Process
