Vice President, Site Reliability Engineer Lead, Application Production Services & Engineering
Bank of America
Software Engineering
United States · Singapore · Remote
Job Description:
At Bank of America, we are guided by a common purpose to help make financial lives better through the power of every connection. We do this by driving Responsible Growth and delivering for our clients, teammates, communities and shareholders every day.
Being a Great Place to Work and providing a culture of caring is core to how we drive Responsible Growth. We are intentional about fostering an inclusive workplace where every teammate has the opportunity to succeed, build a career and contribute to our shared success. This includes attracting and developing exceptional talent, recognizing and rewarding performance, and supporting our teammates’ physical, emotional, and financial wellness through affordable, competitive and flexible benefits.
We value the unique perspectives individuals bring from all backgrounds and career paths, whether shaped by military service, community college education, or a wide range of work and life experiences. These journeys foster resilience, leadership and innovation, strengthening our workforce and positively impacting the communities we serve.
Bank of America is committed to an in-office culture that supports collaboration, engagement, and career development. Our approach includes clear in-office expectations, while providing an appropriate level of flexibility based on role-specific responsibilities and business needs.
At Bank of America, you can build a successful career with opportunities to learn, grow, and make an impact. Join us!
This job is responsible for partnering with engineering and technology teams to implement measures prescribed by the Site Reliability Engineer teams it leads. Key responsibilities include ensuring appropriate instrumentation, tooling, ticketing, alerting and on call routines are in place for key services, demonstrating technical expertise within domains, and decomposing objectives into work units. Job expectations include advancing efficient solution delivery practices and promoting exceptional design, engineering, and organizational practices.
Responsibilities:
- Collaborate with Development and Infrastructure teams to understand technical solutions and implement monitoring capabilities outlined in the application and system monitoring designs put forward by the Senior Site Reliability Engineer (SRE)
- Develop and maintain reliability scripts, tools, and libraries; use them for common instrumentation, automation, and operational needs; and draw on them when mentoring SRE resources on reliability practices and established tools and capabilities
- Participate regularly in architecture community-of-practice meetings and communicate through other channels
- Define and maintain a multi-year stability roadmap aligned with business objectives and technology strategy
- Identify critical dependencies, risks, and mitigation strategies across infrastructure, applications, and services
- Work with the architects to develop and adhere to the enterprise architectural patterns and frameworks that enhance system reliability and fault tolerance
- Ensure designs adhere to best practices for high availability, disaster recovery, and performance optimization
- Establish stability metrics, KPIs, and compliance standards for technology teams
- Drive adoption of reliability engineering principles across development and operations
- Partner with engineering, operations, and product teams to embed stability into the software development lifecycle
- Act as a trusted advisor to senior leadership on stability-related initiatives and investments
- Monitor emerging technologies and industry trends to enhance stability strategies
- Lead post-incident reviews and ensure lessons learned are incorporated into future designs
- Develop and maintain a catalog of extensible reliability scripts, tools, and libraries that can be leveraged for common instrumentation, automation and operational needs
- Partner with infrastructure engineers and application teams to implement the code changes needed to make use of common reliability libraries and tools, and help Application Production Services (APS) and Application Development teammates understand how to use them
- Engage as a subject matter expert (SME) in major incident triage and failure scenario modelling, and work with the Problem Manager to diagnose root causes in major incident and problem management investigations
- Identify vulnerabilities and opportunities for reliability improvement, such as investigating low-level error rates and 'noise' in monitoring, and help define solutions that reduce manual support effort and/or improve system reliability
Required Skills:
- 8+ years in technology architecture, reliability engineering, or infrastructure strategy roles
- Proven track record of delivering stability-focused initiatives in large-scale environments
- Strong knowledge of distributed systems, cloud architecture (AWS, Azure, GCP), and microservices
- Experience with reliability engineering, chaos testing, and observability tools
- Ability to influence cross-functional teams and communicate complex concepts to non-technical stakeholders
Desired Skills:
- SRE Certification
- Automation
- Collaboration
- Influence
- Production Support
- Result Orientation
- Analytical Thinking
- Application Development
- Architecture
- Solution Design
- Stakeholder Management
- Adaptability
- DevOps Practices
- Project Management
- Risk Management
- Solution Delivery Process