Senior Lead Site Reliability Engineer
JPMorganChase
Software Engineering
Glasgow, UK
Be an integral part of an agile team that's constantly pushing the envelope to enhance, build, and deliver top-notch reliability and observability for our most critical platforms.
As a Senior Lead Site Reliability Engineer at JPMorganChase within the Commercial & Investment Bank, you are an integral part of an agile team that works to enhance, build, and deliver trusted market-leading technology products in a secure, stable, and scalable way. Drive significant business impact through your capabilities and contributions, and apply deep technical expertise and problem-solving methodologies to tackle a diverse array of reliability, observability, and performance challenges that span multiple technologies and applications.
Job responsibilities
- Regularly provides technical guidance and direction on site reliability practices to support the business and its technical teams, contractors, and vendors
- Develops secure and high-quality production code for reliability tooling and telemetry pipelines, and reviews and debugs code written by others
- Drives decisions that influence reliability design, observability architecture, application functionality, and technical operations and processes
- Serves as a function-wide subject matter expert in one or more areas of site reliability, observability, or telemetry engineering
- Leads resiliency design reviews and breaks down complex reliability problems into digestible work for other engineers, acting as a technical lead for large products
- Acts as the main point of contact during major incidents, demonstrating the skills to identify and solve issues quickly to avoid financial losses, and champions blameless postmortem culture
- Collaborates with team members and stakeholders to define comprehensive service level indicators, service level objectives, and error budgets
- Designs, implements, and maintains operational reliability for large-scale OpenTelemetry pipelines on hybrid on-prem/cloud environments, supporting telemetry ingestion, processing, and export to backends such as InfluxDB, Prometheus, Elasticsearch, and OpenSearch
- Drives the assessment, refactoring, and incremental migration of custom legacy telemetry collection code to standardized OpenTelemetry instrumentation, reducing technical debt while maintaining system stability
- Actively contributes to the engineering community as an advocate of firmwide frameworks, tools, and practices, and influences peers and project decision-makers to consider the use and application of leading-edge observability and reliability technologies
- Adds to the team culture of diversity, opportunity, inclusion, and respect
Required qualifications, capabilities, and skills
- Formal training or certification on software engineering concepts and advanced applied experience delivering system design, application development, testing, and operational stability
- Advanced knowledge of reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices, with considerable in-depth knowledge in one or more technical disciplines (e.g., cloud, observability, distributed systems, etc.)
- Advanced proficiency in one or more programming languages (e.g., Java, Python, Go, etc.)
- Advanced proficiency and experience in observability, including white-box and black-box monitoring, SLO alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, Elasticsearch, etc.
- Proficiency in continuous integration and continuous delivery tools (e.g., Jenkins, GitLab, Terraform, etc.)
- Experience with containers and container orchestration (e.g., ECS, Kubernetes, Docker, etc.)
- Hands-on experience with the design, deployment, and operation of OpenTelemetry collectors in production environments, focusing on technical aspects such as configuring, optimizing, and troubleshooting OTLP endpoints and receivers
- Ability to tackle reliability design and functionality problems independently with little to no oversight
- Practical cloud native experience
- Ability to expand and collaborate across different levels and stakeholder groups
Preferred qualifications, capabilities, and skills
- Knowledge of distributed tracing, metrics, and logging best practices
- Certification in AWS, Kubernetes, or relevant technologies
- Proven track record in system health monitoring, capacity management, and blameless postmortems for high-availability services
- Deep understanding of distributed system design principles, networking (TCP/IP, DNS, load balancing), and Linux internals
- Contributions to open-source observability or telemetry projects
- Experience working with agent control planes and management protocols; hands-on knowledge of OpAMP is highly desirable
J.P. Morgan is a global leader in financial services, providing strategic advice and products to the world’s most prominent corporations, governments, wealthy individuals and institutional investors. Our first-class business in a first-class way approach to serving clients drives everything we do. We strive to build trusted, long-term partnerships to help our clients achieve their business objectives.
J.P. Morgan’s Commercial & Investment Bank is a global leader across banking, markets, securities services and payments. Corporations, governments and institutions throughout the world entrust us with their business in more than 100 countries. The Commercial & Investment Bank provides strategic advice, raises capital, manages risk and extends liquidity in markets around the world.