Software Engineering III - AMDP
JPMorganChase
Software Engineering
London, UK
Join us to shape the future of AI/ML data platforms and make a real impact on how we deliver market-leading solutions. You will collaborate with talented colleagues, solve complex challenges, and help drive strategic change across our organization. At JPMorganChase, you’ll find opportunities for growth, mentorship, and the chance to work with cutting-edge technologies. Your contributions will help us deliver resilient and innovative data solutions that power our business.
As a Site Reliability Engineer in the AI/ML Data Platforms team, you will play a key role in building and supporting scalable, resilient data solutions. You will perform root cause analysis, implement production changes, and collaborate with cross-functional teams to drive improvements. You will also mentor team members and partner with colleagues across our global network. Your work will directly impact the reliability and performance of our AI/ML platforms.
Job Responsibilities:
- Develop and support AI/ML solutions for troubleshooting and incident resolution
- Coordinate incident management coverage to ensure effective resolution of application issues
- Collaborate with cross-functional teams to perform root cause analysis and implement production changes
- Apply expertise in application development and support using technologies such as Databricks, Snowflake, AWS, and Kubernetes
- Mentor and guide team members to drive strategic change
- Build tools to automate repeated tasks and reduce operational toil
- Ensure compliance with risk controls and company standards
- Contribute to system design, resiliency, testing, operational stability, and disaster recovery
- Foster a collaborative team environment to achieve common goals
Required Qualifications, Capabilities, and Skills:
- Proficiency in site reliability culture and principles, with experience implementing them within applications or platforms
- Skilled in running production incident calls and managing incident resolution
- Experience with observability, including monitoring, alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, or Splunk
- Strong understanding of SLIs, SLOs, SLAs, and error budgets
- Proficiency in Python or PySpark for AI/ML modeling
- Ability to automate tasks and reduce toil through tool development
- Hands-on experience in system design, resiliency, testing, operational stability, and disaster recovery
- Awareness of risk controls and compliance with organizational standards
- Ability to work collaboratively and build meaningful relationships
Preferred Qualifications, Capabilities, and Skills:
- Experience in an SRE or production support role with AWS Cloud, Databricks, Snowflake, or similar technologies
- AWS and Databricks certifications
J.P. Morgan is a global leader in financial services, providing strategic advice and products to the world’s most prominent corporations, governments, wealthy individuals and institutional investors. Our first-class business in a first-class way approach to serving clients drives everything we do. We strive to build trusted, long-term partnerships to help our clients achieve their business objectives.
J.P. Morgan’s Commercial & Investment Bank is a global leader across banking, markets, securities services and payments. Corporations, governments and institutions throughout the world entrust us with their business in more than 100 countries. The Commercial & Investment Bank provides strategic advice, raises capital, manages risk and extends liquidity in markets around the world.
Drive innovation and reliability by building and supporting scalable AI/ML data solutions that empower our global teams.