Tech Site Reliability Engineer (AI Model Ops)
UBS
India
Information Technology (IT)
Group Functions
Your role
We are seeking a highly motivated and experienced Site Reliability Engineer to join our growing AI Model Operations team to help build, maintain, and monitor a portfolio of products for AI operations in UBS.
As an SRE, you will play a crucial role in ensuring the reliability, performance, and scalability of our production systems. You will find opportunities to improve our production reliability, applying software engineering principles to infrastructure and operations problems.
• own the reliability of the central AI model and agent registry, deployment pipelines, AI SecOps products, and other products in our portfolio
• ensure the quality, security, reliability, and compliance of solutions by applying SRE best practices
• own incident management and root cause analysis, and implement preventative measures
• support capacity, disaster recovery planning, and cost management
• collect and analyze operational data and identify SLIs from key metrics to define achievable SLOs for the product set
• collaborate with data scientists and other stakeholders to collect feedback and incorporate it into solutions
• automate processes leveraging predictive monitoring, auto-scaling, or self-healing
• apply performance analysis, log analytics, and automated testing, and communicate areas for improvement
Job Reference #
314132BR
City
Pune
Job Type
Full Time
Your team
We are a multinational team with diverse backgrounds, and we strongly value different perspectives, skills, and close collaboration, which help us build the best product for our clients. If your interests and skills closely match a good part of the description below – don’t hesitate to apply!
Your expertise
• hands-on cloud experience (Azure preferred) using both the cloud portal and the CLI to deploy, monitor, troubleshoot, and enhance the services our products rely on
• plotting metrics and creating dashboards to monitor service health
• identifying signals from logs or metrics and creating appropriate alerts
• experience working with Azure and third-party APIs to develop automation scripts or retrieve data about Azure services
• experience working with Azure and proficiency in running container workloads
• identifying cloud cost optimizations while balancing the reliability and availability of services
• experience with managing and operating cloud workspaces (e.g. Databricks)
• proficiency with a high-level programming language (Python preferred) and experience with DevOps tooling (GitLab preferred)
About us
UBS is the world’s largest and the only truly global wealth manager. We operate through four business divisions: Global Wealth Management, Personal & Corporate Banking, Asset Management and the Investment Bank. Our global reach and the breadth of our expertise set us apart from our competitors.
We have a presence in all major financial centers in more than 50 countries.
Join us
At UBS, we embrace flexible ways of working when the role permits. We offer different working arrangements like part-time, job-sharing and hybrid (office and home) working. Our purpose-led culture and global infrastructure help us connect, collaborate, and work together in agile ways to meet all our business needs.
From gaining new experiences in different roles to acquiring fresh knowledge and skills, we know that great work is never done alone. We know that it's our people, with their unique backgrounds, skills, experience levels and interests, who drive our ongoing success. Together we’re more than ourselves. Ready to be part of #teamUBS and make an impact?
Disclaimer / Policy statements
UBS is an Equal Opportunity Employer. We respect and seek to empower each individual and support the diverse cultures, perspectives, skills and experiences within our workforce.