Senior AI Operations Engineer
Microsoft
The Security AI Platform team builds and operates production infrastructure that powers AI-native security capabilities at Microsoft scale. The AI Operations group is responsible for deployments, CI/CD pipelines, production reliability, and first-level on-call. We work closely with the Platform + Apps team, which develops the core product features.
We are seeking a Senior AI Operations Engineer to lead operational initiatives and serve as a technical anchor for the operations team. In this role, you will own complex infrastructure systems, drive deployment automation, lead incident response, and mentor junior engineers. You will be a hands-on technical leader who balances operational excellence with continuous improvement.
Responsibilities
- Lead Kubernetes operations: AKS cluster management, Helm chart development, node pool configuration, GPU resource allocation, and KEDA autoscaling
- Own CI/CD pipeline development: Azure DevOps/GitHub Actions pipelines, build optimization, deployment automation, and release orchestration
- Drive production deployments: implement canary/ring rollout strategies, manage deployment schedules, and execute safe rollback procedures
- Build and maintain observability systems: Prometheus metrics, Grafana dashboards, OpenTelemetry collectors, alerting rules, and log aggregation (Kusto/Loki)
- Lead incident response: participate in on-call rotation, diagnose production issues, coordinate resolution, and conduct post-incident reviews
- Debug and diagnose complex issues: analyze distributed traces, query Kusto/ADX logs, correlate metrics, and identify root causes
- Manage Infrastructure as Code: develop and maintain Bicep templates and Helm charts, and ensure environment consistency
- Ensure branch health: maintain PR validation pipelines, security scanning, automated testing gates, and merge policies
- Collaborate with Platform team: review service designs for operability, define monitoring requirements, and validate deployment procedures
- Mentor AI Operations Engineer II team members; conduct reviews and establish operational best practices
- Drive reliability improvements: capacity planning, performance tuning, and infrastructure optimization
Qualifications
Required Qualifications
- Bachelor's Degree in Computer Science or related technical field OR equivalent experience
- 4+ years technical engineering experience in DevOps, SRE, or platform operations
- 4+ years hands-on experience with Kubernetes in production environments
- 3+ years building and maintaining CI/CD pipelines
- 2+ years experience with production incident response and on-call responsibilities
Qualifications: Required Tools & Technologies
- Proficiency with Kubernetes: cluster operations, Helm, troubleshooting, and workload management
- Experience with CI/CD platforms: Azure DevOps, GitHub Actions, or similar pipeline tooling
- Familiarity with cloud platforms (Azure preferred): AKS, networking, and cloud-native services
- Infrastructure as Code: Bicep, Terraform, or Helm chart development
- Observability tooling: Prometheus, Grafana, distributed tracing, and log analytics
Other Requirements: The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.
Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.