Member of Technical Staff - Data Infrastructure Engineer (DevOps|SRE|Platform Engineering|MLOps)
Microsoft
New York City, New York, United States
Overview
As Microsoft continues to lead the frontier of artificial intelligence, we are seeking passionate and driven engineers to solve some of the most challenging and impactful AI problems of our time. Our vision is bold: to build intelligent systems across agents, applications, services, and infrastructure — and to make this intelligence universally accessible for consumers, businesses, and developers alike.
Microsoft AI (MAI) is looking for an experienced Data Infrastructure Engineer to join the team behind personal AI and Copilot systems. We are building mission-critical platform components that drive data pipelines, enable seamless human-AI interactions, and power the evolution of intelligent systems. This role blends platform engineering, DevOps/SRE practices, and MLOps to support large-scale data workflows and AI model development.
You’ll bring technical depth, a passion for automation and observability, fluency in distributed systems, and the creativity to architect solutions that scale. Just as importantly, you’ll bring empathy, a collaborative spirit, and a growth mindset to support a world-class engineering culture.
This position is based in New York, NY or Redmond, WA, with an in-office requirement of 3 days per week.
Qualifications
Required Qualifications:
- Bachelor’s degree in Computer Science, Mathematics, or a related field AND 4+ years of experience in a data infrastructure, DevOps, SRE, or MLOps role supporting high-volume, low-latency data systems
- OR equivalent experience
- 3+ years of experience managing and scaling distributed systems, from bare metal to Kubernetes, including deep knowledge across the full stack (UI, middleware, platform services).
- 2+ years building and deploying containerized applications with Kubernetes and Helm/Kustomize.
- Proficiency in scripting and automation using languages such as Python, Bash, or PowerShell, with proven experience automating operational tasks, including health checks, alerting, and observability for data and ML systems.
- Demonstrated success troubleshooting and supporting critical production systems, and experience managing CI/CD pipelines and release automation.
Preferred Qualifications:
- Experience with Azure, AWS, or GCP and cloud-native data infrastructure.
- Hands-on experience with modern data storage and processing technologies, including relational and NoSQL databases, key-value stores, Spark compute engines, distributed file systems such as HDFS and ADLS Gen2, and messaging systems such as Event Hubs, Kafka, and RabbitMQ.
- Experience collaborating with Data Engineers, Data Scientists, ML Engineers, and Networking and Security teams.
- Familiarity with modern web stacks such as TypeScript, Node.js, React, and PHP (a plus).
- Understanding of MLOps principles: model training pipelines, artifact versioning, and experiment tracking.
- Familiarity with agentic workflows, deep learning, or AI frameworks is an advantage.
- Practical experience using LLMs (e.g., GPT-based models) in daily workflows — such as automating documentation, code generation, code review, or operational intelligence.
- Demonstrated understanding of prompt engineering techniques to effectively design, optimize, and evaluate interactions with large language models (LLMs).
- Ability to resolve complex performance and scalability issues across services and infrastructure layers.
- Interpersonal and communication skills, with a passion for continuous learning and mentorship.
- Experience applying LLMs to accelerate DevOps tasks, enhance incident response, or streamline cross-functional collaboration is a strong plus.
Data Engineering IC4 - The typical base pay range for this role across the U.S. is USD $119,800 - $234,700 per year. A different range applies to specific work locations within the San Francisco Bay Area and the New York City metropolitan area; the base pay range for this role in those locations is USD $158,400 - $258,000 per year.
Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay
Microsoft will accept applications for the role until June 9, 2025.
#MicrosoftAI #Copilot
Responsibilities
- Design, build, and maintain scalable, reliable, and observable data and ML infrastructure that powers mission-critical AI applications.
- Implement DevOps and SRE best practices, including automated deployments, service monitoring, and incident response.
- Develop self-service tooling and workflows that streamline developer and researcher productivity.
- Create robust CI/CD pipelines and automate infrastructure provisioning using Infrastructure as Code (Bicep, Terraform, ARM).
- Collaborate closely with AI researchers, platform engineers, and application developers to deliver seamless and secure data workflows.
- Participate in technical design reviews and contribute to maintaining a clean, secure, and well-documented codebase.
- Proactively identify and resolve bottlenecks and inefficiencies in data pipelines and infrastructure.
- Embody and promote Microsoft’s culture and values of respect, integrity, accountability, and inclusion.