Software Engineer, Data License Monitoring & Resiliency

Bloomberg

Software Engineering
New York, NY, USA
Posted on Mar 28, 2026
Bloomberg delivers billions of data points to hundreds of thousands of customers daily. When markets move, our customers need their data immediately and without fail. That's not a nice-to-have. It's the whole product.
The Data License Monitoring & Resiliency team exists to make sure that promise holds. We build the systems that watch, test, and heal our production infrastructure before customers ever notice a problem. And we're using AI and large language models to fundamentally change how we handle reliability and customer support.
Right now, too much of our support workflow depends on humans doing repetitive, context-heavy work: reading tickets, diagnosing known issues, walking customers through the same resolution paths. We're building AI-powered systems to change that. Think LLM-driven chatbots that understand our product deeply enough to resolve support issues autonomously, context-aware triage systems that route and diagnose problems before a human ever gets involved, and intelligent knowledge bases that learn from every resolved incident.
On the reliability side, we're applying machine learning to anomaly detection, capacity forecasting, and incident response so we catch and fix problems faster than any manual process allows.
What you'd actually be working on:
  • AI-powered support automation. Design and build LLM-based chatbots and support tools that handle customer issues end-to-end. You'll work on context retrieval, prompt engineering, response quality evaluation, and the feedback loops that make these systems smarter over time.
  • Intelligent reliability tooling. Build AI-driven anomaly detection that spots degradation patterns before they become incidents. Develop predictive models for capacity management and failure forecasting across production infrastructure.
  • Automated chaos testing and game days. Design and run resilience tests, then use ML to analyze the results and prioritize the improvements that matter most.
  • Architecture advisory. Work directly with application development teams across Data License to review system designs, identify reliability risks early, and advocate for patterns that make services observable, scalable, and resilient by default. You'll be the person teams come to when they want to build something that won't break at 3am.
  • Incident response automation. Shrink mean time to resolution (MTTR) by building intelligent runbooks that diagnose root causes and recommend or execute fixes without waiting for a human.
  • Toil elimination. If a human is doing it repeatedly, you'll be figuring out how to make a machine do it instead.
What you bring:
  • 4+ years writing production code in C, C++, Python, Java, or a comparable language
  • Degree in Computer Science, Engineering, Mathematics, or equivalent hands-on experience
  • Comfort working across the full stack, from application code down to infrastructure and hardware
  • A data-driven mindset: you measure before you optimize, and you're skeptical of gut-feel decisions
  • Strong communication skills. You'll be advising other teams on architecture decisions, so you need to explain trade-offs clearly and build consensus without authority.
  • Willingness to pick up new tools fast
What would set you apart:
  • Experience building with LLMs: prompt engineering, RAG pipelines, evaluation frameworks, or deploying conversational AI in production
  • Hands-on work with ML applied to operations: anomaly detection, predictive scaling, AIOps
  • Background in containerization and orchestration (Docker, Kubernetes, Mesos)
  • Chaos engineering or game day experience
  • Infrastructure-as-code and configuration management tooling
  • Track record defining and measuring SLIs/SLOs for production services
  • Experience in a consulting or advisory role where you influenced engineering decisions across multiple teams