Site Reliability Engineer
Millennium Management
We are seeking an experienced Site Reliability Engineer (SRE) specializing in observability to join our team. This role is responsible for designing and implementing observability solutions that keep our infrastructure reliable, performant, and scalable. The role also involves reviewing our current observability stack, planning future enhancements, implementing new solutions, and collaborating with developers to create actionable insights through effective dashboards and automated alerting systems. The ideal candidate will have a strong background in analytics and experience with advanced monitoring techniques to help us achieve metrics baselining, anomaly detection, and enhanced correlation and causation analysis.
Responsibilities
- Conduct thorough reviews of our existing observability stack to identify areas for improvement and optimization
- Collaborate with the team to plan and design the next version of our observability infrastructure
- Assist in the implementation of the new observability stack, ensuring seamless integration and minimal disruption
- Create and maintain insightful and actionable dashboards that provide clear visibility into system performance without adding unnecessary noise
- Review existing alerts and work closely with developers to automate alert handlers for self-healing systems
- Use your analytics experience to perform metrics baselining and anomaly detection, helping keep our systems operating optimally
- Explore and integrate AI tools to enhance our correlation and causation analysis capabilities
- Develop and maintain necessary components such as metrics exporters and self-service tools
Required Skills
- Demonstrated experience as a Site Reliability Engineer, Observability Engineer, or in a similar role within a software development environment
- Hands-on experience with the Prometheus ecosystem
- Ability to design and develop code in Python or Go
- Strong drive to automate manual operations and processes
- Strong understanding of Linux operating systems
- Hands-on experience with configuration management and infrastructure-as-code tools such as Ansible, SaltStack, or Terraform
- Experience in managing and scaling distributed systems
- Strong sense of ownership and integrity, demonstrated through clear communication and collaboration
- Excellent troubleshooting and problem-solving skills
- Ability to communicate complex concepts clearly to both stakeholders and developers