Engineering Manager
Microsoft
Azure Portal! Azure Copilot! Azure Cloud Natives & Startups!
The mission of the Azure portal is to democratize the cloud via a fast, simple, productive and beautiful single-pane-of-glass, where customers can explore, learn, acquire, manage and monitor all Azure products and services!
The Azure Portal is the main interface of Azure that millions of customers use every month. It provides a unified management experience, targeted toward developers and IT professionals, that brings together diverse experiences in data centers around the globe. The portal is built to be highly extensible and provides a framework that is used by hundreds of teams to create their user experiences for Azure. We are agile, flexible, customer-focused, and constantly on top of technology trends! We are entrusted with enabling people and organizations to achieve more!
As an Engineering Manager on the team, you will be responsible for collaborating with partner teams to deliver key Azure experiences. You will not only envision, design, code, validate, and ship large features, but also own the overall architecture of the product, working with the leadership team to help drive product decisions based on your knowledge of the architecture and your deep understanding of customer scenarios and customer code architecture. You will have the opportunity to work with a wide variety of people from across the company, giving you wide exposure and a broad surface on which to make a positive impact.
Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Responsibilities
- Leads team on the disciplined use and improvement of artificial intelligence (AI) tools and practices across the software development lifecycle (SDLC). Guides team on proactively taking responsibility for the content of their AI-generated requirements, design documents, code, and other assets, and assisting other members of the team to do the same. Leads team on incorporating Responsible AI practices into the SDLC to ensure appropriate controls over AI-generated assets. Coaches team on applying SDLC and engineering health measures (e.g., Accelerate, SPACE framework, Engineering System Success Playbook [ESSP]) to guide improvements to processes and practices, especially those involving AI. Leads team on experimenting with AI tools and practices to improve their own capabilities, and providing recommendations to others on how to adopt them.
- Reviews debugging tools, tests, logs, telemetry, and other methods, and acts as an expert for others to proactively verify assumptions while developing code before issues occur across products in production. Conducts incident retrospectives to identify root causes of problems, and leads teams on implementing repair actions and identifying mechanisms to prevent incident recurrence. Tracks and attempts to minimize cost of debugging multiple scenarios. Leads team on proactively applying least-access principles, using logging, telemetry, and other appropriate mechanisms to investigate issues while retaining privacy and security, and driving those practices.
- Guides team within and across teams on producing extensible, maintainable, well-tested, secure, and performant code that adheres to design specifications. Leads efforts to continuously improve code performance, testability, maintainability, effectiveness, and cost, while learning about and accounting for relevant trade-offs. Identifies best practices and coding patterns and provides deep expertise on the coding and validation strategy. Defines or reuses quality metrics, best practices, and coding patterns to ensure testable code. Leads by example in best code-writing practices (e.g., leveraging state-of-the-art generative artificial intelligence [GenAI], approaches to source code organization, naming conventions). Leads team on identifying and anticipating blockers or unknowns during the development process, escalating them, communicating how they will impact timelines, and leading efforts to identify and implement strategies and/or opportunities to address them.
- Coaches others to review product code and test code to ensure it meets team standards, contains the correct test coverage, and is appropriate for the product or solution area. Leads team on bringing insight to code reviews to help improve code quality, coaching and providing feedback to develop other engineers' skills. Conducts code reviews in a timely fashion that helps accelerate the pace of development on the team. Considers diagnosability, reliability, testability, and maintainability when reviewing code and understands when code is ready to be shared or delivered. Applies and reviews for coding patterns, security risks, compliance issues, and best practices in code reviews, providing feedback on code to drive adherence to best practices. Advocates for and ensures team uses automated source code analysis tools that are incorporated into the build/development process.
- Guides team in creating a clear test strategy that ensures solution quality and prevents regressions from being introduced into existing code. Guides team on ensuring test plans incorporate security testing to validate security invariants (including negative cases). Leads adding new tests to cover gaps, deleting or fixing broken tests, and improving the speed, reliability, and defect localization of the overall test suite. Leads efforts to ensure the scalability and reliability of the test framework during design. Leads team on building testable code and considering testability during design for a set of solutions. Guides team on understanding the different types of tests that can be done on a particular system (e.g., unit tests), and maintains an up-to-date understanding of testing architectures used both across Microsoft and across the industry, applying them across the architecture as appropriate. Leads team on identifying and executing plans for redesigning or rearchitecting difficult or untestable sections of code for a set of solutions. Advocates for and leverages artificial intelligence (AI) tools for test automation.
- Guides teams on and leads identifying dependencies and incorporating them into the development of design documents for a product, application, service, or platform. Leads the active identification of other teams and technologies to leverage, how they interact, and where their own system or team can support others. Demonstrates deep understanding of upstream and downstream interactions between systems and ensures security, compliance, performance, and reliability can be achieved across the entire stack. Coordinates and collaborates with other teams to reach common goals where dependencies and validation concerns overlap. Enables communications and negotiates across teams to resolve conflicts among dependency ownership and required work. Drives agreements between dependent teams to align to the delivery schedule.
- Guides others on owning and leading efforts and discussions for architecture of aspects of complex products/solutions (e.g., design, cost). Looks for opportunities to leverage and incorporate open-source, off-the-shelf components and/or preexisting, managed services as part of the design of products/architecture. Guides others on leading the testing and exploration of various design options across a set of complex product/solution scenarios, ensuring the strengths and weaknesses of each option are outlined and making recommendations for which design option is best. Creates proposals for architecture and design documents, and leads testing of hypotheses and proposed complex solutions. Shares and acts on findings from investigations and owns design decisions and oversees the less experienced team members. Guides others and writes design documents that support user stories and other product requirements. Evaluates new technologies to solve classes of problems, and determines how to integrate these technologies within existing systems. Leads design discussions with the team and shares findings/learnings from investigations, owns design decisions. Guides employees to consider, and leads efforts to ensure system architecture and individual designs meet performance, scalability, resiliency, disaster recovery, cost of goods sold (COGS), and other requirements and expectations. Guides employees on upholding Microsoft standards of security, privacy, and other compliance requirements and expectations. Guides team on understanding the importance of building solutions that expand upon the work of others. Leads the refinement of products through data analytics and makes informed decisions in engineering products through data integration. Leads team on reviewing complex designs/architectures within and across teams to provide recommendations for improvements.
- Identifies and acts as an expert in best practices and shares information with other engineers for building code that is based on well-established methods and secure design principles while also applying best practices for new code development and formal validation of security invariants. Leads product development and scaling to customer requirements, and applies best practices for meeting scaling needs and performance expectations and security promises, and holds accountability for products that do not meet expectations. Advocates for use of GenAI tooling (e.g., GitHub Copilot) to enhance engineering productivity.
- Guides team and leads efforts to ensure the correct processes are followed to achieve a high degree of security, privacy, safety, and accessibility across solutions and teams. Guides team to create and assure presence of visible evidence (e.g., audit trail) to demonstrate compliance for products. Guides team to maintain a deep understanding of the implications of onboarding new technologies following expectations of compliance at Microsoft. Maintains an up-to-date understanding and advises engineering teams on both global and local regulations for technologies and system applications to ensure regulations are followed and met.
- Guides team on leading the identification of requirements for, and the comprehensive application of, automation within production and deployment across products, targeting zero-touch deployment when possible. Guides others to run code in simulated or other non-production environments to confirm functionality and error-free runtime across products. Ensures a continuous integration/continuous deployment (CI/CD) infrastructure is in place that promotes developer and operational agility (e.g., low lead-time-to-change metrics).
- Identifies skills needed and ensures engineering team's skills remain current by investing time and effort into being informed of current developments. Proactively seeks new knowledge, evaluating new trends, technical solutions, and patterns, assessing how to adapt them to current problems, and shares knowledge with other engineers. Coaches team on conducting learning and literary sessions to raise awareness on relevant engineering design principles (e.g., security, testability, performance, scalability, accessibility, product knowledge).
- Coaches team on understanding and applying security best practices and establishes code invariants to model "security as code," ensuring each layer is independently secure, and minimizing risk. Guides team on supporting and/or adopting, and potentially setting security standards for clear security code review practices for a set of products that align with design and engineering principles to raise the security hardening for both protections and detections. Leads team on proactively incorporating deployment gates on security controls, and scanners for a set of products to prevent regressions and/or vulnerabilities that would have customer impact. Ensures that their team includes required security monitoring to ensure detection of violations. Collaborates with relevant security partners to define security promises and security invariants for the design of a product/solution while factoring in attacker/investigator personas for security monitoring and telemetry needs, ensure threat models and premortems validate upstream and downstream assumptions and security invariants, establish security breach drills and security incident response processes (e.g., impact analysis, containment), and ensure that artificial intelligence (AI) safety features are implemented for the AI production systems tied to a set of products.
- Guides the decision-making process around tool development. Oversees resourcing of tool development and reuse within the team. Ensures the team identifies whether open sources or internal code are available to address coding needs for a set of products, and uses or reuses them in a responsible manner as applicable. Guides others on, and uses and enhances, or builds, new software developer tools to support easier, faster, and more effective software engineering for products. Develops skills in tools outside current areas of expertise.
- Leads team in collaborating with partner teams to ensure a set of products works well with the components of the partner team, ensuring proper end-to-end testing, live-site coverage, scalability, performance, and DRI escalation pathways are established before going live.
- Guides team on driving multiple groups' project plans, release plans, and work items in coordination with appropriate stakeholders (e.g., technical program managers). Breaks down very long-term project vision into milestones. Guides other members on project estimation. Anticipates future goals to guide future resources. Reviews, implements, and recommends updates to resource management in response to changing context. Guides team on driving efforts to ensure required security protections and detection processes are accounted for in planning. Guides team on driving efforts to ensure project plans adhere to security, privacy, and compliance requirements. Guides team on driving efforts to ensure all code for a set of products/solutions is properly flighted for quicker mitigation of production incidents. Leads team on calculating adequate capacity for planning, accounting for appropriate failover and backup/restore mechanisms for disaster recovery for a set of products and/or solutions. Drives efforts to ensure team makes considerations for efficient operation of a set of products and/or solutions after it is live. Leads team on proactively establishing rollback plans for a set of products and/or solutions.
- Acts as an expert to others on deploying to appropriate environments and on automating deployment tasks when possible to ensure efficiency. Establishes standards for the correct measures to deploy products. Leads team on and proactively follows safe change deployment best practices (e.g., ensuring that flights are set correctly) for their team to minimize adverse impact to users and other services. Optimizes deployments across products and components to meet differing business objectives. Guides others on and leads efforts to ensure that solutions are deployed in a safe manner, rolling out security-sensitive features only to applicable, relevant customers and scenarios to reduce the attack surface. Guides others on and leads efforts to proactively monitor dependency status and ensure that only the latest, secure versions are deployed. Leads team in defining when rollback plans should be enacted for a set of products. Ensures deployment infrastructure is in place to allow developers' private builds for a set of products/solutions to be tested in a production-like environment.
- Leveraging internal experimentation infrastructures, acts as an expert and guides team experiments that determine the effectiveness of changes using feature flags/flighting in their code, ensures the correct telemetry are in place for the features of interest, interprets results, and makes a decision on next steps or ship decision from results. Ensures there are time and resources for engineers to conduct experiments. Leads team on driving collaboration efforts with internal partners (e.g., Data Science, product managers) to ensure incorporation of success and guard rail metrics for experimentation.
- Managers deliver success through empowerment and accountability by modeling, coaching, and caring. Model: Live our culture. Embody our values. Practice our leadership principles. Coach: Define team objectives and outcomes. Enable success across boundaries. Help the team adapt and learn. Care: Attract and retain great people. Know each individual’s capabilities and aspirations. Invest in the growth of others.
- Ensures ongoing support for services or products is robust and effective through effective telemetry and incident response processes and adhering to security best practices for the most critical or highest impact spaces such as those with deep technical domain connections or a broad set of products or services at critical junctures (e.g., early in development, urgent time horizon). Establishes guidelines and policies for creating telemetry and novel processes or tools for telemetry, engaging in live site maintenance and responding to incidents. Provides technical oversight on telemetry in systems and products to provide feedback on system behavior such as performance, reliability, availability, utility, and implements safety mechanisms resulting in iterative feedback loops resulting in subsequent monitoring designs. Provides technical oversight and expertise in efforts to classify, analyze, and interpret data and analyses on a range of metrics (e.g., health of the system, where bugs might be occurring), and technical leadership for creating outputs (e.g., notifications, dashboards) that improve monitoring and investigating security-related concerns and scenarios, system monitoring and/or issue identification and mitigation. Ensures appropriate systems are enacted to reduce incident volume and severity, meet the strategic needs of the product or service, and drives a live site first mentality. Establishes relevant metrics to determine live site response capabilities and successful incident response. Guides and oversees creating and integrating iterative feedback loops on telemetry data for future product generation. Defines and disseminates best practices across Microsoft for proactively creating and improving troubleshooting guides (TSGs), wikis, tests, and telemetry to make on-call better, and establishing user-facing support documentation and additional test coverage that reduce the likelihood of future user-initiated incidents. Champions creating opportunities (e.g., lunch talks, automation, practices, tools) that can be leveraged to improve the live site experience and executes on them. Contributes to defining their division's strategy for enabling secure operations, security monitoring, and integration with live site investigation activities. Establishes best practices for considering and addressing the privacy implications of telemetry code changes and adding new data points.
- Integrates, designs, and reviews others work across a team or product to integrate logging and instrumentation for gathering telemetry data on system behavior such as performance, reliability, availability, utility, and safety mechanisms, and for allowing monitoring and investigating security-related concerns and scenarios for both live and A/B experiments for products, services, and offerings. Leads team on leveraging telemetry feedback and effectiveness to drive the improvement of subsequent monitoring designs. Ensures solutions are scalable, financially responsible, and meet capture/storage guidelines. Guides team and leads efforts to classify, and analyze complex data and analyses on a range of metrics (e.g., health of the system, where bugs might be occurring). Leads creation of outputs (e.g., notifications, dashboards) that improve monitoring and investigating security-related concerns and scenarios, system monitoring and/or issue identification and mitigation. Coaches team on considering the privacy implications of telemetry code changes, and of adding new data points.
- Acts as an expert for others' operations of live site service and following security best practices when responding quickly to mitigate complex issues that arise on a rotational, on-call basis, while using the minimum required permissions to do so. Reviews systematic issues and ensures solutions. Ensures playbooks are logical and understandable. Establishes standardized processes and guides others that implement solutions and mitigations to issues impacting performance or functionality of live site services. Reviews and writes complex incident postmortems and presents insights that drive changes to reduce or eliminate incidents across teams. Leads team on and proactively improves troubleshooting guides (TSGs), wikis, tests, and telemetry to make on-call better, and recommends user-facing support documentation and additional test coverage to reduce likelihood of future user-initiated incidents. Enables and may drive the enablement of secure operations, security monitoring, and integration with live site investigation activities. Leads team on identifying opportunities (e.g., lunch talks, automation, practices, tools) that can be leveraged to improve the live site experience and executing on them. Ensures comprehensive observability and monitoring is in place for all the services their team oversees, and advocates for the inclusion of automated incident response/open source tooling, mitigation, analysis, and self-healing in those services.
- Guides team and acts as an expert for designated responsible individual (DRI) and monitors other engineers across product lines, working on call to monitor system/product/service for degradation, downtime, or interruptions. Alerts stakeholders to status and establishes actions to restore system/product/service. Develops a playbook, guidelines, and processes for the team to resolve issues. Coordinates people and resources to ensure DRI responsibilities are covered across teams. Ensures responses are within service level agreement (SLA) timeframe. Has line of sight to incidents and plans to address emerging issues. Leads efforts to reduce incident volume, looking globally at incidents and providing broad resolutions. Ensures overall DRI effectiveness and health of their team.
- Guides partnership with appropriate internal (e.g., product manager, privacy/security subject matter expert, technical lead) and external (e.g. customer escalation team, public forums) stakeholders and leverages expertise to determine and confirm customer/user requirements and their feasibility within and across teams. Seeks and leverages a variety of feedback channels to incorporate customer insights into current and future designs or solution fixes. Guides team on incorporation of unwritten requirements, such as appropriate continuous feedback loops that measure actionable, quantitative (e.g., customer value, usage patterns, solution performance) and qualitative (e.g., accessibility, globalization) indicators of value. Determines additional critical metrics for success. Leads team in understanding and leading the provision of feedback on, and the advocacy of the security and privacy needs of the customer who will be using the set of solutions.
Qualifications
Required Qualifications:
- Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, HTTP, HTML, JavaScript, CSS, ASP.NET, Node.js, REST APIs
- OR equivalent experience.
- 2+ years people management experience.
Preferred Qualifications:
- Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, HTTP, HTML, JavaScript, CSS, ASP.NET, Node.js, REST APIs
- OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, HTTP, HTML, JavaScript, CSS, ASP.NET, Node.js, REST APIs
- OR equivalent experience.
- 5+ years people management experience.
Other Requirements:
- Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
- Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.
Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.