Cloud Solution Architecture - Infrastructure
Other Engineering, IT
Kuala Lumpur, Malaysia · Bangkok, Thailand · Jakarta, Indonesia
The Infrastructure Cloud Solution Architect (CSA) serves as a trusted technical advisor for Microsoft's most strategic and mission-critical customers. This role helps customers improve the reliability, resilience, security, performance, and operational excellence of their Azure environments through proactive assessments, technical guidance, incident leadership, and cross-functional collaboration. Working within a global follow-the-sun operating model, the CSA collaborates closely with customers, Microsoft Engineering, Support, and Customer Success teams across multiple regions and time zones to drive rapid incident resolution, operational improvements, and long-term business outcomes. Success requires deep technical expertise, strong customer advocacy, and the ability to navigate complex operational challenges while influencing stakeholders across diverse organizations and cultures.
Responsibilities
Trusted Advisor & Customer Advocacy
Act as a trusted technical advisor, helping customers improve the reliability, resiliency, security, performance, and operational maturity of mission-critical workloads running on Azure.
Advise customers and stakeholders on architecture, operations, and best practices aligned with the Azure Well-Architected Framework.
Actively listen to and understand customer priorities, advocate on their behalf within Microsoft, and drive outcomes measured through customer satisfaction, operational excellence, and business impact.
Build strong technical relationships with customers and Microsoft stakeholders, establishing credibility through deep technical expertise and trusted guidance.
Communicate complex technical concepts and recommendations in clear, actionable terms to both technical and executive audiences.
Incident Leadership & Operational Excellence
Lead complex troubleshooting efforts across infrastructure, platform, and application layers, including critical and high-severity incidents.
Operate effectively in high-stakes, customer-impacting incidents, combining platform expertise and customer business context to accelerate mitigation, recovery, and restoration of service.
Facilitate Root Cause Analysis (RCA) activities for critical incidents, helping customers identify corrective and preventative actions that reduce future risk.
Analyze support cases, operational telemetry, incident trends, and platform events to identify recurring risks and recommend proactive remediation measures.
Drive reduction of reactive operational demand through reliability-focused recommendations, operational maturity improvements, resiliency best practices, and service optimization initiatives.
Promote operational excellence across reliability, availability, security, performance, recoverability, and capacity management.
Proactive Risk Management & Continuous Improvement
Perform proactive health assessments, risk reviews, and operational analysis to identify opportunities for improvement and escalation prevention.
Maintain a culture of curiosity by looking beyond immediate symptoms and root causes to understand the systemic factors, historical decisions, and operational patterns behind them, and use that understanding to drive long-term improvements.
Correlate customer requirements, operational events, and platform signals into actionable recommendations with clear accountability and ownership.
Drive operational maturity through recommendations for observability, monitoring, automation, governance, reliability engineering practices, disaster recovery preparedness, and service management processes.
Utilize telemetry, monitoring platforms, observability tools, and query languages to investigate issues, identify trends, and develop actionable insights.
Customer Engagement & Service Delivery
Develop and maintain deep technical understanding of assigned customer environments, architectures, dependencies, and mission-critical workloads.
Create and maintain customer knowledge documentation, operational records (KnowMe), and workload profiles.
Deliver onboarding assessments and help define service delivery and improvement plans aligned with customer objectives.
Scope technical engagements, facilitate discussions on workstreams, prioritize recommendations, and align stakeholders on action plans and expected outcomes.
Track remediation progress and drive alignment across customers and Microsoft stakeholders.
Global Collaboration & Stakeholder Management
Operate effectively within a global follow-the-sun support model, collaborating with teams across multiple regions and time zones to ensure continuity of service for mission-critical workloads.
Maintain awareness of ongoing customer engagements, incidents, escalations, and engineering activities occurring outside local business hours, incorporating relevant developments into ongoing service delivery.
Drive effective cross-time-zone coordination through structured handoffs, action tracking, stakeholder alignment, and knowledge sharing.
Build strong partnerships across Microsoft Engineering, Support, Customer Success, Product Groups, and other stakeholders to accelerate issue resolution and drive customer outcomes.
Communicate complex technical and operational topics clearly across diverse technical, business, and cultural audiences.
Establish trusted technical relationships with both customers and Microsoft stakeholders, enabling effective collaboration during critical incidents, proactive engagements, and strategic initiatives.
Build and strengthen partnerships across Microsoft teams, including Engineering, Azure Engineering Direct (AED), Azure Rapid Response (ARR), Customer Success Account Managers (CSAMs), Support, and other stakeholders.
Collaborate effectively across teams, cultures, and organizational boundaries to drive customer success and operational improvements.
Success Measures
Success in this role is measured through:
Improvements in workload reliability, resiliency, security, and operational maturity.
Adoption of recommended architecture, operational practices, and remediation plans.
Reduction in customer-impacting incidents, repeat escalations, and operational risk.
Faster mitigation and recovery of critical incidents.
Effective coordination across global teams, ensuring seamless customer support and operational continuity across regions and time zones.
Increased customer satisfaction and trusted advisor influence.
Positive business outcomes through improvements in reliability, security, performance, capacity management, and service resilience.
Qualifications
- Bachelor’s Degree in Computer Science, Information Technology, Engineering, or a related field AND 7+ years of relevant experience supporting mission-critical production environments, OR equivalent practical experience
- Experience leading or coordinating Sev A / P1 incidents
- Experience providing recommendations to enterprise customers
- Experience improving reliability, resiliency, performance, security, or operational maturity
- Experience working across multiple time zones and globally distributed teams
- Experience coordinating multiple technical teams to resolve customer issues
- Experience with telemetry, monitoring, logging, and root-cause analysis
- Experience with DR, HA, BCP, and recovery planning
This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.
Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.