Redmond, Washington, USA
24 days ago
Senior Software Engineer

Microsoft is a company where passionate innovators come to collaborate, envision what can be and take their careers further. This is a world of more possibilities, more innovation, more openness, and the sky is the limit thinking in a cloud-enabled world. 

  

The AI Platform organization at Microsoft builds the end-to-end Azure AI stack/PaaS (Platform as a Service) and is core to Azure’s innovation and differentiation, as well as the AI-related capabilities of all of Microsoft’s flagship products, from Office to Teams, to Xbox. We are the team building Azure OpenAI, Azure ML, AI Studio, Cognitive Services, and the global Azure AI infrastructure for running the largest AI workloads on the planet.  

  

We do not just value differences or different perspectives. We seek them out and invite them so we can tap into the collective power of everyone in the company. As a result, our customers are better served. 
 
Within AI Platform, the Azure ML Services team enables data scientists and developers to quickly and easily build, train, deploy, manage, and consume machine learning models. 

  

As part of Azure ML, the AI Infra team is looking for a Senior Software Engineer, with initial focus on the Scheduler subsystem. The scheduler is the “brains” of the AI Infra control plane. It governs access to the GPU and NPU capacity of the platform according to a complex system of placement constraints, preference rules, and dynamically interacting policies aimed to maximize hardware utilization and fulfill greatly varying needs of users and the AI Platform partner services in terms of workload types, prioritization, and capacity targeting flexibility. The scheduler manages quota, capacity reservations, SLA tiers, preemption, auto-scaling, and a wide range of configurable policies. Global scheduling is a distinctive feature that overcomes the regional segmentation of the Azure compute fleet by treating the GPU capacity as a single global virtual pool, which greatly increases capacity availability and utilization for major classes of ML workload. Our system can manage GPU capacity even outside the Azure datacenters.

           

To be able to manage the complexity of all scheduling policies and placement constraints and meet the expectations of high service reliability, availability, and throughput, we emphasize rigorous engineering, utmost precision and quality, and ownership—from feature design to livesite. Quality mindset, unit-testing proficiency, attention to detail, development process rigor are key for success in our mission-critical control plane space.

  

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond. 

  

In alignment with our Microsoft values, we are committed to cultivating an inclusive work environment for all employees to positively impact our culture every day. 

  

By applying to this U.S. based position, while remote work is possible, relocation does not apply/is not provided for the role.

Confirm your E-mail: Send Email