We are in the midst of an exciting era of AI-driven innovation and transformation. Today we announced AI Hypercomputer, a groundbreaking architecture that employs an integrated system of AI-optimized hardware, software, and consumption models. With AI Hypercomputer, enterprises everywhere can run on the same cutting-edge infrastructure that is already the backbone of Google’s internal AI/ML research, development, training, and serving.
But the overwhelming demand for TPUs and NVIDIA GPUs makes effective resource management more crucial than ever.
To address this, today we are excited to announce Dynamic Workload Scheduler, a new, simple, and powerful approach to get access to GPUs and TPUs. This post is a technical deep dive into what it is, how it works, and how you can use it today.
What is Dynamic Workload Scheduler?
Dynamic Workload Scheduler is a resource management and job scheduling platform designed for AI Hypercomputer. It improves your access to AI/ML resources, helps you optimize your spend, and can improve the experience of workloads such as training and fine-tuning jobs by scheduling all the accelerators they need simultaneously. Dynamic Workload Scheduler supports TPUs and NVIDIA GPUs, and brings scheduling advancements from Google's ML fleet to Google Cloud customers. It is also integrated into many of your preferred Google Cloud AI/ML services: Compute Engine Managed Instance Groups, Google Kubernetes Engine, Vertex AI, and Batch, with more planned.
Two modes: Flex Start and Calendar
Dynamic Workload Scheduler introduces two modes: Flex Start mode for enhanced obtainability and optimized economics, and Calendar mode for high predictability on job start times.
1. Flex Start mode: Efficient GPU and TPU access with better economics
Flex Start mode is designed for fine-tuning models, experimentation, shorter training jobs, distillation, offline inference, and batch jobs. With Flex Start mode, you can request GPU and TPU capacity as your jobs are ready to run.
With Dynamic Workload Scheduler in Flex Start mode, you submit a GPU capacity request for your AI/ML jobs by indicating how many GPUs you need, for how long, and in which region you would prefer them. Dynamic Workload Scheduler persists the request; once the capacity becomes available, it automatically provisions your VMs, enabling your workloads to run continuously for the entire duration of the capacity allocation. Dynamic Workload Scheduler supports capacity requests for up to seven days, with no minimum duration requirement. You can request capacity for as little as a few minutes or hours; typically, the scheduler can fulfill shorter requests more quickly than longer ones.
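To make the shape of a Flex Start request concrete, here is a minimal Python sketch of the three inputs described above (accelerator count, duration, and preferred region) together with the seven-day upper bound. The class name FlexStartRequest and its fields are hypothetical and used only for illustration; this is not the Dynamic Workload Scheduler API.

```python
from dataclasses import dataclass
from datetime import timedelta

# Upper bound on request duration stated in this post.
MAX_DURATION = timedelta(days=7)


@dataclass(frozen=True)
class FlexStartRequest:
    """Hypothetical model of a Flex Start capacity request (not a real API)."""

    accelerator_count: int
    duration: timedelta
    region: str

    def __post_init__(self):
        # Reject requests that cannot be valid before they are submitted.
        if self.accelerator_count < 1:
            raise ValueError("accelerator_count must be at least 1")
        if self.duration <= timedelta(0):
            raise ValueError("duration must be positive")
        if self.duration > MAX_DURATION:
            raise ValueError("duration cannot exceed seven days")
        if not self.region:
            raise ValueError("region is required")


# Example: eight accelerators for six hours in an illustrative region.
request = FlexStartRequest(
    accelerator_count=8,
    duration=timedelta(hours=6),
    region="us-central1",
)
```

There is no lower bound beyond "positive" in this sketch, which mirrors the absence of a minimum duration requirement.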
If your training job finishes early, you can simply terminate the VMs to free up the resources and only pay for what your workload actually consumed. You no longer need to hold onto idle resources just to use them later.
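As a rough illustration of the pay-for-what-you-use point, the sketch below compares the cost of a full requested duration with the cost of a run that ends early. The function name, the hourly rate, and the simple per-hour linear pricing are all assumptions made for this example; actual pricing and billing granularity may differ.

```python
from datetime import timedelta


def flex_start_cost(vm_count, hourly_rate_per_vm, run_time):
    """Cost of running vm_count VMs for run_time at a flat hourly rate (assumed model)."""
    hours = run_time / timedelta(hours=1)  # convert the duration to fractional hours
    return vm_count * hourly_rate_per_vm * hours


# Hypothetical numbers: 4 VMs at 10.0 per VM-hour, requested for 24 hours.
requested_cost = flex_start_cost(4, 10.0, timedelta(hours=24))

# The job finishes after 18 hours and the VMs are terminated.
actual_cost = flex_start_cost(4, 10.0, timedelta(hours=18))

savings = requested_cost - actual_cost
print(f"requested: {requested_cost}, actual: {actual_cost}, saved: {savings}")
```

Under these assumed numbers, terminating six hours early avoids paying for a quarter of the requested window.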
If you’re using GKE node pools for your AI/ML workloads, an easy way to use Dynamic Workload Scheduler is through orchestrators such as Kueue. Popular ML frameworks such as Ray, Kubeflow, Flux, PyTorch and other training operators are supported out of the box. Here are the steps to enable this:
Step 1: Create a node pool with the “enable-queued-provisioning” option enabled.