Google Cloud’s container platform for the next decade of AI

GKE for training and inference

For customers who value an open, portable, cloud-native, and customizable platform for their AI workloads, GKE is ideal. The tremendous growth in AI adoption continues to be reflected in how customers are using our products: over the last year, the use of GPUs and TPUs on Google Kubernetes Engine has grown more than 900%.

To better meet the needs of customers transforming their businesses with AI, we’ve built innovations that let you train and serve the very largest AI workloads cost-effectively and seamlessly. Let’s dive into each of those three areas: scale, cost efficiency, and ease of use.

Large-scale AI workloads
Many recent AI models demonstrate impressive capabilities, thanks in part to their very large size. As your AI models become larger, you need a platform that’s built to handle training and serving massive AI models. We continue to push the limits of accelerator-optimized hardware to make GKE an ideal home for your large-scale AI models:

Cloud TPU v5p, which we announced in December and which is now generally available, is our most powerful and scalable TPU accelerator to date. By leveraging TPU v5p on GKE, Google Cloud customer Lightricks has achieved a remarkable 2.5X speedup in training their text-to-image and text-to-video models compared to TPU v4.
A3 Mega, which we announced today, is powered by NVIDIA’s H100 GPUs and provides 2x more GPU-to-GPU networking bandwidth than A3, accelerating the time to train the largest AI models with GKE. A3 Mega will be generally available in the coming weeks.

Training the largest AI models often requires scaling far beyond a physical TPU. To enable continued scaling, last year we announced multi-slice training on GKE, which is generally available, enabling full-stack, cost-effective, large-scale training with near-linear scaling up to tens of thousands of TPU chips. We demonstrated this capability by training a single AI model using over 50,000 TPU v5e chips while maintaining near-ideal scaling performance.
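One way to make "near-linear scaling" concrete is to compare the measured speedup against the ideal speedup you would expect if throughput grew in exact proportion to chip count. The sketch below is only an illustration of that arithmetic: the function name and the throughput numbers are hypothetical and are not measurements from the training run described above.

```python
def scaling_efficiency(base_chips, base_throughput, scaled_chips, scaled_throughput):
    """Return measured speedup divided by ideal (perfectly linear) speedup.

    A value of 1.0 means perfectly linear scaling; values just below 1.0
    are what "near-linear" refers to.
    """
    ideal_speedup = scaled_chips / base_chips
    actual_speedup = scaled_throughput / base_throughput
    return actual_speedup / ideal_speedup


# Hypothetical measurements (for illustration only).
example_efficiency = scaling_efficiency(
    base_chips=256,
    base_throughput=1000.0,     # e.g. training samples per second
    scaled_chips=4096,
    scaled_throughput=15200.0,
)
print(f"scaling efficiency: {example_efficiency:.2%}")
```

Tracking this ratio as you add slices is a simple way to spot the point where communication overhead starts to erode the benefit of more chips.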

Cost-efficient AI workloads
As AI models continue to grow, customers face many challenges in scaling cost-effectively. For example, AI container images can be massive, causing cold-start times to balloon. Keeping AI inference latency low requires overprovisioning to handle unpredictable load, and slow cold-start times force you to overprovision even more. All of this leads to under-utilization and unnecessary costs.
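To see why slow cold starts drive overprovisioning, consider a deliberately simplified capacity model: if load can ramp up while a new replica is still starting, the replicas that are already warm have to absorb that extra load. The function name, parameters, and numbers below are hypothetical, and real autoscaling behavior depends on many more factors.

```python
import math


def warm_replicas_needed(current_rps, rps_per_replica, ramp_rps_per_s, cold_start_s):
    """Estimate how many replicas must stay warm under a simple linear-ramp model.

    The headroom is the extra load that can arrive during one cold start,
    because no newly requested replica can serve traffic before then.
    """
    headroom_rps = ramp_rps_per_s * cold_start_s
    return math.ceil((current_rps + headroom_rps) / rps_per_replica)


# Hypothetical workload: 100 requests/s now, 10 requests/s per replica,
# and load that can grow by 2 requests/s every second.
slow_start = warm_replicas_needed(100, 10, 2, cold_start_s=300)
fast_start = warm_replicas_needed(100, 10, 2, cold_start_s=30)
print(f"warm replicas with a 300 s cold start: {slow_start}")
print(f"warm replicas with a 30 s cold start:  {fast_start}")
```

Under these assumptions, cutting the cold-start time shrinks the idle headroom you pay for, which is the effect that faster image and model loading is aiming at.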

GKE now supports container and model preloading, which accelerates workload cold start, enabling you to improve GPU utilization and save money while keeping AI inference latency low. When creating a GKE node pool, you can now preload a container image or model data in new nodes to achieve much faster workload deployment, autoscaling, and recovery from disruptions like maintenance events. Vertex AI’s prediction service, which is built on GKE, found container preloading resulted in much faster container startup:

“Within Vertex AI’s prediction service, some of our container images can be quite large. After we enabled GKE container image preloading, our 16GB container images were pulled up to 29x faster in our tests.” – Shawn Ma, Software Engineer, Vertex AI

For AI workloads that have highly variable demand such as low-volume inference or notebooks, a GPU may sit idle much of the time. To help you run more workloads on the same GPU, GKE now supports GPU sharing with NVIDIA Multi-Process Service (MPS). MPS enables concurrent processing on a single GPU, which can improve GPU efficiency for workloads with low GPU resource usage, reducing your costs.
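As a rough sketch of what requesting a shared GPU might look like, the snippet below builds a Pod manifest as a Python dictionary and prints it as JSON. The node selector label keys, the `hostIPC` setting, and the helper name `mps_pod_manifest` are assumptions based on how GKE GPU sharing is commonly configured; check the current GKE documentation for the exact labels and node pool settings before relying on them.

```python
import json


def mps_pod_manifest(name, image, max_shared_clients):
    """Build a Pod manifest that targets nodes offering MPS-shared GPUs.

    The label keys below are assumptions; verify them against the GKE docs.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Assumed requirement so the container can reach the MPS daemon.
            "hostIPC": True,
            "nodeSelector": {
                "cloud.google.com/gke-gpu-sharing-strategy": "mps",
                "cloud.google.com/gke-max-shared-clients-per-gpu": str(max_shared_clients),
            },
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Each Pod requests one GPU share via the standard GPU resource name.
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }
            ],
        },
    }


# Hypothetical workload name and image.
manifest = mps_pod_manifest("low-volume-inference", "example.com/inference:latest", 3)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML.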

To maximize the cost efficiency of AI accelerators during model training, it’s important to minimize the time an application is waiting to fetch data. To achieve this, GKE supports GCS FUSE read caching, which is now generally available. GCS FUSE read caching uses a local directory as a cache to accelerate repeat reads for small and random I/Os, increasing GPU and TPU utilization by loading your data faster. This reduces the time to train a model and delivers up to 11x more throughput.
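The idea of using a local directory as a cache for repeat reads can be illustrated with a small read-through cache. This is a conceptual sketch only, not how GCS FUSE is implemented: the class name, the `fetch_remote` callback, and the object names are all hypothetical, and a fake fetch function stands in for the bucket.

```python
import hashlib
import tempfile
from pathlib import Path


class ReadThroughCache:
    """Serve repeat reads from a local directory, fetching remotely only on a miss."""

    def __init__(self, cache_dir, fetch_remote):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch_remote = fetch_remote  # callable: object name -> bytes
        self.hits = 0
        self.misses = 0

    def _path_for(self, object_name):
        # Hash the name so slashes in object names never create nested paths.
        digest = hashlib.sha256(object_name.encode("utf-8")).hexdigest()
        return self.cache_dir / digest

    def read(self, object_name):
        cached = self._path_for(object_name)
        if cached.exists():
            self.hits += 1
            return cached.read_bytes()
        self.misses += 1
        data = self.fetch_remote(object_name)
        cached.write_bytes(data)
        return data


def demo():
    """Read the same object twice and report (equal, hits, misses, remote calls)."""
    remote_calls = []

    def fake_fetch(name):
        remote_calls.append(name)
        return f"contents of {name}".encode("utf-8")

    with tempfile.TemporaryDirectory() as tmp:
        cache = ReadThroughCache(tmp, fake_fetch)
        first = cache.read("train/shard-0001")
        second = cache.read("train/shard-0001")
        return first == second, cache.hits, cache.misses, len(remote_calls)


print(demo())
```

The second read never touches the remote source, which is why training jobs that revisit the same files across epochs tend to benefit most from this kind of caching.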

Ease of use for AI workloads
With GKE, we believe achieving AI scale and cost efficiency shouldn’t be difficult. GKE makes obtaining GPUs for AI training workloads easy by using Dynamic Workload Scheduler, which has been transformative for customers like Two Sigma:

“Dynamic Workload Scheduler improved on-demand GPU obtainability by 80%, accelerating experiment iteration for our researchers. Leveraging the built-in Kueue and GKE integration, we were able to take advantage of new GPU capacity in Dynamic Workload Scheduler quickly and save months of development work.” – Alex Hays, Software Engineer, Two Sigma

For customers who want Kubernetes with a fully managed mode of operation, GKE Autopilot now supports NVIDIA H100 GPUs, TPUs, reservations, and Compute Engine committed use discounts (CUDs).

Traditionally, using a GPU required installing and maintaining the GPU driver on each node. However, GKE can now automatically install and maintain GPU drivers, making GPUs easier to use than ever before.

The enterprise platform for Day Two and beyond

Google Cloud’s managed container platform helps builders get started and scale up AI workloads. But while AI workloads are a strategic priority, there remains critical management and operations work in any enterprise environment. That’s why we continue to launch innovative capabilities that support all modern enterprise workloads.

This starts with embedding AI directly into our cloud. Gemini Cloud Assist helps you boost Day-two operations by:

Optimizing costs: Gemini will help you identify and address dev/test environments left running, forgotten clusters from experiments, and clusters with excess resources.
Troubleshooting: get a natural language interpretation of the logs in Cloud Logging.
Synthetic Monitoring: using natural language, you can now describe the target and user journey flows that you’d like to test, and Gemini will generate a custom test script that you can deploy or configure further based on your needs.

And it’s not just Day-two operations: Gemini Cloud Assist can also help you deploy three-tier architecture apps, understand Terraform scripts, and more, drastically simplifying design and deployment.

While AI presents a thrilling new frontier, we have not lost focus on the crucial elements of a container platform that serves modern enterprises. We’ve continued to invest in foundational areas that ensure the stability, security, and compliance of your cloud-native applications, and we’re excited to introduce the following preview launches:

GKE threat detection, which identifies common container runtime attacks, analyzes suspicious code, and even uses natural language processing to pinpoint malicious scripts. And this is all integrated with Security Command Center for a comprehensive, cohesive approach to security.
GKE compliance, a fully managed compliance service that automatically delivers end-to-end coverage from the cluster to the container, scanning for compliance against the most important benchmarks. Near-real-time insights are always available in a centralized dashboard and we produce compliance reports automatically for you.