
New features to run AI more efficiently on fully managed GKE


Kubernetes is a popular way to run AI workloads like training and large language model (LLM) serving, including our new open model Gemma. Google Kubernetes Engine (GKE) in Autopilot mode provides a fully managed Kubernetes platform that offers the power and flexibility of Kubernetes without the need to worry about compute nodes, so you can focus on delivering your own business value through AI. Today we’re excited to announce the new Accelerator compute class in Autopilot, which improves GPU support with resource reservation capabilities and a lower price for most GPU workloads (you can opt in to this pricing today, and eventually all workloads will be migrated). In addition, a new Performance compute class enables high-performance workloads to run on Autopilot mode at scale. Both compute classes also have more available ephemeral storage right on the boot disk, giving you more room to download AI models and other artifacts before needing to configure additional storage via generic ephemeral volumes. With these enhancements, using our fully managed Kubernetes platform for inference and other compute-intensive workloads is even better.
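To give a flavor of the new Performance compute class, here’s a minimal Pod sketch that selects it via node selectors. The Pod name, machine family, container image, and resource requests are illustrative placeholders; check the GKE documentation for the machine families available in your region:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: perf-workload
spec:
  nodeSelector:
    # Request the new Performance compute class.
    cloud.google.com/compute-class: Performance
    # Machine family to run on (placeholder; pick one available in your region).
    cloud.google.com/machine-family: c3
  containers:
  - name: app
    # Placeholder image; substitute your own high-performance workload.
    image: us-docker.pkg.dev/my-project/my-repo/my-app:latest
    resources:
      requests:
        cpu: "20"
        memory: 80Gi
```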

With GKE running in Autopilot mode you avoid the need to specify and provision nodes upfront, and can focus on building the workload and creating your own business value. As a fully managed platform, once your workload is built you can run it with less operational overhead. Today’s news sweetens the deal even further.

Lower-priced GPUs, better discounts

We’re lowering the price for the majority of GPU workloads running on GKE in Autopilot mode, and moving to a new billing model that improves compatibility with other products and experiences in Google Cloud. Now you can move workloads between the Standard and Autopilot modes of GKE, as well as to and from Compute Engine VMs, and keep your existing reservations and committed use discounts.
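For example, a GPU workload on the Accelerator compute class can consume capacity from a specific Compute Engine reservation through node selectors. A minimal sketch, assuming a pre-created reservation (the reservation name, GPU type, and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: reserved-gpu-workload
spec:
  nodeSelector:
    cloud.google.com/compute-class: Accelerator
    # GPU type (placeholder; e.g. nvidia-l4 or nvidia-tesla-t4).
    cloud.google.com/gke-accelerator: nvidia-l4
    # Consume capacity from a specific, pre-created Compute Engine
    # reservation; "my-reservation" is a placeholder name.
    cloud.google.com/reservation-name: my-reservation
    cloud.google.com/reservation-affinity: "specific"
  containers:
  - name: app
    image: us-docker.pkg.dev/my-project/my-repo/my-app:latest  # placeholder
    resources:
      limits:
        nvidia.com/gpu: 1
```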

When you enable the new pricing model (by specifying the Accelerator compute class, as illustrated in the code sample below), resources are billed based on Compute Engine VM resources, plus a premium for the fully managed experience. Today the new pricing model is opt-in; after April 30, versions of GKE will be released that automatically migrate GPU workloads to this new model. As a result of these changes, the price for most workloads is lower (workloads on NVIDIA T4 GPUs with fewer than 2 vCPUs per GPU see a slight price increase).

Here’s a comparison of hourly prices for several workload sizes in the us-central1 region, covering GPU, CPU, and memory resources (storage is billed separately):

| GPU | Pod resource requests | VM resources | Old price (GPU Pod) | New price (Accelerator compute class Pod) |
|---|---|---|---|---|
| NVIDIA A100 80GB | 1 GPU, 11 vCPU, 148 GB memory | 1 GPU, 12 vCPU, 170 GB memory | $6.09 | $5.59 |
| NVIDIA A100 40GB | 1 GPU, 11 vCPU, 74 GB memory | 1 GPU, 12 vCPU, 85 GB memory | $4.46 | $4.09 |
| NVIDIA L4 | 1 GPU, 11 vCPU, 40 GB memory | 1 GPU, 12 vCPU, 48 GB memory | $1.61 | $1.12 |
| NVIDIA T4 | 1 GPU, 1 vCPU, 1 GB memory | 1 GPU, 2 vCPU, 2 GB memory | $0.46 | $0.47 |
| NVIDIA T4 | 1 GPU, 20 vCPU, 40 GB memory | 1 GPU, 22 vCPU, 48 GB memory | $1.96 | $1.37 |

When using the Accelerator compute class, the workload is billed for (and can utilize) the complete node VM capacity, including bursting into resources allocated for system Pods.

To opt in to these changes today, upgrade to version 1.28.6-gke.1095000 or later, and add the compute-class selector to your existing GPU workloads, like so:
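In the manifest below, the Pod name, container image, and GPU type are illustrative placeholders; substitute the values from your own workload:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  nodeSelector:
    # Selecting the Accelerator compute class opts the workload in
    # to the new billing model.
    cloud.google.com/compute-class: Accelerator
    # GPU type (placeholder; e.g. nvidia-l4 or nvidia-tesla-t4).
    cloud.google.com/gke-accelerator: nvidia-l4
  containers:
  - name: inference-server
    image: us-docker.pkg.dev/my-project/my-repo/inference:latest  # placeholder
    resources:
      limits:
        nvidia.com/gpu: 1
```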

