Cloud TPU v5e is generally available

Customers deploy Cloud TPU v5e for AI training and serving

Customers rely on large clusters of Cloud TPU v5e to train and serve cutting-edge LLMs quickly and efficiently. AssemblyAI, for example, is working to democratize access to advanced AI speech models, and has achieved remarkable results on TPU v5e.

“We recently had the opportunity to experiment with Google’s new Cloud TPU v5e in GKE to see whether these purpose-built AI chips could lower our inference costs. After running our production Speech Recognition model on real-world data in a real-world environment, we found that TPU v5e offers up to 4x greater performance per dollar than alternatives.” – Domenic Donato, VP of Technology at AssemblyAI

Separately, in early October, we collaborated with Hugging Face on a demo that showcases using TPU v5e to accelerate inference on Stable Diffusion XL 1.0 (SDXL). Hugging Face Diffusers now supports serving SDXL via JAX on Cloud TPUs, enabling inference that is both high-performance and cost-effective for content-creation use cases. For text-to-image generation workloads, for instance, running SDXL on a TPU v5e with eight chips can generate eight images in the same time it takes one chip to create a single image.

The Google Bard team has also been using Cloud TPU v5e for training and serving its generative AI chatbot.

“TPU v5e has been powering both ML training and inference workloads for Bard since the early launch of this platform. We are very delighted with the flexibility of TPU v5e that can be used for both training runs at a large scale (thousands of chips) and for efficient ML serving that supports our users in over 200 countries and in over 40 languages.” – Trevor Strohman, Distinguished Software Engineer, Google Bard

Start powering your AI production workloads using TPU v5e today

AI acceleration, performance, efficiency, and scale continue to play vital roles in the pace of innovation, especially for large models. Now that Cloud TPU v5e is GA, we cannot wait to see how customers and ecosystem partners push the boundaries of what’s possible. Get started with Cloud TPU v5e today by contacting a Google Cloud sales specialist.

1. MLPerf™ v3.1 Training Closed, multiple benchmarks as shown. Retrieved November 8th, 2023 from mlcommons.org. Results 3.1-2004. Performance per dollar is not an MLPerf metric. TPU v4 results are unverified: not verified by MLCommons Association. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.
2. Scaling factor is the ratio of (throughput at a given cluster size) / (throughput at the base cluster size). Our base cluster size is one v5e pod (i.e., 256 chips). Example: at 512-chip scale, we have 1.9 times the throughput at 256-chip scale, which gives a scaling factor of 1.9.
3. To derive TPU v5e performance per dollar, we divide the training throughput per chip (measured in tokens/sec) by the on-demand list price of $1.20, which is the publicly available price per chip-hour (US$) for TPU v5e in the us-west4 region. To derive TPU v4 performance per dollar, we divide the training throughput per chip (measured in tokens/sec; internal Google Cloud results, not verified by MLCommons Association) by the on-demand list price of $3.22, the publicly available on-demand price per chip-hour (US$) for TPU v4 in the us-central2 region.
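
For readers who want to reproduce the arithmetic in footnotes 2 and 3, here is a minimal Python sketch. The two prices come from footnote 3. The function names and every throughput value are hypothetical placeholders chosen for illustration, not measured results. The `tokens_per_dollar` helper is our own addition: it converts the footnote's ratio into tokens per US dollar by multiplying by 3,600 seconds per chip-hour.

```python
# On-demand list prices per chip-hour (US$), as quoted in footnote 3.
V5E_PRICE_PER_CHIP_HOUR = 1.20  # TPU v5e, us-west4
V4_PRICE_PER_CHIP_HOUR = 3.22   # TPU v4, us-central2


def scaling_factor(throughput: float, base_throughput: float) -> float:
    """Footnote 2: throughput at a given cluster size over throughput at the base size."""
    if base_throughput <= 0:
        raise ValueError("base_throughput must be positive")
    return throughput / base_throughput


def perf_per_dollar(tokens_per_sec_per_chip: float, price_per_chip_hour: float) -> float:
    """Footnote 3: per-chip training throughput divided by the chip-hour list price."""
    if price_per_chip_hour <= 0:
        raise ValueError("price_per_chip_hour must be positive")
    return tokens_per_sec_per_chip / price_per_chip_hour


def tokens_per_dollar(tokens_per_sec_per_chip: float, price_per_chip_hour: float) -> float:
    """Same ratio expressed as tokens per US$ (assumes 3600 seconds per chip-hour)."""
    return 3600 * perf_per_dollar(tokens_per_sec_per_chip, price_per_chip_hour)


if __name__ == "__main__":
    # Hypothetical cluster throughputs (tokens/sec), chosen only so the ratio
    # matches the 1.9 example in footnote 2. These are NOT measured values.
    base_256_chips = 1000.0
    at_512_chips = 1900.0
    print("scaling factor:", scaling_factor(at_512_chips, base_256_chips))

    # Hypothetical per-chip throughput (tokens/sec). Substitute your own measurement.
    example_throughput = 1200.0
    print("v5e perf per dollar:", perf_per_dollar(example_throughput, V5E_PRICE_PER_CHIP_HOUR))
    print("v5e tokens per US$:", tokens_per_dollar(example_throughput, V5E_PRICE_PER_CHIP_HOUR))
```

To compare two chip generations, call `perf_per_dollar` once for each with its own measured per-chip throughput and list price, then divide one result by the other.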