Introducing Trillium, sixth-generation TPUs

Generative AI is transforming how we interact with technology while opening up tremendous opportunities for businesses to become more efficient. But these advances require ever greater compute, memory, and communication to train and fine-tune the most capable models and to serve them interactively to a global user population. For more than a decade, we at Google have been developing custom AI-specific hardware, Tensor Processing Units, or TPUs, to push forward the frontier of what is possible in scale and efficiency.

This hardware supported a number of the innovations we announced today at Google I/O, including new models like Gemini 1.5 Flash, Imagen 3, and Gemma 2; all of these models have been trained on and are served using TPUs. To deliver the next frontier of models and enable you to do the same, we’re excited to announce Trillium, our sixth-generation TPU, the most performant and most energy-efficient TPU to date.

Trillium TPUs achieve an impressive 4.7X increase in peak compute performance per chip compared to TPU v5e. We doubled the High Bandwidth Memory (HBM) capacity and bandwidth, and also doubled the Interchip Interconnect (ICI) bandwidth over TPU v5e. Additionally, Trillium is equipped with third-generation SparseCore, a specialized accelerator for processing ultra-large embeddings common in advanced ranking and recommendation workloads. Trillium TPUs make it possible to train the next wave of foundation models faster and serve those models with reduced latency and lower cost. Critically, our sixth-generation TPUs are also our most sustainable: Trillium TPUs are over 67% more energy-efficient than TPU v5e.

Trillium can scale up to 256 TPUs in a single high-bandwidth, low-latency pod. Beyond this pod-level scalability, with multislice technology and Titanium Intelligence Processing Units (IPUs), Trillium TPUs can scale to hundreds of pods, connecting tens of thousands of chips in a building-scale supercomputer interconnected by a multi-petabit-per-second datacenter network.
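To make the scaling arithmetic above concrete, here is a minimal sketch. The 256-chips-per-pod figure comes from the text; the pod counts in the loop are hypothetical values chosen only to illustrate "hundreds of pods", and the function name is ours, not part of any Google Cloud API.

```python
# Chips per pod, as stated in the text ("up to 256 TPUs in a single pod").
CHIPS_PER_POD = 256


def total_chips(num_pods: int, chips_per_pod: int = CHIPS_PER_POD) -> int:
    """Return the total chip count for a cluster of fully populated pods."""
    if num_pods < 0 or chips_per_pod < 0:
        raise ValueError("pod and chip counts must be non-negative")
    return num_pods * chips_per_pod


if __name__ == "__main__":
    # Hypothetical pod counts, for illustration only.
    for pods in (1, 100, 200):
        print(f"{pods:>3} pods -> {total_chips(pods):>6} chips")
```

Under these assumptions, a hundred or more fully populated pods already reaches the "tens of thousands of chips" range described above.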

The next phase of AI innovation with Trillium

More than a decade ago, Google recognized the need for a first-of-its-kind chip for machine learning. In 2013, we began work on the world’s first purpose-built AI accelerator, TPU v1, followed by the first Cloud TPU in 2017. Without TPUs, many of Google’s most popular services (such as real-time voice search, photo object recognition, and interactive language translation), along with state-of-the-art foundation models such as Gemini, Imagen, and Gemma, would not be possible. In fact, the scale and efficiency of TPUs enabled foundational work in Google Research on Transformers, the algorithmic underpinnings of modern generative AI.

4.7X increase in compute performance per Trillium chip

TPUs were designed from the ground up for neural networks, and we’re always working to improve training and serving times for AI workloads. Trillium achieves 4.7X the peak compute per chip of TPU v5e. To achieve this level of performance, we’ve expanded the size of the matrix multiply units (MXUs) and increased the clock speed. Additionally, SparseCores accelerate embedding-heavy workloads by offloading random and fine-grained memory accesses from the TensorCores.

2X ICI bandwidth and 2X High Bandwidth Memory (HBM) capacity and bandwidth

Doubling the HBM capacity and bandwidth allows Trillium to work with larger models with more weights and larger key-value caches. Next-generation HBM enables higher memory bandwidth, improved power efficiency, and a flexible channel architecture to increase memory throughput, which improves training time and serving latency for large models. In effect, Trillium can hold twice the model weights and key-value caches, access them faster, and apply more compute capacity to accelerating ML workloads. Doubling the ICI bandwidth enables training and inference jobs to scale to tens of thousands of chips. That scale comes from combining custom optical ICI interconnects, which link 256 chips in a pod, with Google Jupiter Networking, which extends scalability to hundreds of pods in a cluster.
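As a rough illustration of why HBM capacity matters for key-value caches, the sketch below estimates how many tokens of cache fit next to a model's weights. Every number in it (layer count, head sizes, weight footprint, and the baseline HBM size) is a hypothetical placeholder rather than a published Trillium or TPU v5e specification, and the formula is the standard back-of-the-envelope estimate for a Transformer, not a description of how any Google model is served.

```python
GIB = 1024 ** 3


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token needs across all layers.

    The leading factor of 2 accounts for storing both a key and a value.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(hbm_bytes: int, weight_bytes: int,
                      per_token_bytes: int) -> int:
    """Tokens of KV cache that fit in the HBM left over after the weights."""
    free_bytes = max(hbm_bytes - weight_bytes, 0)
    return free_bytes // per_token_bytes


if __name__ == "__main__":
    # Hypothetical model shape and 16-bit cache values.
    per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8,
                                         head_dim=128)
    weights = 8 * GIB        # hypothetical weight footprint
    baseline_hbm = 16 * GIB  # hypothetical baseline, not a published spec

    for hbm in (baseline_hbm, 2 * baseline_hbm):
        tokens = max_cached_tokens(hbm, weights, per_token)
        print(f"{hbm // GIB} GiB HBM -> {tokens} cached tokens")
```

With the weight footprint held fixed, doubling the capacity in this toy setup more than doubles the room left for the cache, because the weights take up a smaller share of the total.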

Trillium will power the next generation of AI models

Trillium TPUs will power the next wave of AI models and agents, and we’re looking forward to bringing these advanced capabilities to our customers. For example, autonomous vehicle company Nuro is dedicated to creating a better everyday life through robotics by training their models with Cloud TPUs. Deep Genomics is powering the future of drug discovery with AI and looking forward to how their next foundational model, powered by Trillium, will change the lives of patients. And Deloitte, Google Cloud Partner of the Year for AI, will offer Trillium to transform businesses with generative AI. Support for training and serving of long-context, multimodal models on Trillium TPUs will also enable Google DeepMind to train and serve future generations of Gemini models faster, more efficiently, and with lower latency than ever before.