
How Cloud TPU v5e accelerates large-scale AI inference


Google Cloud’s AI-optimized infrastructure makes it possible for businesses to train, fine-tune, and run inference on state-of-the-art AI models faster, at greater scale, and at lower cost. We’re excited to announce the preview of inference on Cloud TPUs. The new Cloud TPU v5e enables high-performance and cost-effective inference for a broad range of AI workloads, including the latest state-of-the-art large language models (LLMs) and generative AI models.

As new models are released and AI becomes more sophisticated, businesses require more powerful and cost-efficient compute options. Google is an AI-first company, so our AI-optimized infrastructure is built to deliver the global scale and performance demanded by Google products like YouTube, Gmail, Google Maps, Google Play, and Android that serve billions of users, as well as by our cloud customers.

LLM and generative AI breakthroughs require enormous amounts of computation to train and serve AI models. We’ve custom-designed, built, and deployed Cloud TPU v5e to cost-efficiently meet this growing computational demand.

Cloud TPU v5e is an excellent choice for accelerating your AI inference workloads:

  • Cost efficient: Up to 2.5x more performance per dollar and up to 1.7x lower latency for inference compared to TPU v4.

  • Scalable: Eight TPU shapes support the full range of LLM and generative AI model sizes, up to 2 trillion parameters.

  • Versatile: Robust AI framework and orchestration support.

In this blog, we’ll dive deeper into how you can leverage TPU v5e effectively for AI inference.

Up to 2.5x more performance per dollar and up to 1.7x lower latency for inference

Each TPU v5e chip provides up to 393 trillion int8 operations per second (TOPS), allowing complex models to make fast predictions. A TPU v5e pod consists of 256 chips networked over ultra-fast links. Each TPU v5e pod delivers up to 100 quadrillion int8 operations per second, or 100 PetaOps, of compute power.
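The pod-level figure follows directly from the per-chip number and the chip count; here is a back-of-the-envelope sketch in Python using only the figures quoted above:

```python
# Back-of-the-envelope check: per-chip peak int8 throughput times chips per pod.
chips_per_pod = 256
tops_per_chip = 393                       # peak int8 TOPS per v5e chip (figure quoted above)

pod_tops = chips_per_pod * tops_per_chip  # 100,608 TOPS
pod_petaops = pod_tops / 1_000            # 1 PetaOp/s = 1,000 TOPS

print(f"{pod_tops:,} TOPS ≈ {pod_petaops:.0f} PetaOps")  # ≈ 101 PetaOps, i.e. roughly 100 PetaOps
```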

We optimized the Cloud TPU inference software stack to take full advantage of this powerful hardware. The inference stack leverages XLA, Google’s AI compiler, which generates highly efficient code for TPUs to maximize performance and efficiency.
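To make the XLA path concrete, here is a minimal JAX sketch (an illustrative assumption, not the serving stack itself) showing how a model function is traced and compiled by XLA on first use, so later calls reuse the compiled TPU executable:

```python
import jax
import jax.numpy as jnp

@jax.jit  # jit-compiled through XLA for the available backend (e.g., a Cloud TPU)
def predict(params, x):
    # A toy single-layer "model"; a real LLM is far larger but compiles the same way.
    w, b = params
    return jnp.dot(x, w) + b

key = jax.random.PRNGKey(0)
params = (jax.random.normal(key, (512, 512)), jnp.zeros((512,)))
x = jnp.ones((8, 512))

y = predict(params, x)   # first call triggers XLA compilation; subsequent calls reuse it
print(y.shape)           # (8, 512)
```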

The combined hardware and software optimizations, including int8 quantization, enable Cloud TPU v5e to achieve up to 2.5x greater inference performance per dollar than Cloud TPU v4 on state-of-the-art LLM and generative AI models, including Llama 2, GPT-3, and Stable Diffusion 2.1.
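For illustration, the basic idea behind int8 quantization fits in a few lines; the sketch below is a simplified symmetric, per-tensor scheme written against plain jax.numpy, not the actual recipe used by the Cloud TPU serving stack:

```python
import jax.numpy as jnp

def quantize_int8(w):
    """Symmetric per-tensor quantization: map float weights onto the int8 range [-127, 127]."""
    scale = jnp.max(jnp.abs(w)) / 127.0
    q = jnp.clip(jnp.round(w / scale), -127, 127).astype(jnp.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover an approximation of the original weights from the int8 codes and the scale."""
    return q.astype(jnp.float32) * scale

w = jnp.array([0.02, -1.5, 0.75, 3.0])
q, scale = quantize_int8(w)
print(q, dequantize_int8(q, scale))  # int8 codes and their float reconstruction
```

Storing weights as int8 halves memory traffic relative to bfloat16 and lets the hardware’s int8 paths deliver the higher TOPS figures cited earlier, at the cost of a small, controlled loss of precision.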

