Accelerating AI Inference with Google Cloud TPUs and GPUs

Customers such as Osmos are using JetStream to accelerate their LLM inference workloads: 

“At Osmos, we’ve developed an AI-powered data transformation engine to help companies scale their business relationships through the automation of data processing. The incoming data from customers and business partners is often messy and non-standard, and needs intelligence applied to every row of data to map, validate, and transform it into good, usable data. To achieve this we need high-performance, scalable, cost-efficient AI infrastructure for training, fine-tuning, and inference. That’s why we chose Cloud TPU v5e with MaxText, JAX, and JetStream for our end-to-end AI workflows. With Google Cloud, we were able to quickly and easily fine-tune Google’s latest Gemma open model on billions of tokens using MaxText and deploy it for inference using JetStream, all on Cloud TPU v5e. Google’s optimized AI hardware and software stack enabled us to achieve results within hours, not days.” – Kirat Pandya, CEO, Osmos

By providing researchers and developers with a powerful, cost-efficient, open-source foundation for LLM inference, we’re powering the next generation of AI applications. Whether you’re a seasoned AI practitioner or just getting started with LLMs, JetStream is here to accelerate your journey and unlock new possibilities in natural language processing. 

Experience the future of LLM inference with JetStream today. Visit our GitHub repository to learn more about JetStream and get started on your next LLM project. We are committed to developing and supporting JetStream over the long term on GitHub and through Google Cloud Customer Care. We are inviting the community to build with us and contribute improvements to further advance the state of the art.

MaxDiffusion: High-performance diffusion model inference

Just as LLMs have revolutionized natural language processing, diffusion models are transforming the field of computer vision. To reduce our customers’ costs of deploying these models, we created MaxDiffusion: a collection of open-source diffusion-model reference implementations. These implementations are written in JAX and are highly performant, scalable, and customizable – think MaxText for computer vision. 

MaxDiffusion provides high-performance implementations of core components of diffusion models such as cross attention, convolutions, and high-throughput image data loading. MaxDiffusion is designed to be highly adaptable and customizable: whether you’re a researcher pushing the boundaries of image generation or a developer seeking to integrate cutting-edge gen AI capabilities into your applications, MaxDiffusion provides the foundation you need to succeed.
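To make the cross-attention building block concrete, here is a minimal, illustrative JAX sketch of the computation. This is a simplified stand-in rather than MaxDiffusion's actual implementation; the function name, shapes, and dimensions are assumptions for illustration.

```python
# Illustrative only: a simplified cross-attention block in JAX.
# Not MaxDiffusion's actual implementation; shapes/names are assumptions.
import jax
import jax.numpy as jnp

def cross_attention(image_tokens, text_tokens, wq, wk, wv):
    """Attend from image latents (queries) to text embeddings (keys/values)."""
    q = image_tokens @ wq                      # [seq_img, d]
    k = text_tokens @ wk                       # [seq_txt, d]
    v = text_tokens @ wv                       # [seq_txt, d]
    scores = q @ k.T / jnp.sqrt(q.shape[-1])   # [seq_img, seq_txt]
    weights = jax.nn.softmax(scores, axis=-1)  # normalize over text tokens
    return weights @ v                         # [seq_img, d]

key = jax.random.PRNGKey(0)
k1, k2, k3, k4, k5 = jax.random.split(key, 5)
d_model = 64
img = jax.random.normal(k1, (256, d_model))   # flattened image latents
txt = jax.random.normal(k2, (77, d_model))    # text-encoder outputs
wq = jax.random.normal(k3, (d_model, d_model))
wk = jax.random.normal(k4, (d_model, d_model))
wv = jax.random.normal(k5, (d_model, d_model))

out = jax.jit(cross_attention)(img, txt, wq, wk, wv)
print(out.shape)  # (256, 64)
```

This is the mechanism by which the text prompt conditions image generation: the image latents query the text embeddings at every denoising step.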

The MaxDiffusion implementation of the new SDXL-Lightning model achieves 6 images/s on Cloud TPU v5e-4, and throughput scales linearly to 12 images/s on Cloud TPU v5e-8, taking full advantage of the high performance and scalability of Cloud TPUs.
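The near-linear scaling follows from data parallelism: each additional TPU chip generates its own batch of images independently, so aggregate throughput grows with chip count (4 chips at 6 images/s becomes 8 chips at 12 images/s). Below is a minimal sketch of that pattern in JAX; `denoise_step` is a placeholder standing in for a real diffusion model, not MaxDiffusion's code.

```python
# Minimal data-parallel inference sketch in JAX. `denoise_step` is a
# placeholder for a real diffusion model's denoising pass. jax.pmap
# replicates the same work across all chips (4 on v5e-4, 8 on v5e-8),
# which is why aggregate images/s scales roughly linearly with chip count.
import jax
import jax.numpy as jnp

def denoise_step(latents):
    # Placeholder computation standing in for a UNet denoising pass.
    return latents - 0.1 * jnp.tanh(latents)

# Replicate the function across all local devices (TPU chips).
parallel_denoise = jax.pmap(denoise_step)

n_devices = jax.local_device_count()
per_chip_batch = 4
# One shard of latents per chip: leading axis indexes devices.
latents = jnp.zeros((n_devices, per_chip_batch, 64, 64, 4))

# Each chip processes its own shard; doubling chips doubles total batch.
latents = parallel_denoise(latents)
print(latents.shape)  # (n_devices, 4, 64, 64, 4)
```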


