
Host your LLMs on Cloud Run


Developers love Cloud Run for its simplicity, fast autoscaling, scale-to-zero capabilities, and pay-per-use pricing. Those same benefits come into play for real-time inference apps serving open gen AI models. That’s why today, we’re adding support for NVIDIA L4 GPUs to Cloud Run, in preview.

This opens the door to many new use cases for Cloud Run developers:

  • Performing real-time inference with lightweight open models such as Google’s open Gemma (2B/7B) models or Meta’s Llama 3 (8B) to build custom chatbots or summarize documents on the fly, while scaling to handle spiky user traffic.

  • Serving custom fine-tuned gen AI models, such as image generation tailored to your company’s brand, and scaling down to optimize costs when nobody’s using them.

  • Speeding up your compute-intensive Cloud Run services, such as on-demand image recognition, video transcoding and streaming, and 3D rendering.

As a fully managed platform, Cloud Run lets you run your code directly on top of Google’s scalable infrastructure, combining the flexibility of containers with the simplicity of serverless to help boost your productivity. With Cloud Run, you can run frontend and backend services and batch jobs, deploy websites and applications, and handle queue processing workloads, all without having to manage the underlying infrastructure.

At the same time, many workloads that perform AI inference, especially applications that demand real-time processing, require GPU acceleration to deliver responsive user experiences. With support for NVIDIA GPUs, you can perform on-demand online AI inference using the LLMs of your choice in seconds. With 24 GB of VRAM, you can expect fast token rates for models with up to 9 billion parameters, including Llama 3.1 (8B), Mistral (7B), and Gemma 2 (9B). When your app is not in use, the service automatically scales down to zero so that you are not charged for it.

“With the addition of NVIDIA L4 Tensor Core GPU and NVIDIA NIM support, Cloud Run provides users a real-time, fast-scaling AI inference platform to help customers accelerate their AI projects and get their solutions to market faster — with minimal infrastructure management overhead.” – Anne Hecht, Senior Director of Product Marketing, NVIDIA

Early customers are excited about the combination of Cloud Run and NVIDIA GPUs.

“Cloud Run’s GPU support has been a game-changer for our real-time inference applications. The low cold-start latency is impressive, allowing our models to serve predictions almost instantly, which is critical for time-sensitive customer experiences. Additionally, Cloud Run GPUs maintain consistently minimal serving latency under varying loads, ensuring our generative AI applications are always responsive and dependable — all while effortlessly scaling to zero during periods of inactivity. Overall, Cloud Run GPUs have significantly enhanced our ability to provide fast, accurate, and efficient results to our end users.” – Thomas MENARD, Head of AI – Global Beauty Tech, L’Oréal

“Cloud Run GPUs are hands-down the best way to consume GPU compute on Google Cloud. I love how it provides a high degree of control and customizability using open-source standards (Knative) as well as great observability tools out of the box, together with fully managed infrastructure that scales to zero. And since we can easily migrate to GKE using Knative primitives, there is always an option to get even more control at the cost of higher complexity and maintenance. GPU allocation and startup times were also faster for our use-case compared to most competing services.” – Alex Bielski, Director of Innovation, Chaptr

Using NVIDIA GPUs on Cloud Run

Today, we support attaching one NVIDIA L4 GPU per Cloud Run instance, and you do not need to reserve your GPUs in advance. To start, Cloud Run GPUs are available today in us-central1 (Iowa), with availability in europe-west4 (Netherlands) and asia-southeast1 (Singapore) expected before the end of the year.

To deploy a Cloud Run service with NVIDIA GPUs, add the --gpu=1 flag to specify the number of GPUs and the --gpu-type=nvidia-l4 flag to specify the type of GPU on the command line. You can also configure this from the Google Cloud console.
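For example, a minimal deployment from the command line might look like the sketch below. The service name and image path are illustrative placeholders, and the CPU, memory, and --no-cpu-throttling settings reflect the minimum resources we believe a GPU-attached instance requires at preview; check the Cloud Run documentation for the current requirements.

  # Deploy a GPU-backed Cloud Run service (beta surface during the preview).
  # "my-llm-service" and the image path are hypothetical placeholders.
  gcloud beta run deploy my-llm-service \
    --image=us-docker.pkg.dev/my-project/my-repo/llm-server:latest \
    --region=us-central1 \
    --gpu=1 \
    --gpu-type=nvidia-l4 \
    --cpu=4 \
    --memory=16Gi \
    --no-cpu-throttling

Once deployed, requests to the service URL spin instances (and their attached GPUs) up on demand, and the service scales back down to zero when traffic stops.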

