
Running a High Throughput OpenAI-Compatible vLLM Inference Server on Modal



In this video, Modal Labs AI Engineer Charles Frye walks through the process of running an auto-scaling, OpenAI-compatible LLM inference server on Modal using vLLM, scaling from 1 to 100 to 1,000 concurrent users and 30,000 tokens per second in just minutes.

Read the guide in the Modal docs: https://modal.com/docs/examples/vllm_inference

Run the code yourself: https://github.com/modal-labs/modal-examples/blob/main/06_gpu_and_ml/llm-serving/vllm_inference.py

Sign up for Modal: https://modal.com/signup
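
Because the server speaks the standard OpenAI API, any OpenAI client can talk to it once it is deployed. Here is a minimal sketch of the client side; the base URL and model name are placeholders, not the exact values used in the video:

    from openai import OpenAI

    # Placeholder URL: substitute the URL Modal prints when you deploy the app.
    client = OpenAI(
        base_url="https://your-workspace--example-vllm-openai-compatible-serve.modal.run/v1",
        api_key="any-string-unless-you-configure-auth",
    )

    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # placeholder model name
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(response.choices[0].message.content)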

00:00 Opener
00:22 Intro and overview
01:01 What is Modal?
01:45 OpenAI-compatible inference services
04:45 Interacting with inference services on Modal
07:30 Defining the environment with Modal Images and Volumes
14:08 Deploying vLLM in OpenAI-compatible mode with FastAPI on Modal
19:06 OpenAPI docs for your OpenAI API
20:05 Load-testing a Modal app with a Modal app
22:52 Auto-scaling to 100 simultaneous users
23:52 What is the load we’re testing?
25:05 Auto-scaling to 1000 simultaneous users
29:54 Load-test results for 1000 users
33:16 Q&A session
43:39 Outro
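
For context on the chapters above, the overall shape of the Modal app is roughly as follows. This is a hedged sketch, not the exact code from the linked example; the GPU choice and model name are placeholders:

    import modal

    # Container image with vLLM installed.
    vllm_image = modal.Image.debian_slim(python_version="3.12").pip_install("vllm")

    # Volume to cache downloaded model weights between containers.
    hf_cache = modal.Volume.from_name("huggingface-cache", create_if_missing=True)

    app = modal.App("example-vllm-openai-compatible")

    @app.function(
        image=vllm_image,
        gpu="H100",  # placeholder GPU type
        volumes={"/root/.cache/huggingface": hf_cache},
    )
    @modal.web_server(port=8000)
    def serve():
        import subprocess

        # Launch vLLM's OpenAI-compatible server inside the container;
        # Modal exposes port 8000 as a public HTTPS endpoint.
        subprocess.Popen(
            "vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0 --port 8000",
            shell=True,
        )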
