In this video, Modal Labs AI Engineer Charles Frye walks through the process of running an auto-scaling OpenAI-compatible LLM inference server on Modal using vLLM, scaling up from 1 to 100 to 1000 concurrent users and 30,000 tokens per second in just minutes.
Read the guide in the Modal docs: https://modal.com/docs/examples/vllm_inference
Run the code yourself: https://github.com/modal-labs/modal-examples/blob/main/06_gpu_and_ml/llm-serving/vllm_inference.py
Sign up for Modal: https://modal.com/signup
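
Once the server is deployed, you can talk to it with the standard OpenAI Python client, since vLLM exposes an OpenAI-compatible API. A minimal sketch (the base_url, model name, and api_key below are placeholders; substitute the URL of your own Modal deployment and the model it serves):

    from openai import OpenAI

    # Point the OpenAI client at the Modal deployment instead of api.openai.com.
    client = OpenAI(
        base_url="https://your-workspace--example-vllm-serve.modal.run/v1",  # placeholder URL
        api_key="placeholder-key",  # only needed if you add auth to the server
    )

    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # whichever model your server loads
        messages=[{"role": "user", "content": "Hello from Modal!"}],
    )
    print(response.choices[0].message.content)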
00:00 Opener
00:22 Intro and overview
01:01 What is Modal?
01:45 OpenAI-compatible inference services
04:45 Interacting with inference services on Modal
07:30 Defining the environment with Modal Images and Volumes
14:08 Deploying vLLM in OpenAI-compatible mode with FastAPI on Modal
19:06 OpenAPI docs for your OpenAI API
20:05 Load-testing a Modal app with a Modal app
22:52 Auto-scaling to 100 simultaneous users
23:52 What is the load we’re testing?
25:05 Auto-scaling to 1000 simultaneous users
29:54 Load-test results for 1000 users
33:16 Q&A session
43:39 Outro