In this video, Modal Labs AI Engineer Charles Frye walks through the process of running an auto-scaling OpenAI-compatible LLM inference server on Modal using vLLM, scaling up from 1 to 100 to 1000 concurrent users and 30,000 tokens per second in just minutes.
Read the guide in the Modal docs: https://modal.com/docs/examples/vllm_inference
Run the code yourself: https://github.com/modal-labs/modal-examples/blob/main/06_gpu_and_ml/llm-serving/vllm_inference.py
Sign up for Modal: https://modal.com/signup
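
Once the server is deployed, you can talk to it with the standard OpenAI Python client, since vLLM exposes an OpenAI-compatible API. A minimal sketch (the base_url, model name, and api_key below are placeholders; substitute the URL of your own Modal deployment and the model it serves):

    from openai import OpenAI

    # Point the OpenAI client at the Modal deployment instead of api.openai.com.
    client = OpenAI(
        base_url="https://your-workspace--example-vllm-serve.modal.run/v1",  # placeholder URL
        api_key="placeholder-key",  # only needed if you add auth to the server
    )

    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # whichever model your server loads
        messages=[{"role": "user", "content": "Hello from Modal!"}],
    )
    print(response.choices[0].message.content)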
00:00 Opener
00:22 Intro and overview
01:01 What is Modal?
01:45 OpenAI-compatible inference services
04:45 Interacting with inference services on Modal
07:30 Defining the environment with Modal Images and Volumes
14:08 Deploying vLLM in OpenAI-compatible mode with FastAPI on Modal
19:06 OpenAPI docs for your OpenAI API
20:05 Load-testing a Modal app with a Modal app
22:52 Auto-scaling to 100 simultaneous users
23:52 What is the load we’re testing?
25:05 Auto-scaling to 1000 simultaneous users
29:54 Load-test results for 1000 users
33:16 Q&A session
43:39 Outro