Large Language Models (LLMs) continue to soar in popularity as a new one is released nearly every week. With the number of these models increasing, so are the options for how we can host them. In my previous article we explored how we could utilize DJL Serving within Amazon SageMaker to efficiently host LLMs. In this article we explore another optimized model server and solution: HuggingFace Text Generation Inference (TGI).
NOTE: For those of you new to AWS, make sure you create an account at the following link if you want to follow along. This article also assumes an intermediate understanding of SageMaker Deployment; I'd suggest following this article for understanding Deployment/Inference more in depth.
DISCLAIMER: I am a Machine Learning Architect at AWS and my opinions are my own.
Why HuggingFace Text Generation Inference? How Does It Work With Amazon SageMaker?
TGI is a Rust, Python, and gRPC model server created by HuggingFace that can be used to host specific large language models. HuggingFace has long been the central hub for NLP, and TGI ships with a large set of optimizations aimed specifically at LLMs; a few are listed below (see the documentation for an extensive list, and the deployment sketch after this list for how they surface in practice).
- Tensor Parallelism for efficient hosting across multiple GPUs
- Token Streaming with Server-Sent Events (SSE)
- Quantization with bitsandbytes
- Logits warpers (different parameters such as temperature, top-k, top-n, etc.)
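To make these features concrete, here is a minimal sketch of what deploying a TGI container on SageMaker can look like with the SageMaker Python SDK. The model ID, container version, instance type, and environment values here are illustrative assumptions, not recommendations; the full walkthrough comes later in the article.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Retrieve the HuggingFace LLM (TGI) container image; version is an assumption
image_uri = get_huggingface_llm_image_uri("huggingface", version="0.8.2")

model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "tiiuae/falcon-7b-instruct",  # example model (assumption)
        "SM_NUM_GPUS": "1",                          # degree of tensor parallelism
        "HF_MODEL_QUANTIZE": "bitsandbytes",         # optional quantization
    },
)

# Deploy to a GPU instance (instance type is an assumption)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
)

# Logits warper settings (temperature, top-k, etc.) are passed per request
response = predictor.predict({
    "inputs": "What is Amazon SageMaker?",
    "parameters": {"temperature": 0.7, "top_k": 50, "max_new_tokens": 128},
})
print(response)
```

Note how the serving-side optimizations (sharding, quantization) are controlled through container environment variables, while the logits warper parameters are supplied per request in the `parameters` field of the payload.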
A big positive of this solution that I noted is its simplicity of use. At the time of writing, TGI supports the following optimized model architectures, which you can deploy directly using the TGI containers.