Deploy Falcon-40B with large model inference DLCs on Amazon SageMaker

Last week, Technology Innovation Institute (TII) launched TII Falcon LLM, an open-source foundational large language model (LLM). Trained on 1 trillion tokens with Amazon SageMaker, Falcon boasts top-notch performance (#1 on the Hugging Face leaderboard at time of writing) while being comparatively lightweight and less expensive to host than other LLMs such as llama-65B. In this post, we demonstrate how to deploy Falcon for applications like language understanding and automated writing assistance using large model inference deep learning containers on SageMaker.

The Falcon has landed on SageMaker

TII is the applied research organization within Abu Dhabi’s Advanced Technology Research Council; its team of scientists, researchers, and engineers is dedicated to the discovery of transformative technologies and development of scientific breakthroughs that will future-proof our society. Earlier this year, TII set out to train a state-of-the-art, open-source LLM and used the infrastructure, tooling, and expertise of SageMaker to get the job done (to learn more about how this model was trained on SageMaker, refer to Technology Innovation Institute trains the state-of-the-art Falcon LLM 40B foundation model on Amazon SageMaker). The result of this effort is TII Falcon LLM.

Trained on 1 trillion tokens, Falcon boasts top-notch performance against the Eleuther AI Language Model Evaluation Harness and is currently #1 on the Hugging Face leaderboard for accuracy. The model is available in two different sizes, Falcon-40B and Falcon-7B, and can be used for state-of-the-art performance in applications such as language understanding, conversational experiences, and automated writing assistance. This post will help you get started with deploying Falcon on SageMaker for high-accuracy inference in these types of domains.

SageMaker large model inference DLCs simplify LLM hosting

Hosting LLMs such as Falcon-40B and Falcon-7B can be challenging. Larger models are often more accurate because they include billions of parameters, but their size can also result in slower inference latency or worse throughput. Hosting an LLM can require more GPU memory and optimized kernels to achieve acceptable performance. To further complicate things, although smaller models such as Falcon-7B can generally fit on a single GPU such as the NVIDIA A10G that powers AWS G5 instance types, larger models like Falcon-40B cannot. When this happens, strategies such as tensor parallelism must be used to shard that larger model into multiple pieces and take advantage of the memory of multiple GPUs. Legacy hosting solutions used for smaller models typically don’t offer this type of functionality, adding to the difficulty.
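To make the sizing argument concrete, here is a rough back-of-the-envelope sketch. It assumes nominal parameter counts of 7 billion and 40 billion, 2 bytes per parameter (fp16/bf16 weights), and 24 GiB of memory per A10G GPU. It counts weights only and ignores activations and the KV cache, so real requirements are higher. The function names are illustrative, not part of any SageMaker or LMI API.

```python
# Rough sizing sketch: weights only, ignoring activations and the KV cache.
GIB = 1024 ** 3
A10G_MEMORY_GIB = 24  # assumed per-GPU memory on G5 instances


def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for model weights, in GiB.

    bytes_per_param=2 assumes fp16/bf16 weights.
    """
    return num_params * bytes_per_param / GIB


def per_gpu_gib(num_params: float, tensor_parallel_degree: int,
                bytes_per_param: int = 2) -> float:
    """Weights held by each GPU when sharded evenly across the given degree."""
    return weight_memory_gib(num_params, bytes_per_param) / tensor_parallel_degree


falcon_7b = weight_memory_gib(7e9)      # nominal parameter count (assumption)
falcon_40b = weight_memory_gib(40e9)    # nominal parameter count (assumption)
falcon_40b_sharded = per_gpu_gib(40e9, tensor_parallel_degree=4)

print(f"Falcon-7B weights:            {falcon_7b:.1f} GiB")
print(f"Falcon-40B weights:           {falcon_40b:.1f} GiB")
print(f"Falcon-40B per GPU (4 shards): {falcon_40b_sharded:.1f} GiB")
print(f"Fits on one A10G? 7B={falcon_7b < A10G_MEMORY_GIB}, "
      f"40B={falcon_40b < A10G_MEMORY_GIB}")
```

Under these assumptions, the Falcon-7B weights fit within a single A10G, the Falcon-40B weights do not, and sharding Falcon-40B four ways brings the per-GPU share back under the 24 GiB limit.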

SageMaker large model inference (LMI) deep learning containers (DLCs) can help. LMI DLCs are a complete end-to-end solution for hosting LLMs like Falcon-40B. At the front end, they include a high-performance model server (DJL Serving) designed for large model inference with features such as token streaming and automatic model replication within an instance to increase throughput. On the backend, LMI DLCs also include several high-performance model parallel engines, such as DeepSpeed and FasterTransformer, that can shard and manage model parameters across multiple GPUs. These engines also include optimized kernels for popular transformer models, which can accelerate inference by up to three times. With LMI DLCs, you simply need to create a configuration file to get started with LLM hosting on SageMaker. To learn more about SageMaker LMI DLCs, refer to Model parallelism and large model inference and our list of available images. You can also check out our previous post about hosting Bloom-175B on SageMaker using LMI DLCs.

Solution overview

This post walks you through how to host Falcon-40B using DeepSpeed on SageMaker using LMI DLCs. Falcon-40B requires that we use multiple A10G GPUs, whereas Falcon-7B only requires a single GPU. We have also prepared examples you can reference to host Falcon-40B and Falcon-7B using both DeepSpeed and Accelerate. You can find our code examples on GitHub.

This example can be run in SageMaker notebook instances or Amazon SageMaker Studio notebooks. For hosting Falcon-40B using LMI and DeepSpeed, we need to use an ml.g5.24xlarge instance. These instances provide 4x NVIDIA A10G GPUs, each with 24 GiB of GPU memory (96 GiB in total). In addition, the host provides 96 vCPUs and 384 GiB of host memory. The LMI container will help manage much of the undifferentiated heavy lifting associated with hosting LLMs, including downloading the model and partitioning the model artifact so that its comprising parameters can be spread across multiple GPUs.

Quotas for SageMaker machine learning (ML) instances can vary between accounts. If you receive an error indicating you’ve exceeded your quota for g5.24xlarge instances while following this post, you can increase the limit through the Service Quotas console.

Notebook walkthrough

To begin, we install and import the necessary dependencies for our example. We use the Boto3 SDK as well as the SageMaker SDK. Note that we use Amazon Simple Storage Service (Amazon S3) to store the model artifacts that we need for SageMaker and LMI to use, so we set up an S3 prefix variable accordingly. See the following code:

import sagemaker
import jinja2
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path
from sagemaker.utils import name_from_base

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
model_bucket = sess.default_bucket()  # bucket to house artifacts
s3_code_prefix_deepspeed = "hf-large-model-djl-/code_falcon40b/deepspeed"  # folder within bucket where code artifact will go
region = sess._region_name
account_id = sess.account_id()
s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")
jinja_env = jinja2.Environment()

We then create a local folder for our workspace to store our model artifacts:

!mkdir -p code_falcon40b_deepspeed

We first create a configuration file in the local directory we created. This file indicates to the LMI container and the front-end DJL Serving library which model parallelization and inference optimization engine we want to use. You can find the configuration options for both DeepSpeed and Hugging Face Accelerate in Configurations and settings. Here, note that we set the option.model_id parameter to define which Hugging Face model to pull from. SageMaker makes working with Hugging Face models simple, and this one line is all you need. In addition, we set option.tensor_parallel_degree to a value of 4 because we have four GPUs on our ml.g5.24xlarge instance. This parameter defines how many partitions of the model to create and distribute. Note that if we had used a larger instance with eight GPUs, such as ml.g5.48xlarge, and still set a value of 4, then LMI would automatically create two replicas of the model (two replicas spread across four GPUs each). See the following code:

%%writefile ./code_falcon40b_deepspeed/
#to deploy falcon-40b-instruct set the model_id value to 'tiiuae/falcon-40b-instruct'
#option.s3url = {{s3url}}

You can also swap out tiiuae/falcon-40b with tiiuae/falcon-40b-instruct if it suits your needs better.
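As an illustration of the two settings discussed above, the following sketch renders them as key=value lines and computes how many model replicas fit on an instance. It assumes the configuration uses a key=value properties format and covers only option.model_id and option.tensor_parallel_degree; a real configuration file typically contains additional entries that are not reproduced here. Both helper functions are hypothetical and not part of the LMI or DJL Serving APIs.

```python
def render_lmi_options(model_id: str, tensor_parallel_degree: int) -> str:
    """Render the two options named in the text as key=value lines.

    Hypothetical helper: a complete configuration file usually needs more
    entries than these two.
    """
    lines = [
        f"option.model_id={model_id}",
        f"option.tensor_parallel_degree={tensor_parallel_degree}",
    ]
    return "\n".join(lines) + "\n"


def replica_count(num_gpus: int, tensor_parallel_degree: int) -> int:
    """Number of full model copies that fit when each copy spans
    tensor_parallel_degree GPUs."""
    if tensor_parallel_degree < 1:
        raise ValueError("tensor_parallel_degree must be at least 1")
    if num_gpus < tensor_parallel_degree:
        raise ValueError("not enough GPUs for the requested degree")
    return num_gpus // tensor_parallel_degree


print(render_lmi_options("tiiuae/falcon-40b", 4), end="")
print("ml.g5.24xlarge (4 GPUs):", replica_count(4, 4), "replica(s)")
print("ml.g5.48xlarge (8 GPUs):", replica_count(8, 4), "replica(s)")
```

With a tensor parallel degree of 4, the four-GPU instance holds one copy of the model and the eight-GPU instance holds two, matching the replica behavior described above.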

We also include a requirements.txt file in which you can specify the packages that you need installed:

%%writefile ./code_falcon40b_deepspeed/requirements.txt

The last thing we need is the inference handler file that will be used with your model:

%%writefile ./code_falcon40b_deepspeed/
from djl_python import Input, Output
import os
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from typing import Any, Dict, Tuple
import warnings

predictor = None  # text-generation pipeline, created lazily on the first request


def get_model(properties):
    model_name = properties["model_id"]
    local_rank = int(os.getenv("LOCAL_RANK", "0"))
    # The arguments after model_name are missing from the source; the values
    # below are assumptions, not the authors' confirmed settings.
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    generator = pipeline(
        task="text-generation", model=model, tokenizer=tokenizer, device_map="auto"
    )
    return generator


def handle(inputs: Input) -> None:
    global predictor
    if not predictor:
        predictor = get_model(inputs.get_properties())
    if inputs.is_empty():
        # Model server makes an empty call to warmup the model on startup
        return None
    data = inputs.get_as_json()
    text = data["text"]
    text_length = data["text_length"]
    outputs = predictor(text, do_sample=True, min_length=text_length, max_length=text_length)
    result = {"outputs": outputs}
    return Output().add_as_json(result)

That’s it! At this point, we’ve created all the artifacts you need to deploy Falcon-40B with DeepSpeed. We package the directory into a *.tar.gz file and upload it to Amazon S3. Note that the actual model has not been downloaded or packaged into this file. The LMI container will download the model for you from Hugging Face directly. You also have the option to target an S3 bucket if you would like your own copy of the model in a location that will be more performant to download. LMI also includes optimization for downloading from Amazon S3 with high performance. See the following code:

s3_code_artifact_deepspeed = sess.upload_data("model.tar.gz", bucket, s3_code_prefix_deepspeed)
print(f"S3 Code or Model tar for deepspeed uploaded to --- > {s3_code_artifact_deepspeed}")

All that’s left to do at this point is to define the container we want to use and create a model object:
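The upload call expects the archive to already exist locally, but the packaging step itself is not shown here. A minimal sketch using Python's standard tarfile module follows. It assumes the archive should contain the workspace directory as its top-level entry; the function name package_directory is hypothetical, and the archive layout your container expects should be confirmed against the LMI documentation.

```python
import tarfile
from pathlib import Path


def package_directory(source_dir: str, archive_path: str = "model.tar.gz") -> str:
    """Create a gzip-compressed tarball containing source_dir.

    Assumption: the directory itself is stored as the top-level entry
    in the archive.
    """
    source = Path(source_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"{source_dir} is not a directory")
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps paths relative to the directory name instead of
        # embedding the full local path in the archive.
        tar.add(str(source), arcname=source.name)
    return archive_path


# Example usage (run before the upload step):
# package_directory("code_falcon40b_deepspeed", "model.tar.gz")
```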

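# The image URI is missing from the source; substitute an LMI DLC image URI
# for your Region from the list of available images.
inference_image_uri = "<LMI DLC image URI>"
model_name_acc = name_from_base("falcon40b-model-ds")
create_model_response = sm_client.create_model(
    ModelName=model_name_acc,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": inference_image_uri, "ModelDataUrl": s3_code_artifact_deepspeed},
)
model_arn = create_model_response["ModelArn"]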

Then we create an endpoint configuration and create the endpoint:

endpoint_config_name = f"{model_name_acc}-config"
endpoint_name = f"{model_name_acc}-endpoint"
endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name_acc,
            "InstanceType": "ml.g5.24xlarge",
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
            # "VolumeSizeInGB": 512
        },
    ],
)

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")
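Endpoint creation is asynchronous, so the call above returns before the endpoint can serve traffic. One way to wait for it is a simple polling loop around describe_endpoint, sketched below. The function name wait_for_endpoint, the polling interval, and the overall timeout are illustrative assumptions. Boto3 also provides a built-in endpoint_in_service waiter that could be used instead.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=60,
                      timeout_seconds=2 * 3600, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint is InService.

    Raises RuntimeError if creation fails and TimeoutError if the endpoint
    is still not ready after timeout_seconds. The sleep argument is
    injectable so the loop can be exercised without real delays.
    """
    waited = 0
    while True:
        description = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = description.get("FailureReason", "unknown reason")
            raise RuntimeError(f"Endpoint creation failed: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Example usage with the client and name defined earlier:
# wait_for_endpoint(sm_client, endpoint_name)
```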

Configuration items to keep in mind for successful hosting

An important consideration for large model hosting is ensuring there is adequate time for model download from Hugging Face. In our tests, the Falcon-40B took about 90 minutes to download onto the instance. A key set of configurations to allow for this are ContainerStartupHealthCheckTimeoutInSeconds and ModelDataDownloadTimeoutInSeconds. Make sure the SageMaker endpoint configuration has a value of 3600 for each of these. Additionally, it’s much easier to download from Amazon S3 instead of the original model zoo: the LMI containers, which are specially designed for LLMs, use the s5cmd utility, which cuts the model download time to around 10 minutes.

You can monitor the status of the endpoint by calling DescribeEndpoint, which will tell you when everything is complete. Your endpoint is now ready to respond to inference requests! Because LMI handles the model partitioning and orchestration for you, each request will be processed using all 4 GPUs available on our ml.g5.24xlarge instance. This allows us to host LLMs and increase performance if you scale GPU accelerators horizontally. See the following code:

response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps({"text": "What is the purpose of life?", "text_length": 150}),
    ContentType="application/json",
)
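The handler shown earlier wraps the pipeline result in an "outputs" key. Assuming the Hugging Face text-generation pipeline returns a list of dictionaries that each carry a "generated_text" field, the response body could be unpacked as in the sketch below. The helper name extract_generated_text is hypothetical.

```python
import json


def extract_generated_text(body: bytes) -> list:
    """Return the generated strings from a response body of the form
    {"outputs": [{"generated_text": "..."}, ...]}.

    Assumption: the payload shape matches what the handler returns.
    """
    payload = json.loads(body.decode("utf-8"))
    return [item["generated_text"] for item in payload.get("outputs", [])]


# Example usage with the response from invoke_endpoint:
# texts = extract_generated_text(response_model["Body"].read())
# print(texts[0])
```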


If you are done and would like to delete the endpoint configuration, endpoint, and model object, you can run the following commands:
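The cleanup commands are not reproduced here. A minimal sketch using the Boto3 SageMaker client's delete_endpoint, delete_endpoint_config, and delete_model calls follows. The wrapper function delete_endpoint_resources is hypothetical, and the resource names are assumed to be the ones created earlier in the walkthrough.

```python
def delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, its configuration, and the model object.

    The endpoint is deleted first so that nothing is still serving from
    the configuration and model when they are removed.
    """
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)


# Example usage with the names defined earlier:
# delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name_acc)
```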


The code referenced in this post can be found in the complete notebook on GitHub.


SageMaker Hosting and the LMI DLC make it easy for you to host LLMs like Falcon-40B. They take on the undifferentiated heavy lifting in orchestrating what is required to host models across multiple GPUs and provide configurable options to suit your needs. In addition, using Hugging Face models becomes very straightforward, with built-in support for these models.

In this post, we showed how you can use SageMaker to host the Falcon-40B model using DeepSpeed. In addition, we provided examples in GitHub to host Falcon-40B using Accelerate, and the smaller Falcon-7B models. We encourage you to give this a try on SageMaker with LMI and get hands-on with the best-performing publicly available LLM to date!

About the authors

James Park is a Solutions Architect at Amazon Web Services. He works to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Abhi Shivaditya is a Senior Solutions Architect at AWS, working with strategic global enterprise organizations to facilitate the adoption of AWS services in areas such as Artificial Intelligence, distributed computing, networking, and storage. His expertise lies in Deep Learning in the domains of Natural Language Processing (NLP) and Computer Vision. Abhi assists customers in deploying high-performance machine learning models efficiently within the AWS ecosystem.

Robert Van Dusen is a Senior Product Manager with Amazon SageMaker. He leads deep learning model optimization for applications such as large model inference.

Evandro Franco is an AI/ML Specialist Solutions Architect working on Amazon Web Services. He helps AWS customers overcome business challenges related to AI/ML on top of AWS. He has more than 15 years of experience working with technology, from software development, infrastructure, and serverless to machine learning.

Qing Lan is a Software Development Engineer in AWS. He has been working on several challenging products in Amazon, including high-performance ML inference solutions and a high-performance logging system. Qing’s team successfully launched the first billion-parameter model in Amazon Advertising with very low latency required. Qing has in-depth knowledge of infrastructure optimization and Deep Learning acceleration.

Frank Liu is a Software Engineer for AWS Deep Learning. He focuses on building innovative deep learning tools for software engineers and scientists. In his spare time, he enjoys hiking with friends and family.
