Maximize Stable Diffusion performance and lower inference costs with AWS Inferentia2

Generative AI models have been experiencing rapid growth in recent months due to their impressive capabilities in creating realistic text, images, code, and audio. Among these models, Stable Diffusion models stand out for their unique strength in creating high-quality images based on text prompts. Stable Diffusion can generate a wide variety of high-quality images, including realistic portraits, landscapes, and even abstract art. And, like other generative AI models, Stable Diffusion models require powerful computing to provide low-latency inference.

In this post, we show how you can run Stable Diffusion models and achieve high performance at the lowest cost in Amazon Elastic Compute Cloud (Amazon EC2) using Amazon EC2 Inf2 instances powered by AWS Inferentia2. We look at the architecture of a Stable Diffusion model and walk through the steps of compiling a Stable Diffusion model using AWS Neuron and deploying it to an Inf2 instance. We also discuss the optimizations that the Neuron SDK automatically makes to improve performance. You can run both Stable Diffusion 2.1 and 1.5 versions on AWS Inferentia2 cost-effectively. Lastly, we show how you can deploy a Stable Diffusion model to an Inf2 instance with Amazon SageMaker.

The Stable Diffusion 2.1 model size is 5 GB in floating point 32 (FP32) and 2.5 GB in bfloat16 (BF16). A single inf2.xlarge instance has one AWS Inferentia2 accelerator with 32 GB of HBM memory, so the Stable Diffusion 2.1 model can fit on a single inf2.xlarge instance. Stable Diffusion is a text-to-image model that you can use to create images of different styles and content simply by providing a text prompt as an input. To learn more about the Stable Diffusion model architecture, refer to Create high-quality images with Stable Diffusion models and deploy them cost-efficiently with Amazon SageMaker.
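As a quick back-of-the-envelope check of these sizes, the following sketch estimates the weight footprint from a parameter count and the number of bytes per element. The parameter count of 1.25 billion is an assumption, chosen only so that the FP32 figure comes out at about 5 GB, and the helper name weights_size_gb is illustrative.

```python
# Rough estimate of model weight size by data type.
# ASSUMPTION: ~1.25 billion parameters, picked so FP32 comes to about 5 GB.
BYTES_PER_ELEMENT = {"fp32": 4, "bf16": 2}
HBM_GB = 32  # accelerator memory on one inf2.xlarge, as stated in the text


def weights_size_gb(num_params: int, dtype: str) -> float:
    """Return the approximate weight size in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_ELEMENT[dtype] / 1e9


num_params = 1_250_000_000
for dtype in ("fp32", "bf16"):
    size = weights_size_gb(num_params, dtype)
    print(f"{dtype}: {size:.1f} GB, fits in HBM: {size < HBM_GB}")
```

Halving the bytes per element halves the footprint, which is why the BF16 model needs roughly half the memory of the FP32 one.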

How the Neuron SDK optimizes Stable Diffusion performance

Before we can deploy the Stable Diffusion 2.1 model on AWS Inferentia2 instances, we need to compile the model components using the Neuron SDK. The Neuron SDK, which includes a deep learning compiler, runtime, and tools, compiles and automatically optimizes deep learning models so they can run efficiently on Inf2 instances and extract the full performance of the AWS Inferentia2 accelerator. We have examples available for the Stable Diffusion 2.1 model on the GitHub repo. This notebook presents an end-to-end example of how to compile a Stable Diffusion model, save the compiled Neuron models, and load them into the runtime for inference.

We use StableDiffusionPipeline from the Hugging Face diffusers library to load and compile the model. We then compile all the components of the model for Neuron using torch_neuronx.trace() and save the optimized model as TorchScript. Compilation can be quite memory-intensive, requiring a significant amount of RAM. To work around this, before tracing each model, we create a deepcopy of the part of the pipeline that's being traced. Following this, we delete the pipeline object from memory using del pipe. This technique is particularly useful when compiling on instances with low RAM.
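The copy-then-delete pattern can be illustrated without any deep learning libraries. The following is a minimal sketch in which Component and Pipeline are hypothetical stand-ins (not diffusers classes); it copies one component, drops the full pipeline, and uses a weak reference to confirm that the pipeline object was released.

```python
import copy
import gc
import weakref


class Component:
    """Stand-in for one sub-model (hypothetical, not a diffusers class)."""

    def __init__(self, name: str, weights: list):
        self.name = name
        self.weights = weights


class Pipeline:
    """Stand-in for a pipeline that holds several large components."""

    def __init__(self):
        self.text_encoder = Component("text_encoder", [0.1] * 1000)
        self.unet = Component("unet", [0.2] * 100_000)


pipe = Pipeline()
pipe_ref = weakref.ref(pipe)  # lets us observe when the pipeline is freed

# Copy only the part we are about to trace, then drop the whole pipeline.
text_encoder = copy.deepcopy(pipe.text_encoder)
del pipe
gc.collect()

pipeline_freed = pipe_ref() is None
print(text_encoder.name, len(text_encoder.weights), pipeline_freed)
```

Because the copy shares nothing with the original, the large components that are not being traced no longer occupy memory during compilation.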

Additionally, we also perform optimizations to the Stable Diffusion models. UNet holds the most computationally intensive aspect of the inference. The UNet component operates on input tensors that have a batch size of two, generating a corresponding output tensor also with a batch size of two, to produce a single image. The elements within these batches are entirely independent of each other. We can take advantage of this behavior to get optimal latency by running one batch on each Neuron core. We compile the UNet for one batch (by using input tensors with one batch), then use the torch_neuronx.DataParallel API to load this single-batch model onto each core. The output of this API is a seamless two-batch module: we can pass to the UNet the inputs of two batches, and a two-batch output is returned, but internally, the two single-batch models are running on the two Neuron cores. This strategy optimizes resource utilization and reduces latency.
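To make the idea concrete, here is a toy analogue of the data-parallel wrapper written with only the Python standard library. It is not the torch_neuronx.DataParallel implementation: single_batch_model and TwoCoreDataParallel are hypothetical names, and worker threads stand in for Neuron cores. It only shows how a model built for one batch element can be presented as a two-batch module.

```python
from concurrent.futures import ThreadPoolExecutor


def single_batch_model(sample: list) -> list:
    """Stand-in for a model compiled for batch size 1 (hypothetical)."""
    return [2.0 * x for x in sample]


class TwoCoreDataParallel:
    """Toy analogue of a data-parallel wrapper: one model replica per core."""

    def __init__(self, model, num_cores: int = 2):
        self.model = model
        self.num_cores = num_cores
        self.pool = ThreadPoolExecutor(max_workers=num_cores)

    def __call__(self, batch: list) -> list:
        if len(batch) != self.num_cores:
            raise ValueError("expected one batch element per core")
        # Each element is independent, so each goes to its own worker.
        # map() preserves input order, so outputs line up with inputs.
        return list(self.pool.map(self.model, batch))


unet = TwoCoreDataParallel(single_batch_model)
two_batch_input = [[1.0, 2.0], [3.0, 4.0]]
two_batch_output = unet(two_batch_input)
print(two_batch_output)
```

The caller sees a single callable that accepts and returns two batch elements, while the work is split across two workers behind the scenes.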

Compile and deploy a Stable Diffusion model on an Inf2 EC2 instance

To compile and deploy the Stable Diffusion model on an Inf2 EC2 instance, sign in to the AWS Management Console and create an inf2.8xlarge instance. Note that an inf2.8xlarge instance is required only for the compilation of the model because compilation requires more host memory. The Stable Diffusion model can be hosted on an inf2.xlarge instance. You can find the latest AMI with Neuron libraries using the following AWS Command Line Interface (AWS CLI) command:

aws ec2 describe-images --region us-east-1 --owners amazon \
--filters 'Name=name,Values=Deep Learning AMI Neuron PyTorch 1.13.? (Amazon Linux 2) ????????' 'Name=state,Values=available' \
--query 'reverse(sort_by(Images, &CreationDate))[:1].ImageId' \
--output text
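The --query expression sorts the returned images by creation date, reverses the order, and keeps the ID of the newest one. The following sketch mirrors that selection logic in plain Python. The sample records and AMI IDs are hypothetical, and latest_image_ids is an illustrative helper rather than part of any AWS SDK.

```python
# Hypothetical sample shaped like the "Images" list returned by describe-images.
images = [
    {"ImageId": "ami-aaaa", "CreationDate": "2023-05-02T10:00:00.000Z"},
    {"ImageId": "ami-cccc", "CreationDate": "2023-07-11T08:30:00.000Z"},
    {"ImageId": "ami-bbbb", "CreationDate": "2023-06-20T12:15:00.000Z"},
]


def latest_image_ids(images: list, count: int = 1) -> list:
    """Mirror reverse(sort_by(Images, &CreationDate))[:count].ImageId."""
    # ISO 8601 timestamps in the same format sort correctly as strings.
    newest_first = sorted(images, key=lambda img: img["CreationDate"], reverse=True)
    return [img["ImageId"] for img in newest_first[:count]]


print(latest_image_ids(images))
```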

For this example, we created an EC2 instance using the Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04). You can then create a JupyterLab environment by connecting to the instance and running the following steps:

source /opt/aws_neuron_venv_pytorch/bin/activate
pip install jupyterlab

A notebook with all the steps for compiling and hosting the model is located on GitHub.

Let's look at the compilation steps for one of the text encoder blocks. Other blocks that are part of the Stable Diffusion pipeline can be compiled similarly.

The first step is to load the pre-trained model from Hugging Face. The StableDiffusionPipeline.from_pretrained method loads the pre-trained model into our pipeline object, pipe. We then create a deepcopy of the text encoder from our pipeline, effectively cloning it. The del pipe command is then used to delete the original pipeline object, freeing up the memory that was consumed by it. Here, we're quantizing the model to BF16 weights:

import copy
import torch
from diffusers import StableDiffusionPipeline

model_id = "stabilityai/stable-diffusion-2-1-base"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
text_encoder = copy.deepcopy(pipe.text_encoder)
del pipe

This step involves wrapping our text encoder with the NeuronTextEncoder wrapper. The output of a compiled text encoder module is of type dict. We convert it to a list type using this wrapper:

text_encoder = NeuronTextEncoder(text_encoder)
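The wrapper's job can be sketched in plain Python. This is not the notebook's NeuronTextEncoder class: CompiledEncoderStub and TextEncoderWrapper are hypothetical stand-ins, and the "last_hidden_state" key is an assumption about what the pipeline needs from the encoder output.

```python
class CompiledEncoderStub:
    """Stand-in for a compiled text encoder that returns a dict (hypothetical)."""

    def __call__(self, token_ids: list) -> dict:
        return {"last_hidden_state": [float(t) for t in token_ids]}


class TextEncoderWrapper:
    """Toy wrapper: exposes the dict output as a list, as the text describes."""

    def __init__(self, encoder):
        self.neuron_text_encoder = encoder

    def __call__(self, token_ids: list) -> list:
        output = self.neuron_text_encoder(token_ids)
        # ASSUMPTION: the caller only needs the "last_hidden_state" entry.
        return [output["last_hidden_state"]]


wrapped = TextEncoderWrapper(CompiledEncoderStub())
result = wrapped([1, 2, 3])
print(type(result).__name__, result)
```

Keeping the inner module as an attribute means a compiled replacement can later be swapped in without changing the calling code.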

We initialize the PyTorch tensor emb with some values. The emb tensor is used as example input for the torch_neuronx.trace function. This function traces our text encoder and compiles it into a format optimized for Neuron. The directory path for the compiled model is constructed by joining COMPILER_WORKDIR_ROOT with the subdirectory text_encoder:

emb = torch.tensor([...])
text_encoder_neuron = torch_neuronx.trace(
        text_encoder.neuron_text_encoder,
        emb,
        compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder'),
        )

The compiled text encoder is then saved under a file name in the text_encoder directory of our compiler's workspace:

text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/')
torch.jit.save(text_encoder_neuron, text_encoder_filename)

The notebook includes similar steps to compile other components of the model: UNet, VAE decoder, and VAE post_quant_conv. After you have compiled all the models, you can load and run the model following these steps:

  1. Define the paths for the compiled models.
  2. Load a pre-trained StableDiffusionPipeline model, with its configuration specified to use the bfloat16 data type.
  3. Load the UNet model onto two Neuron cores using the torch_neuronx.DataParallel API. This allows data parallel inference to be performed, which can significantly speed up model performance.
  4. Load the remaining parts of the model (text_encoder, decoder, and post_quant_conv) onto a single Neuron core.

You can then run the pipeline by providing input text as prompts. The following are some pictures generated by the model for the prompts:

  • Portrait of renaud sechan, pen and ink, intricate line drawings, by craig mullins, ruan jia, kentaro miura, greg rutkowski, loundraw

  • Portrait of old coal miner in 19th century, beautiful painting, with highly detailed face painting by greg rutkowski

  • A castle in the middle of a forest

Host Stable Diffusion 2.1 on AWS Inferentia2 and SageMaker

Hosting Stable Diffusion models with SageMaker also requires compilation with the Neuron SDK. You can complete the compilation ahead of time or during runtime using Large Model Inference (LMI) containers. Compiling ahead of time allows for faster model loading times and is the preferred option.

SageMaker LMI containers provide two ways to deploy the model:

  • A no-code option where we just provide a file with the required configurations
  • Bring your own inference script

We look at both solutions and go over the configurations and the inference script. In this post, we demonstrate the deployment using a pre-compiled model stored in an Amazon Simple Storage Service (Amazon S3) bucket. You can use this pre-compiled model for your deployments.

Configure the model with a provided script

In this section, we show how to configure the LMI container to host the Stable Diffusion models. The SD2.1 notebook is available on GitHub. The first step is to create the model configuration package per the following directory structure. Our aim is to use the minimal model configurations needed to host the model. The directory structure needed is as follows:

<config-root-directory> / 
    └── [OPTIONAL]

Next, we create the file with the following parameters:

%%writefile code_sd/

The parameters specify the following:

  • option.model_id – The LMI containers use s5cmd to load the model from the S3 location, so we need to specify the location of our compiled weights.
  • option.entryPoint – To use the built-in handlers, we specify the transformers-neuronx class. If you have a custom inference script, you need to provide that instead.
  • option.dtype – This specifies the data type in which to load the weights. For this post, we use BF16, which further reduces our memory requirements compared to FP32 and lowers our latency as a result.
  • option.tensor_parallel_degree – This parameter specifies the number of accelerators we use for this model. The AWS Inferentia2 chip accelerator has two Neuron cores, so specifying a value of 2 means we use one accelerator (two cores). This means we can now create multiple workers to increase the throughput of the endpoint.
  • option.engine – This is set to Python to indicate we will not be using other compilers like DeepSpeed or Faster Transformer for this hosting.
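As an illustration of how these parameters fit together, the following sketch renders them as key=value lines, the usual layout of a properties-style configuration file. All values are placeholders: the S3 path and the exact entry point string are assumptions, and the key names simply follow the list above, so check the LMI container documentation for the exact spelling before using them.

```python
# Placeholder values only: the S3 path and the exact entry point string are
# assumptions for illustration, not values taken from the original notebook.
config = {
    "option.engine": "Python",
    "option.entryPoint": "djl_python.transformers-neuronx",
    "option.model_id": "s3://example-bucket/stable-diffusion/compiled/",
    "option.dtype": "bf16",
    "option.tensor_parallel_degree": 2,
}


def render_properties(config: dict) -> str:
    """Render a dict as newline-separated key=value pairs."""
    return "\n".join(f"{key}={value}" for key, value in config.items())


properties_text = render_properties(config)
print(properties_text)
```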

Bring your own script

If you want to bring your own custom inference script, you need to remove option.entryPoint from the configuration file. The LMI container in that case will look for a script file in the same location as the configuration file and use that to run the inferencing.

Create your own inference script

Creating your own inference script is relatively straightforward using the LMI container. The container requires your file to have an implementation of the following method:

def handle(inputs: Input) which returns an object of type Outputs

Let's examine some of the important areas of the attached notebook, which demonstrates the bring your own script function.

Replace the cross_attention module with the optimized version:

    # Replace original cross-attention module with custom cross-attention module for better performance
    CrossAttention.get_attention_scores = get_attention_scores

Load the compiled weights for the following:

text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, '')
decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, '')
unet_filename = os.path.join(COMPILER_WORKDIR_ROOT, '')
post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, '')

These are the names of the compiled weights files we used when creating the compilations. Feel free to change the file names, but make sure that your weights file names match what you specify here.

Then we need to load them using the Neuron SDK and set these in the actual model weights. When loading the UNet optimized weights, note that we also specify the number of Neuron cores we need to load these onto. Here, we load to a single accelerator with two cores:

    # Load the compiled UNet onto two neuron cores.
    pipe.unet = NeuronUNet(UNetWrap(pipe.unet))
    # Loading model: unet:created
    device_ids = [idx for idx in range(tensor_parallel_degree)]
    pipe.unet.unetwrap = torch_neuronx.DataParallel(torch.jit.load(unet_filename), device_ids, set_dynamic_batching=False)
    # Load other compiled models onto a single neuron core.
    # - load encoders
    pipe.text_encoder = NeuronTextEncoder(pipe.text_encoder)
    clip_compiled = torch.jit.load(text_encoder_filename)
    pipe.text_encoder.neuron_text_encoder = clip_compiled
    #- load decoders
    pipe.vae.decoder = torch.jit.load(decoder_filename)
    pipe.vae.post_quant_conv = torch.jit.load(post_quant_conv_filename)

Running the inference with a prompt invokes the pipe object to generate an image.

Create the SageMaker endpoint

We use Boto3 APIs to create a SageMaker endpoint. Complete the following steps:

  1. Create the tarball with just the serving configuration and the optional inference script files and upload it to Amazon S3.
  2. Create the model using the image container and the model tarball uploaded earlier.
  3. Create the endpoint config using the following key parameters:
    1. Use an ml.inf2.xlarge instance.
    2. Set ContainerStartupHealthCheckTimeoutInSeconds to 240 to ensure the health check starts after the model is deployed.
    3. Set VolumeSizeInGB to a larger value so it can be used for loading the model weights that are 32 GB in size.

Create a SageMaker model

After you create the model.tar.gz file and upload it to Amazon S3, you need to create a SageMaker model. We use the LMI container and the model artifact from the previous step to create the SageMaker model. SageMaker allows us to customize and inject various environment variables. For this workflow, we can leave everything as default. See the following code:

inference_image_uri = (
    f"763104351884.dkr.ecr.{region} djl-serving-inf2"
)

Create the model object, which essentially creates a lockdown container that is loaded onto the instance and used for inferencing:

model_name = name_from_base("inf2-sd")
create_model_response = boto3_sm_client.create_model(
    ModelName=model_name,
    PrimaryContainer={"Image": inference_image_uri, "ModelDataUrl": s3_code_artifact},
)

Create a SageMaker endpoint

In this demo, we use an ml.inf2.xlarge instance. We need to set the VolumeSizeInGB parameter to provide the necessary disk space to load the model and the weights. This parameter is applicable to instances supporting the Amazon Elastic Block Store (Amazon EBS) volume attachment. We can set the model download timeout and container startup health check to a higher value, which gives adequate time for the container to pull the weights from Amazon S3 and load them into the AWS Inferentia2 accelerators. For more details, refer to CreateEndpointConfig.

endpoint_config_response = boto3_sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.inf2.xlarge",
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": 360,
            "VolumeSizeInGB": 400,
        },
    ],
)

Lastly, we create a SageMaker endpoint:

create_endpoint_response = boto3_sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)

Invoke the model endpoint

This is a generative model, so we pass in the prompt that the model uses to generate the image. The payload is of type JSON:

import json

response_model = boto3_sm_run_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps({"prompt": "Mountain Panorama", "parameters": {}}),
    ContentType="application/json",
)

Benchmarking the Stable Diffusion model on Inf2

We ran a few tests to benchmark the Stable Diffusion model with the BF16 data type on Inf2, and we are able to derive latency numbers that rival or exceed some of the other accelerators for Stable Diffusion. This, coupled with the lower cost of AWS Inferentia2 chips, makes this an extremely valuable proposition.

The following numbers are from the Stable Diffusion model deployed on an inf2.xl instance. For more information about costs, refer to Amazon EC2 Inf2 Instances.

Model                 Resolution  Data type  Iterations  P95 latency (ms)  Inf2.xl On-Demand cost per hour  Inf2.xl cost per image
Stable Diffusion 1.5  512×512     bf16       50          2,427.4           $0.76                            $0.0005125
Stable Diffusion 1.5  768×768     bf16       50          8,235.9           $0.76                            $0.0017387
Stable Diffusion 1.5  512×512     bf16       30          1,456.5           $0.76                            $0.0003075
Stable Diffusion 1.5  768×768     bf16       30          4,941.6           $0.76                            $0.0010432
Stable Diffusion 2.1  512×512     bf16       50          1,976.9           $0.76                            $0.0004174
Stable Diffusion 2.1  768×768     bf16       50          6,836.3           $0.76                            $0.0014432
Stable Diffusion 2.1  512×512     bf16       30          1,186.2           $0.76                            $0.0002504
Stable Diffusion 2.1  768×768     bf16       30          4,101.8           $0.76                            $0.0008659
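The cost-per-image column appears to follow directly from the latency and the hourly price. The following sketch reproduces that arithmetic, assuming the instance generates images back to back and that the P95 latency is used as the time per image. The function name cost_per_image is illustrative.

```python
def cost_per_image(latency_ms: float, price_per_hour: float) -> float:
    """Cost of one image if the instance generates images back to back."""
    seconds_per_image = latency_ms / 1000.0
    price_per_second = price_per_hour / 3600.0
    return seconds_per_image * price_per_second


# First row of the table: Stable Diffusion 1.5, 512x512, 50 iterations.
cost = cost_per_image(latency_ms=2427.4, price_per_hour=0.76)
print(f"${cost:.7f} per image")
```

The same function can be applied to the other rows by substituting their latency values.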


In this post, we dove deep into the compilation, optimization, and deployment of the Stable Diffusion 2.1 model using Inf2 instances. We also demonstrated deployment of Stable Diffusion models using SageMaker. Inf2 instances also deliver great price performance for Stable Diffusion 1.5. To learn more about why Inf2 instances are great for generative AI and large language models, refer to Amazon EC2 Inf2 Instances for Low-Cost, High-Performance Generative AI Inference are Now Generally Available. For performance details, refer to Inf2 Performance. Check out additional examples on the GitHub repo.

Special thanks to Matthew Mcclain, Beni Hegedus, Kamran Khan, Shruti Koparkar, and Qing Lan for reviewing and providing valuable inputs.

About the Authors

Vivek Gangasani is a Senior Machine Learning Solutions Architect at Amazon Web Services. He works with machine learning startups to build and deploy AI/ML applications on AWS. He is currently focused on delivering solutions for MLOps, ML inference, and low-code ML. He has worked on projects in different domains, including natural language processing and computer vision.

K.C. Tung is a Senior Solution Architect in AWS Annapurna Labs. He specializes in large deep learning model training and deployment at scale in the cloud. He has a Ph.D. in molecular biophysics from the University of Texas Southwestern Medical Center in Dallas. He has spoken at AWS Summits and AWS re:Invent. Today he helps customers train and deploy large PyTorch and TensorFlow models in the AWS cloud. He is the author of two books: Learn TensorFlow Enterprise and TensorFlow 2 Pocket Reference.

Rupinder Grewal is a Sr. AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.
