As organizations deploy models to production, they constantly look for ways to optimize the performance of their foundation models (FMs) running on the latest accelerators, such as AWS Inferentia and GPUs, so they can reduce costs and decrease response latency to provide the best experience to end-users. However, some FMs don’t fully utilize the accelerators available on the instances they’re deployed on, leading to inefficient use of hardware resources. Some organizations deploy multiple FMs to the same instance to better utilize all of the available accelerators, but this requires complex infrastructure orchestration that is time-consuming and difficult to manage. When multiple FMs share the same instance, each FM has its own scaling needs and usage patterns, making it challenging to predict when you need to add or remove instances. For example, one model may power a user application where usage can spike during certain hours, whereas another model may have a more consistent usage pattern.
In addition to optimizing costs, customers want to provide the best end-user experience by reducing latency. To do this, they often deploy multiple copies of an FM to field requests from users in parallel. Because FM outputs could range from a single sentence to multiple paragraphs, the time it takes to complete an inference request varies significantly, leading to unpredictable spikes in latency if the requests are routed randomly between instances.
Amazon SageMaker now supports new inference capabilities that help you reduce deployment costs and latency.
You can now create inference component-based endpoints and deploy machine learning (ML) models to a SageMaker endpoint. An inference component (IC) abstracts your ML model and enables you to assign CPUs, GPUs, or AWS Neuron accelerators, and scaling policies per model. Inference components offer the following benefits:
- SageMaker will optimally place and pack models onto ML instances to maximize utilization, leading to cost savings.
- SageMaker will scale each model up and down based on your configuration to meet your ML application requirements.
- SageMaker will scale to add and remove instances dynamically to ensure capacity is available while keeping idle compute to a minimum.
- You can scale down to zero copies of a model to free up resources for other models. You can also specify to keep important models always loaded and ready to serve traffic.
With these capabilities, you can reduce model deployment costs by 50% on average. The cost savings will vary depending on your workload and traffic patterns. Let’s take a simple example to illustrate how packing multiple models on a single endpoint can maximize utilization and save costs. Let’s say you have a chat application that helps tourists understand local customs and best practices built using two variants of Llama 2: one fine-tuned for European visitors and the other fine-tuned for American visitors. We expect traffic for the European model between 00:01–11:59 UTC and the American model between 12:00–23:59 UTC. Instead of deploying these models on their own dedicated instances where they will sit idle half the time, you can now deploy them on a single endpoint to save costs. You can scale down the American model to zero when it isn’t needed to free up capacity for the European model and vice versa. This allows you to utilize your hardware efficiently and avoid waste. This is a simple example using two models, but you can easily extend this idea to pack hundreds of models onto a single endpoint that automatically scales up and down with your workload.
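The arithmetic behind the two-model example can be sketched as follows. This is only an illustration under stated assumptions: the hourly rate is a hypothetical placeholder (not a real AWS price), each model is assumed to fit on one instance, and the shared instance is assumed to serve whichever model is active at the time.

```python
def daily_cost(instance_count: int, hours_per_day: float, hourly_rate: float) -> float:
    """Cost of keeping `instance_count` instances running for part of a day."""
    return instance_count * hours_per_day * hourly_rate


# Hypothetical hourly rate, used only to make the comparison concrete.
HOURLY_RATE = 10.0

# Two dedicated instances, each idle roughly half of the day.
dedicated = daily_cost(instance_count=2, hours_per_day=24, hourly_rate=HOURLY_RATE)

# One shared instance: the European and American models take turns on it.
shared = daily_cost(instance_count=1, hours_per_day=24, hourly_rate=HOURLY_RATE)

savings = 1 - shared / dedicated
print(f"Dedicated: {dedicated:.2f}, shared: {shared:.2f}, savings: {savings:.0%}")
```

The ratio does not depend on the placeholder rate; real savings depend on how well the traffic patterns of your models complement each other.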
In this post, we show you the new capabilities of IC-based SageMaker endpoints. We also walk you through deploying multiple models using inference components and APIs. Lastly, we detail some of the new observability capabilities and how to set up auto scaling policies for your models and manage instance scaling for your endpoints. You can also deploy models through our new simplified, interactive user experience. We also support advanced routing capabilities to optimize the latency and performance of your inference workloads.
Building blocks
Let’s take a deeper look and understand how these new capabilities work. The following is some new terminology for SageMaker hosting:
- Inference component – A SageMaker hosting object that you can use to deploy a model to an endpoint. You can create an inference component by supplying the following:
  - The SageMaker model or specification of a SageMaker-compatible image and model artifacts.
  - Compute resource requirements, which specify the needs of each copy of your model, including CPU cores, host memory, and number of accelerators.
- Model copy – A runtime copy of an inference component that is capable of serving requests.
- Managed instance auto scaling – A SageMaker hosting capability to scale up or down the number of compute instances used for an endpoint. Instance scaling reacts to the scaling of inference components.
To create a new inference component, you can specify a container image and a model artifact, or you can use SageMaker models that you may have already created. You also need to specify the compute resource requirements such as the number of host CPU cores, host memory, or the number of accelerators your model needs to run.
When you deploy an inference component, you can specify MinCopies to ensure that the model is already loaded in the quantity that you require, ready to serve requests.
You also have the option to set your policies so that inference component copies scale to zero. For example, if you have no load running against an IC, the model copy will be unloaded. This frees up resources that active workloads can then use, which optimizes the utilization and efficiency of your endpoint.
As inference requests increase or decrease, the number of copies of your ICs can also scale up or down based on your auto scaling policies. SageMaker will handle the placement to optimize the packing of your models for availability and cost.
In addition, if you enable managed instance auto scaling, SageMaker will scale compute instances according to the number of inference components that need to be loaded at a given time to serve traffic. SageMaker will scale up the instances and pack your instances and inference components to optimize for cost while preserving model performance. Although we recommend the use of managed instance scaling, you also have the option to manage the scaling yourself, should you choose to, through application auto scaling.
SageMaker will rebalance inference components and scale down instances that are no longer needed by any inference component, which reduces your costs.
Walkthrough of APIs
SageMaker has introduced a new entity called the InferenceComponent. This decouples the details of hosting the ML model from the endpoint itself. The InferenceComponent allows you to specify key properties for hosting the model, such as the SageMaker model you want to use or the container details and model artifacts. You also specify the number of copies of the component to deploy and the number of accelerators (GPUs, Inf, or Trn accelerators) or CPUs (vCPUs) required. This gives you the flexibility to use a single endpoint for any number of models you plan to deploy to it in the future.
Let’s look at the Boto3 API calls to create an endpoint with an inference component. Note that there are some parameters that we address later in this post.
The following is example code for CreateEndpointConfig:
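As a minimal sketch of what a CreateEndpointConfig request for an IC-based endpoint might look like: the configuration name, role ARN, variant name, and instance type below are hypothetical placeholders. The sketch assumes that the execution role is set on the endpoint configuration and that the production variant carries no model name, because models are attached later as inference components.

```python
def build_endpoint_config_request(
    config_name: str,
    role_arn: str,
    instance_type: str = "ml.g5.12xlarge",  # hypothetical choice of instance type
    initial_instance_count: int = 1,
) -> dict:
    """Build keyword arguments for the SageMaker CreateEndpointConfig call."""
    return {
        "EndpointConfigName": config_name,
        "ExecutionRoleArn": role_arn,
        "ProductionVariants": [
            {
                # No ModelName here: models are added as inference components.
                "VariantName": "AllTraffic",
                "InstanceType": instance_type,
                "InitialInstanceCount": initial_instance_count,
            }
        ],
    }


request = build_endpoint_config_request(
    config_name="my-ic-endpoint-config",  # hypothetical name
    role_arn="arn:aws:iam::111122223333:role/ExampleSageMakerRole",  # placeholder ARN
)

# With AWS credentials configured, the request could be sent with Boto3:
#   import boto3
#   boto3.client("sagemaker").create_endpoint_config(**request)
```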
The following is example code for CreateEndpoint:
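A minimal sketch of the CreateEndpoint step, assuming a Boto3 SageMaker client is passed in and that you want to block until the endpoint is in service before adding inference components. The helper name and the endpoint and configuration names are hypothetical.

```python
def create_endpoint_and_wait(sagemaker_client, endpoint_name: str, config_name: str) -> None:
    """Create an endpoint from an existing endpoint config and wait until it is in service."""
    sagemaker_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=config_name,
    )
    # Boto3 ships an "endpoint_in_service" waiter that polls DescribeEndpoint.
    waiter = sagemaker_client.get_waiter("endpoint_in_service")
    waiter.wait(EndpointName=endpoint_name)


# With AWS credentials configured:
#   import boto3
#   create_endpoint_and_wait(boto3.client("sagemaker"), "my-ic-endpoint", "my-ic-endpoint-config")
```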
The following is example code for CreateInferenceComponent:
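A minimal sketch of a CreateInferenceComponent request, assuming you reference an existing SageMaker model by name (you could instead supply container details and model artifacts). All names and the resource numbers are hypothetical; size them for your own model.

```python
def build_inference_component_request(
    component_name: str,
    endpoint_name: str,
    model_name: str,
    accelerators: int = 1,
    min_memory_mb: int = 1024,
    copy_count: int = 1,
    variant_name: str = "AllTraffic",
) -> dict:
    """Build keyword arguments for the SageMaker CreateInferenceComponent call."""
    return {
        "InferenceComponentName": component_name,
        "EndpointName": endpoint_name,
        "VariantName": variant_name,
        "Specification": {
            # Reference an existing SageMaker model.
            "ModelName": model_name,
            # Resources reserved for EACH copy of the model.
            "ComputeResourceRequirements": {
                "NumberOfAcceleratorDevicesRequired": accelerators,
                "MinMemoryRequiredInMb": min_memory_mb,
            },
        },
        # Number of model copies to load initially.
        "RuntimeConfig": {"CopyCount": copy_count},
    }


request = build_inference_component_request(
    component_name="llama2-europe",  # hypothetical names
    endpoint_name="my-ic-endpoint",
    model_name="llama2-europe-model",
)

# With AWS credentials configured:
#   import boto3
#   boto3.client("sagemaker").create_inference_component(**request)
```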
Decoupling InferenceComponent from the endpoint provides flexibility. You can host multiple models on the same infrastructure, adding or removing them as your requirements change. Each model can be updated independently as needed. Additionally, you can scale models according to your business needs. InferenceComponent also allows you to control capacity per model. In other words, you can determine how many copies of each model to host. This predictable scaling helps you meet the specific latency requirements for each model. Overall, InferenceComponent gives you much more control over your hosted models.
In the following table, we show a side-by-side comparison of the high-level approach to creating and invoking an endpoint without InferenceComponent and with InferenceComponent. Note that CreateModel() is now optional for IC-based endpoints.
Step | Model-Based Endpoints | Inference Component-Based Endpoints |
1 | CreateModel(…) | CreateEndpointConfig(…) |
2 | CreateEndpointConfig(…) | CreateEndpoint(…) |
3 | CreateEndpoint(…) | CreateInferenceComponent(…) |
4 | InvokeEndpoint(…) | InvokeEndpoint(InferenceComponentName='value'…) |
The introduction of InferenceComponent allows you to scale at a model level. See Delve into instance and IC auto scaling for more details on how InferenceComponent works with auto scaling.
When invoking the SageMaker endpoint, you can now specify the new parameter InferenceComponentName to target the desired inference component. SageMaker will handle routing the request to an instance hosting the requested inference component. See the following code:
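A minimal sketch of such an invocation, assuming a Boto3 sagemaker-runtime client is passed in and that the container accepts and returns JSON. The helper name and the payload shape are hypothetical and depend on your model container.

```python
import json


def invoke_inference_component(
    runtime_client,
    endpoint_name: str,
    inference_component_name: str,
    payload: dict,
) -> dict:
    """Send a JSON request to one inference component hosted on an endpoint."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        InferenceComponentName=inference_component_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    # The response body is a stream; read it and decode the JSON document.
    return json.loads(response["Body"].read())


# With AWS credentials configured:
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   invoke_inference_component(runtime, "my-ic-endpoint", "llama2-europe",
#                              {"inputs": "What is the tipping custom in Paris?"})
```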
By default, SageMaker uses random routing of the requests to the instances backing your endpoint. If you want to enable least outstanding requests routing, you can set the routing strategy in the endpoint config’s RoutingConfig:
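As a minimal sketch, the routing strategy can be attached to a production variant before the endpoint configuration is created. The helper name and the variant contents are hypothetical; the two strategy values are the ones the RoutingConfig setting accepts.

```python
ALLOWED_ROUTING_STRATEGIES = {"LEAST_OUTSTANDING_REQUESTS", "RANDOM"}


def with_routing_strategy(variant: dict, strategy: str = "LEAST_OUTSTANDING_REQUESTS") -> dict:
    """Return a copy of a production variant with RoutingConfig set."""
    if strategy not in ALLOWED_ROUTING_STRATEGIES:
        raise ValueError(f"Unsupported routing strategy: {strategy}")
    return {**variant, "RoutingConfig": {"RoutingStrategy": strategy}}


variant = with_routing_strategy(
    {
        "VariantName": "AllTraffic",
        "InstanceType": "ml.g5.12xlarge",  # hypothetical instance type
        "InitialInstanceCount": 1,
    }
)
# `variant` can then be used as an entry in ProductionVariants
# when calling create_endpoint_config.
```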
Least outstanding requests routing routes to the specific instances that have more capacity to process requests. This will provide more uniform load-balancing and resource utilization.
In addition to CreateInferenceComponent, the following APIs are now available:
- DescribeInferenceComponent
- DeleteInferenceComponent
- UpdateInferenceComponent
- ListInferenceComponents
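A short sketch of how two of these APIs might be used through Boto3, assuming a SageMaker client is passed in. The helper names are hypothetical, and pagination of the list call is ignored for brevity.

```python
def list_component_names(sagemaker_client, endpoint_name: str) -> list:
    """Return the names of the inference components on an endpoint (first page only)."""
    response = sagemaker_client.list_inference_components(EndpointNameEquals=endpoint_name)
    return [ic["InferenceComponentName"] for ic in response["InferenceComponents"]]


def set_copy_count(sagemaker_client, inference_component_name: str, copy_count: int) -> None:
    """Change the number of running copies of one inference component."""
    sagemaker_client.update_inference_component(
        InferenceComponentName=inference_component_name,
        RuntimeConfig={"CopyCount": copy_count},
    )


# With AWS credentials configured:
#   import boto3
#   sm = boto3.client("sagemaker")
#   for name in list_component_names(sm, "my-ic-endpoint"):
#       print(name)
#   set_copy_count(sm, "llama2-europe", 2)
```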
InferenceComponent logs and metrics
InferenceComponent logs are located in /aws/sagemaker/InferenceComponents/<InferenceComponentName>. All logs sent to stderr and stdout in the container are sent to these logs in Amazon CloudWatch.
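A minimal sketch for reading these logs, assuming a Boto3 CloudWatch Logs client is passed in. The helper names and the default limit are hypothetical; the log group path follows the pattern described above.

```python
def inference_component_log_group(inference_component_name: str) -> str:
    """Build the CloudWatch log group name for an inference component."""
    return f"/aws/sagemaker/InferenceComponents/{inference_component_name}"


def recent_log_messages(logs_client, inference_component_name: str, limit: int = 20) -> list:
    """Fetch up to `limit` log messages (stdout and stderr) for an inference component."""
    response = logs_client.filter_log_events(
        logGroupName=inference_component_log_group(inference_component_name),
        limit=limit,
    )
    return [event["message"] for event in response.get("events", [])]


# With AWS credentials configured:
#   import boto3
#   for line in recent_log_messages(boto3.client("logs"), "llama2-europe"):
#       print(line)
```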
With the introduction of IC-based endpoints, you now have the ability to view additional instance metrics, inference component metrics, and invocation metrics.
For SageMaker instances, you can now track the GPUReservation and CPUReservation metrics to see the resources reserved for an endpoint based on the inference components that you have deployed. These metrics can help you size your endpoint and auto scaling policies. You can also view the aggregate metrics associated with all models deployed to an endpoint.
SageMaker also exposes metrics at the inference component level, which give a more granular view of resource utilization for the inference components that you have deployed. For example, you can view aggregate resource utilization metrics such as GPUUtilizationNormalized and GPUMemoryUtilizationNormalized for each inference component you have deployed, which may have zero or many copies.
Lastly, SageMaker provides invocation metrics, which now track invocations for inference components in aggregate (Invocations) or per copy instantiated (InvocationsPerCopy).
For a comprehensive list of metrics, refer to SageMaker Endpoint Invocation Metrics.
Model-level auto scaling
To implement the auto scaling behavior we described, when creating the SageMaker endpoint configuration and inference component, you define the initial instance count and initial model copy count, respectively. After you create the endpoint and corresponding ICs, to apply auto scaling at the IC level, you need to first register the scaling target and then associate the scaling policy to the IC.
When implementing the scaling policy, we use SageMakerInferenceComponentInvocationsPerCopy, which is a new metric introduced by SageMaker. It captures the average number of invocations per model copy per minute.
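The two steps, registering the scaling target and attaching a target tracking policy, can be sketched as request builders for the Application Auto Scaling API. The helper name, policy name, copy limits, target value, and cool-down values below are hypothetical; tune them for your workload.

```python
def build_scaling_requests(
    inference_component_name: str,
    min_copies: int,
    max_copies: int,
    target_invocations_per_copy: float,
) -> tuple:
    """Build kwargs for register_scalable_target and put_scaling_policy."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"inference-component/{inference_component_name}",
        "ScalableDimension": "sagemaker:inference-component:DesiredCopyCount",
    }
    target = {**common, "MinCapacity": min_copies, "MaxCapacity": max_copies}
    policy = {
        **common,
        "PolicyName": f"{inference_component_name}-invocations-per-copy",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy",
            },
            "TargetValue": target_invocations_per_copy,
            "ScaleInCooldown": 300,   # seconds, illustrative value
            "ScaleOutCooldown": 300,  # seconds, illustrative value
        },
    }
    return target, policy


target, policy = build_scaling_requests("llama2-europe", 1, 4, 10.0)

# With AWS credentials configured:
#   import boto3
#   aas = boto3.client("application-autoscaling")
#   aas.register_scalable_target(**target)
#   aas.put_scaling_policy(**policy)
```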
After you set the scaling policy, SageMaker creates two CloudWatch alarms for each auto scaling target: one to trigger scale-out if in alarm for 3 minutes (three 1-minute data points) and one to trigger scale-in if in alarm for 15 minutes (15 1-minute data points), as shown in the following screenshot. The time to trigger the scaling action is usually 1–2 minutes longer than these periods because it takes time for the endpoint to publish metrics to CloudWatch, and it also takes time for AutoScaling to react. The cool-down period is the amount of time, in seconds, after a scale-in or scale-out activity completes before another scale-out activity can start. If the scale-out cool-down is shorter than the endpoint update time, then it has no effect, because it’s not possible to update a SageMaker endpoint when it is in Updating status.
Note that, when setting up IC-level auto scaling, you need to make sure the maximum copy count you register for the scaling target (MaxCapacity) is equal to or smaller than the maximum number of model copies this endpoint can host. For example, if your endpoint is configured with only one instance in the endpoint configuration and this instance can host a maximum of four copies of the model, then the maximum copy count should be equal to or smaller than 4. However, you can also use the managed auto scaling capability provided by SageMaker to automatically scale the instance count based on the required number of model copies to meet the need for more compute resources. The following code snippet demonstrates how to set up managed instance scaling during the creation of the endpoint configuration. This way, when IC-level auto scaling requires more instances to host the model copies, SageMaker will automatically scale out the number of instances so that the IC-level scaling can succeed.
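A minimal sketch of enabling managed instance scaling on a production variant before the endpoint configuration is created. The helper name, the variant contents, and the instance limits are hypothetical.

```python
def with_managed_instance_scaling(variant: dict, min_instances: int, max_instances: int) -> dict:
    """Return a copy of a production variant with managed instance scaling enabled."""
    if min_instances > max_instances:
        raise ValueError("min_instances must not exceed max_instances")
    return {
        **variant,
        "ManagedInstanceScaling": {
            "Status": "ENABLED",
            "MinInstanceCount": min_instances,
            "MaxInstanceCount": max_instances,
        },
    }


variant = with_managed_instance_scaling(
    {
        "VariantName": "AllTraffic",
        "InstanceType": "ml.g5.12xlarge",  # hypothetical instance type
        "InitialInstanceCount": 1,
    },
    min_instances=1,
    max_instances=4,
)
# `variant` can then be used as an entry in ProductionVariants
# when calling create_endpoint_config.
```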
You can apply multiple auto scaling policies against the same endpoint, which means you will be able to apply the traditional auto scaling policy to the endpoints created with ICs and scale up and down based on the other endpoint metrics. For more information, refer to Optimize your machine learning deployments with auto scaling on Amazon SageMaker. However, although this is possible, we still recommend using managed instance scaling over managing the scaling yourself.
Conclusion
In this post, we introduced a new feature in SageMaker inference that will help you maximize the utilization of compute instances, scale to hundreds of models, and optimize costs, while providing predictable performance. Furthermore, we provided a walkthrough of the APIs and showed you how to configure and deploy inference components for your workloads.
We also support advanced routing capabilities to optimize the latency and performance of your inference workloads. SageMaker can help you optimize your inference workloads for cost and performance and give you model-level granularity for management. We have created a set of notebooks that will show you how to deploy three different models, using different containers and applying auto scaling policies in GitHub. We encourage you to start with notebook 1 and get hands on with the new SageMaker hosting capabilities today!
About the authors
James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.
Melanie Li, PhD, is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers build solutions using state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing ML solutions with best practices. In her spare time, she loves to explore nature and spend time with family and friends.
Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.
Alan Tan is a Senior Product Manager with SageMaker, leading efforts on large model inference. He’s passionate about applying machine learning to the area of analytics. Outside of work, he enjoys the outdoors.
Raghu Ramesha is a Senior ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.
Rupinder Grewal is a Sr. AI/ML Specialist Solutions Architect with AWS. He currently focuses on model serving and MLOps on SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.
Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including the NLP and computer vision domains. He helps customers achieve high-performance model inference on SageMaker.
Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.
Lakshmi Ramakrishnan is a Principal Engineer on the Amazon SageMaker Machine Learning (ML) platform team at AWS, providing technical leadership for the product. He has worked in several engineering roles at Amazon for over 9 years. He has a Bachelor of Engineering degree in Information Technology from the National Institute of Technology, Karnataka, India, and a Master of Science degree in Computer Science from the University of Minnesota Twin Cities.
David Nigenda is a Senior Software Development Engineer on the Amazon SageMaker team, currently working on improving production machine learning workflows, as well as launching new inference features. In his spare time, he tries to keep up with his kids.