How Patsnap used GPT-2 inference on Amazon SageMaker with low latency and cost

This blog post was co-authored, and includes an introduction, by Zilong Bai, senior natural language processing engineer at Patsnap.

You're likely familiar with the autocomplete suggestion feature when you search for something on Google or Amazon. Although the search terms in these scenarios are fairly common keywords or expressions that we use in daily life, in some cases search terms are very specific to the scenario. Patent search is one of them. Recently, the AWS Generative AI Innovation Center collaborated with Patsnap to implement a feature that automatically suggests search keywords, as an innovation exploration to improve user experiences on their platform.

Patsnap provides a global one-stop platform for patent search, analysis, and management. They use big data (such as a history of past search queries) to provide many powerful yet easy-to-use patent tools. These tools have enabled Patsnap's global customers to better understand patents, track recent technological advances, identify innovation trends, and analyze competitors in real time.

At the same time, Patsnap is embracing the power of machine learning (ML) to develop features that can continuously improve user experiences on the platform. A recent initiative is to simplify the difficulty of constructing search expressions by autofilling patent search queries using state-of-the-art text generation models. Patsnap had trained a customized GPT-2 model for this purpose. Because no such feature exists in a patent search engine (to the best of their knowledge), Patsnap believes adding this feature will increase end-user stickiness.

However, in their recent experiments, the inference latency and queries per second (QPS) of a PyTorch-based GPT-2 model couldn't meet the thresholds that would justify its business value. To tackle this challenge, AWS Generative AI Innovation Center scientists explored a variety of solutions to optimize GPT-2 inference performance, reducing the model latency by 50% on average and improving the QPS by 200%.

Large language model inference challenges and optimization approaches

In general, applying such a large model in a real-world production environment is non-trivial. The prohibitive computation cost and latency of PyTorch-based GPT-2 made it difficult to adopt widely from a business operation perspective. In this project, our objective is to significantly improve the latency with reasonable computation costs. Specifically, Patsnap requires the following:

  • The average latency of model inference for generating search expressions needs to be kept within 600 milliseconds in real-time search scenarios
  • The model requires high throughput and QPS to handle a large number of searches per second during peak business hours

In this post, we discuss our findings using Amazon Elastic Compute Cloud (Amazon EC2) instances, featuring GPU-based instances using NVIDIA TensorRT.

In short, we use NVIDIA TensorRT to optimize the latency of GPT-2 and deploy it to an Amazon SageMaker endpoint for model serving, which reduces the average latency from 1,172 milliseconds to 531 milliseconds (at a request concurrency of 4).

In the following sections, we go over the technical details of the proposed solutions with key code snippets and show comparisons with the customer's status quo based on key metrics.

GPT-2 model overview

OpenAI's GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on the WebText dataset, containing 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where we prime the model with an input and let it generate a lengthy continuation. In this scenario, we exploit it to generate search queries. As GPT models keep growing larger, inference costs are continuously rising, which increases the need to deploy these models at acceptable cost.
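To make the "predict the next word, then continue" idea concrete, here is a minimal sketch of greedy autoregressive completion. It is not GPT-2: a tiny bigram counter stands in for the language model, and the search history, function names, and tie-breaking rule are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(queries):
    """Count which token follows which across whitespace-tokenized queries."""
    counts = defaultdict(Counter)
    for query in queries:
        tokens = query.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def autocomplete(counts, prefix, max_new_tokens=3):
    """Greedily append the most frequent next token, one token at a time."""
    tokens = prefix.split()
    if not tokens:
        return prefix
    for _ in range(max_new_tokens):
        followers = counts.get(tokens[-1])
        if not followers:
            break  # nothing was ever seen after this token
        # Most frequent follower wins; ties are broken alphabetically.
        best = min(followers, key=lambda tok: (-followers[tok], tok))
        tokens.append(best)
    return " ".join(tokens)

# Hypothetical search history used only for this illustration.
history = [
    "lithium battery anode material",
    "lithium battery separator",
    "lithium battery anode coating",
    "solar cell efficiency",
]
model = train_bigram(history)
print(autocomplete(model, "lithium"))  # lithium battery anode coating
```

GPT-2 follows the same loop in spirit, but conditions each prediction on the whole prefix rather than only the last token, which is where most of the inference cost comes from.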

Achieve low latency on GPU instances via TensorRT

TensorRT is a C++ library for high-performance inference on NVIDIA GPUs and deep learning accelerators, supporting major deep learning frameworks such as PyTorch and TensorFlow. Previous studies have shown great performance improvement in terms of model latency. Therefore, it's an ideal choice for us to reduce the latency of the target model on NVIDIA GPUs.

We are able to achieve a significant reduction in GPT-2 model inference latency with a TensorRT-based model on NVIDIA GPUs. The TensorRT-based model is deployed via SageMaker for performance tests. In this post, we show the steps to convert the original PyTorch-based GPT-2 model to a TensorRT-based model.

Converting the PyTorch-based GPT-2 to the TensorRT-based model isn't difficult with the official tool provided by NVIDIA. In addition, with such straightforward conversions, no obvious model accuracy degradation has been observed. In general, there are three steps to follow:

  1. Analyze your GPT-2. As of this writing, NVIDIA's conversion tool only supports Hugging Face's version of the GPT-2 model. If the current GPT-2 model isn't the original version, you need to modify it accordingly. It's recommended to strip out custom code from the original GPT-2 implementation of Hugging Face, which is very helpful for the conversion.
  2. Install the required Python packages. The conversion process first converts the PyTorch-based model to the ONNX model and then converts the ONNX-based model to the TensorRT-based model. The following Python packages are needed for this two-step conversion:

  3. Convert your model. The following code contains the functions for the two-step conversion:
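# The classes used below come from NVIDIA's conversion tool. GPT2_VARIANT, model, and
# config are assumed to be defined by the caller before these functions run.
def torch2onnx():
    metadata = NetworkMetadata(variant=GPT2_VARIANT, precision=Precision(fp16=True), other=GPT2Metadata(kv_cache=False))
    # Assumption: the first argument is missing in the source; `model` stands for the
    # PyTorch-based GPT-2, moved to CPU before export.
    gpt2 = GPT2TorchFile(model.to('cpu'), metadata)
    onnx_path = 'Your own path to save ONNX-based model' # e.g., ./model_fp16.onnx
    gpt2.as_onnx_model(onnx_path, force_overwrite=False)
    return onnx_path, metadata

def onnx2trt(onnx_path, metadata):
    trt_path = 'Your own path to save TensorRT-based model' # e.g., ./model_fp16.onnx.engine
    batch_size = 10
    max_sequence_length = 42
    # Dynamic-shape profile: (batch, sequence) bounds the engine is built for.
    profiles = [Profile().add(
        "input_ids",  # assumption: input tensor name, missing in the source
        min=(1, 1),
        opt=(batch_size, max_sequence_length // 2),
        max=(batch_size, max_sequence_length),
    )]
    gpt2_engine = GPT2ONNXFile(onnx_path, metadata).as_trt_engine(output_fpath=trt_path, profiles=profiles)
    gpt2_trt = GPT2TRTDecoder(gpt2_engine, metadata, config, max_sequence_length=max_sequence_length, batch_size=batch_size)
    return gpt2_trt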
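The min/opt/max shapes passed to the profile in the conversion code bound the (batch, sequence) input shapes the engine accepts, with opt being the shape it is tuned for. The following sketch only reproduces that arithmetic in plain Python so the bounds are easy to inspect; `make_profile_shapes` and `fits_profile` are hypothetical helper names and do not belong to TensorRT or NVIDIA's tool.

```python
def make_profile_shapes(batch_size, max_sequence_length):
    """Return the (batch, sequence) shapes used for min/opt/max in the conversion code."""
    return {
        "min": (1, 1),
        "opt": (batch_size, max_sequence_length // 2),
        "max": (batch_size, max_sequence_length),
    }

def fits_profile(shape, profile):
    """True if the shape has the right rank and every dimension lies within [min, max]."""
    if len(shape) != len(profile["min"]):
        return False
    return all(
        low <= dim <= high
        for low, dim, high in zip(profile["min"], shape, profile["max"])
    )

profile = make_profile_shapes(batch_size=10, max_sequence_length=42)
print(profile["opt"])                   # (10, 21)
print(fits_profile((4, 30), profile))   # True: 4 queries of 30 tokens fit
print(fits_profile((16, 30), profile))  # False: batch of 16 exceeds the maximum of 10
```

With the values from the post (batch size 10, maximum sequence length 42), a request outside these bounds would need a rebuilt engine with a wider profile.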

Latency comparison: PyTorch vs. TensorRT

JMeter is used for performance benchmarking in this project. JMeter is an Apache project that can be used as a load testing tool for analyzing and measuring the performance of a variety of services. We record the QPS and latency of the original PyTorch-based model and our converted TensorRT-based GPT-2 model on an AWS p3.2xlarge instance. As we show later in this post, due to the powerful acceleration ability of TensorRT, the latency of GPT-2 is significantly reduced. When the request concurrency is 1, the average latency is reduced by 274 milliseconds (2.9 times faster). From the perspective of QPS, it increases to 7 from 2.4, which is around a 2.9 times boost compared to the original PyTorch-based model. Moreover, as the concurrency increases, QPS keeps increasing. This suggests lower costs with an acceptable latency increase (but still much faster than the original model).

The following table compares latency (all latency values are in milliseconds):

Model                                     Concurrency  QPS         Maximum latency  Minimum latency  Average latency
Customer PyTorch version (on p3.2xlarge)  1            2.4         632              105              417
                                          2            3.1         919              168              636
                                          3            3.4         1911             222              890
                                          4            3.4         2458             277              1172
AWS TensorRT version (on p3.2xlarge)      1            7 (+4.6)    275              22               143 (-274 ms)
                                          2            7.2 (+4.1)  274              51               361 (-275 ms)
                                          3            7.3 (+3.9)  548              49               404 (-486 ms)
                                          4            7.5 (+4.1)  765              62               531 (-641 ms)
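The deltas and speedup factors quoted in this section can be recomputed directly from the table. The sketch below copies the QPS and average latency columns and derives the differences; the dictionary layout and the `compare` helper are illustrative choices, not part of the original benchmark tooling.

```python
# Rows copied from the table: concurrency -> (QPS, average latency in ms)
pytorch = {1: (2.4, 417), 2: (3.1, 636), 3: (3.4, 890), 4: (3.4, 1172)}
tensorrt = {1: (7.0, 143), 2: (7.2, 361), 3: (7.3, 404), 4: (7.5, 531)}

def compare(baseline, optimized):
    """Derive latency and QPS improvements per concurrency level."""
    rows = {}
    for concurrency in sorted(baseline):
        base_qps, base_ms = baseline[concurrency]
        opt_qps, opt_ms = optimized[concurrency]
        rows[concurrency] = {
            "latency_saved_ms": base_ms - opt_ms,
            "latency_speedup": round(base_ms / opt_ms, 1),
            "qps_gain": round(opt_qps - base_qps, 1),
            "qps_ratio": round(opt_qps / base_qps, 1),
        }
    return rows

rows = compare(pytorch, tensorrt)
for concurrency, row in rows.items():
    print(concurrency, row)
```

At concurrency 1 this gives 274 ms saved and a 2.9 times speedup for both latency and QPS, matching the text; at concurrency 4 the average latency drops from 1,172 ms to 531 ms, a factor of about 2.2.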

Deploy TensorRT-based GPT-2 with SageMaker and a custom container

TensorRT-based GPT-2 requires a relatively recent TensorRT version, so we choose the bring your own container (BYOC) mode of SageMaker to deploy our model. BYOC mode provides a flexible way to deploy the model, and you can build customized environments in your own Docker container. In this section, we show how to build your own container, deploy your own GPT-2 model, and test with the SageMaker endpoint API.

Build your own container

The container's file directory is presented in the following code. Specifically, Dockerfile and are used to build the Docker container. gpt2 and implement the model and the inference API. serve, nginx.conf, and provide the configuration for the NGINX web server.

├── Dockerfile    # build our Docker image based on this file
├──      # create our own image and push it to Amazon ECR
├── gpt2          # model directory
├──  # backend function for invoking the model
├── serve         # web server setting file
├── nginx.conf    # web server setting file
└──       # web server setting file

You can run sh ./ to build the container.
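For orientation, a SageMaker inference container is expected to answer GET /ping health checks and POST /invocations requests on port 8080. The following is a minimal sketch of that contract using only the Python standard library instead of the NGINX setup used in this post. `generate_suggestions`, the `input` request field, and the `suggestions` response field are hypothetical placeholders, not the actual backend code.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def generate_suggestions(text):
    """Hypothetical stand-in for the call into the TensorRT-based GPT-2 model."""
    return [text + " ..."]

def route(method, path, body=b""):
    """Map a request to (status_code, response_bytes)."""
    if method == "GET" and path == "/ping":
        return 200, b""  # health check: an empty 200 response means healthy
    if method == "POST" and path == "/invocations":
        try:
            text = json.loads(body)["input"]  # assumed request field name
        except (ValueError, KeyError, TypeError):
            error = {"error": "expected JSON with an 'input' field"}
            return 400, json.dumps(error).encode()
        return 200, json.dumps({"suggestions": generate_suggestions(text)}).encode()
    return 404, b""

class Handler(BaseHTTPRequestHandler):
    def _reply(self, status, payload):
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def do_GET(self):
        self._reply(*route("GET", self.path))

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self._reply(*route("POST", self.path, self.rfile.read(length)))

def serve(port=8080):
    """Start the server; call this from the container entry point."""
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()
```

Keeping the routing in a plain function such as `route` makes the request handling easy to check without starting a server; a production container would typically put a web server such as NGINX in front of the model process, as this post does.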

Deploy to a SageMaker endpoint

After you have built a container to run the TensorRT-based GPT-2, you can enable real-time inference via a SageMaker endpoint. Use the following code snippets to create the endpoint and deploy the model to the endpoint using the corresponding SageMaker APIs:

import boto3
from time import gmtime, strftime
from sagemaker import get_execution_role

sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client(service_name="sagemaker-runtime")
account_id = boto3.client('sts').get_caller_identity()['Account']
region = boto3.Session().region_name
s3_bucket = "${Your s3 bucket}"
role = get_execution_role()
model_name = "${Your Model Name}"
# You need to push your container image to Amazon ECR first
container = "${Your Image Path}"
# Assumption: the source does not define instance_type; this matches the p3.2xlarge benchmark instance
instance_type = "ml.p3.2xlarge"
container = {
    'Image': container
}
create_model_response = sm_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    Containers = [container])

# Endpoint Setting
endpoint_config_name = "${Your Endpoint Config Name}"
print('Endpoint config name: ' + endpoint_config_name)
create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants = [{
        'InstanceType': instance_type,
        'InitialInstanceCount': 1,
        'InitialVariantWeight': 1,
        'ModelName': model_name,
        'VariantName': 'AllTraffic'}])
print("Endpoint config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

# Deploy Model
endpoint_name = "${Your Endpoint Name}"
print('Endpoint name: ' + endpoint_name)
create_endpoint_response = sm_client.create_endpoint(
    EndpointName = endpoint_name,
    EndpointConfigName = endpoint_config_name)
print('Endpoint Arn: ' + create_endpoint_response['EndpointArn'])
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Endpoint Status: " + status)
print('Waiting for {} endpoint to be in service...'.format(endpoint_name))
# Block until the endpoint reaches the InService state
waiter = sm_client.get_waiter('endpoint_in_service')
waiter.wait(EndpointName=endpoint_name)

Test the deployed model

After the model is successfully deployed, you can test the endpoint via the SageMaker notebook instance with the following code:

import json
import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime", region_name="us-east-2")
endpoint_name = "${Your Endpoint Name}"
request_body = {"input": "amazon"}
payload = json.dumps(request_body)
content_type = "application/json"
response = sagemaker_runtime.invoke_endpoint(
                            EndpointName=endpoint_name,
                            ContentType=content_type,
                            Body=payload # Replace with your own data.
                            )
# The response body is a stream; read and decode it before parsing the JSON
result = json.loads(response['Body'].read().decode())
print(result)
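Beyond a single test call, it can be useful to take a rough latency and QPS reading at different concurrency levels and compare it with the 600 millisecond target. The following is a minimal sketch of such a measurement with a thread pool; it is not the JMeter setup used for the numbers in this post. `benchmark` is a hypothetical helper, and `fake_invoke` is a stand-in that sleeps for 10 ms where a real run would call the endpoint.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def benchmark(invoke, payloads, concurrency=1):
    """Call invoke(payload) for each payload using `concurrency` worker threads.

    Expects a non-empty list of payloads.
    """
    def timed(payload):
        start = time.perf_counter()
        invoke(payload)
        return (time.perf_counter() - start) * 1000.0  # latency in ms

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed, payloads))
    wall_seconds = time.perf_counter() - wall_start

    return {
        "requests": len(latencies),
        "qps": len(latencies) / wall_seconds,
        "min_ms": min(latencies),
        "max_ms": max(latencies),
        "avg_ms": sum(latencies) / len(latencies),
    }

def fake_invoke(payload):
    """Stand-in for a call to the endpoint; replace with a real invocation."""
    time.sleep(0.01)

stats = benchmark(fake_invoke, ["amazon"] * 20, concurrency=4)
print(stats)
print("meets 600 ms target:", stats["avg_ms"] <= 600)
```

To measure the deployed model, pass a function that wraps the endpoint invocation in place of `fake_invoke` and repeat the run for each concurrency level of interest.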


Conclusion

In this post, we described how to enable low-latency GPT-2 inference on SageMaker to create business value. Specifically, with the support of NVIDIA TensorRT, we can achieve 2.9 times acceleration on NVIDIA GPU instances with SageMaker for a customized GPT-2 model.

If you want help with accelerating the use of GenAI models in your products and services, please contact the AWS Generative AI Innovation Center. The AWS Generative AI Innovation Center can help you make your ideas a reality faster and more effectively. To get started with the Generative AI Innovation Center, visit here.

About the Authors

Hao Huang is an applied scientist at the AWS Generative AI Innovation Center. He specializes in Computer Vision (CV) and Visual-Language Model (VLM). Recently, he has developed a strong interest in generative AI technologies and has already collaborated with customers to apply these cutting-edge technologies to their business. He is also a reviewer for AI conferences such as ICCV and AAAI.

Zilong Bai is a senior natural language processing engineer at Patsnap. He is passionate about research and proof-of-concept work on cutting-edge techniques for generative language models.

Yuanjun Xiao is a Solution Architect at AWS. He is responsible for AWS architecture consulting and design. He is also passionate about building AI and analytic solutions.

Xuefei Zhang is an applied scientist at the AWS Generative AI Innovation Center who works in NLP and AGI areas to solve industry problems with customers.

Guang Yang is a senior applied scientist at the AWS Generative AI Innovation Center, where he works with customers across various verticals and applies creative problem solving to generate value for customers with state-of-the-art ML/AI solutions.
