Healthcare and life sciences (HCLS) customers are adopting generative AI as a tool to get more from their data. Use cases include document summarization to help readers focus on key points of a document and transforming unstructured text into standardized formats to highlight important attributes. With unique data formats and strict regulatory requirements, customers are looking for choices to select the most performant and cost-effective model, as well as the ability to perform necessary customization (fine-tuning) to fit their business use case. In this post, we walk you through deploying a Falcon large language model (LLM) using Amazon SageMaker JumpStart and using the model to summarize long documents with LangChain and Python.
Solution overview
Amazon SageMaker is built on Amazon’s two decades of experience developing real-world ML applications, including product recommendations, personalization, intelligent shopping, robotics, and voice-assisted devices. SageMaker is a HIPAA-eligible managed service that provides tools that enable data scientists, ML engineers, and business analysts to innovate with ML. Within SageMaker is Amazon SageMaker Studio, an integrated development environment (IDE) purpose-built for collaborative ML workflows, which, in turn, contains a wide variety of quickstart solutions and pre-trained ML models in an integrated hub called SageMaker JumpStart. With SageMaker JumpStart, you can use pre-trained models, such as the Falcon LLM, with pre-built sample notebooks and SDK support to experiment with and deploy these powerful transformer models. You can use SageMaker Studio and SageMaker JumpStart to deploy and query your own generative model in your AWS account.
You can also ensure that the inference payload data doesn’t leave your VPC. You can provision models as single-tenant endpoints and deploy them with network isolation. Furthermore, you can curate and manage the set of models that satisfy your own security requirements by using the private model hub capability within SageMaker JumpStart and storing the approved models there. SageMaker is in scope for HIPAA BAA, SOC 1/2/3, and HITRUST CSF.
The Falcon LLM is a large language model, trained by researchers at Technology Innovation Institute (TII) on over 1 trillion tokens using AWS. Falcon comes in several variants. The two main ones, Falcon 40B and Falcon 7B, have 40 billion and 7 billion parameters, respectively, and fine-tuned versions are available for specific tasks, such as following instructions. Falcon performs well on a variety of tasks, including text summarization, sentiment analysis, question answering, and conversing. This post provides a walkthrough that you can follow to deploy the Falcon LLM into your AWS account, using a managed notebook instance through SageMaker JumpStart to experiment with text summarization.
The SageMaker JumpStart model hub includes complete notebooks to deploy and query each model. As of this writing, there are six versions of Falcon available in the SageMaker JumpStart model hub: Falcon 40B Instruct BF16, Falcon 40B BF16, Falcon 180B BF16, Falcon 180B Chat BF16, Falcon 7B Instruct BF16, and Falcon 7B BF16. This post uses the Falcon 7B Instruct model.
In the following sections, we show how to get started with document summarization by deploying Falcon 7B on SageMaker JumpStart.
Prerequisites
For this tutorial, you’ll need an AWS account with a SageMaker domain. If you don’t already have a SageMaker domain, refer to Onboard to Amazon SageMaker Domain to create one.
Deploy Falcon 7B using SageMaker JumpStart
To deploy your model, complete the following steps:
- Navigate to your SageMaker Studio environment from the SageMaker console.
- Within the IDE, under SageMaker JumpStart in the navigation pane, choose Models, notebooks, solutions.
- Deploy the Falcon 7B Instruct model to an endpoint for inference.
This will open the model card for the Falcon 7B Instruct BF16 model. On this page, you can find the Deploy or Train options as well as links to open the sample notebooks in SageMaker Studio. This post will use the sample notebook from SageMaker JumpStart to deploy the model.
- Choose Open notebook.
- Run the first four cells of the notebook to deploy the Falcon 7B Instruct endpoint.
You can see your deployed JumpStart models on the Launched JumpStart assets page.
- In the navigation pane, under SageMaker JumpStart, choose Launched JumpStart assets.
- Choose the Model endpoints tab to view the status of your endpoint.
With the Falcon LLM endpoint deployed, you are ready to query the model.
Run your first query
To run a query, complete the following steps:
- On the File menu, choose New and Notebook to open a new notebook.
You can also download the completed notebook here.
- Select the image, kernel, and instance type when prompted. For this post, we choose the Data Science 3.0 image, Python 3 kernel, and ml.t3.medium instance.
- Import the Boto3 and JSON modules by entering the following two lines into the first cell:
- Press Shift + Enter to run the cell.
- Next, you can define a function that will call your endpoint. This function takes a dictionary payload and uses it to invoke the SageMaker runtime client. Then it deserializes the response and prints the input and generated text.
The payload includes the prompt as inputs, together with the inference parameters that will be passed to the model.
- You can use these parameters with the prompt to tune the output of the model for your use case:
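The code cells for the steps above are not reproduced in this text, so the following is a minimal sketch of what they might look like, not the exact notebook code. It assumes the endpoint returns a JSON list whose first element has a `generated_text` field (the shape used by Hugging Face text generation containers), that `ENDPOINT_NAME` is a placeholder you replace with your own endpoint name, and that the parameter values shown are only illustrative. The optional `client` argument is an addition for this sketch so that the function can be tried with a stub; in a notebook you would typically just run `import boto3` and `import json` in the first cell.

```python
import json

# Placeholder: replace with the name shown on the Model endpoints tab.
ENDPOINT_NAME = "ENDPOINT_NAME"


def query_endpoint(payload, endpoint_name=ENDPOINT_NAME, client=None):
    """Send a JSON payload to the endpoint and return the generated text."""
    if client is None:
        import boto3  # imported lazily so a stub client can be passed instead
        client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload).encode("utf-8"),
    )
    # Assumed response shape: [{"generated_text": "..."}]
    predictions = json.loads(response["Body"].read())
    generated_text = predictions[0]["generated_text"]
    print(f"Input text: {payload['inputs']}")
    print(f"Generated text: {generated_text}")
    return generated_text


# Illustrative values; tune them for your use case.
payload = {
    "inputs": "Write a one-sentence definition of text summarization.",
    "parameters": {
        "max_new_tokens": 50,       # upper bound on the number of generated tokens
        "return_full_text": False,  # return only the completion, not the prompt
        "do_sample": True,          # sample instead of greedy decoding
        "top_k": 10,                # sample from the 10 most likely next tokens
    },
}
```

With AWS credentials available in the notebook, `query_endpoint(payload)` sends the request and prints both the prompt and the model's completion.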
Query with a summarization prompt
This post uses a sample research paper to demonstrate summarization. The example text file is concerning automatic text summarization in biomedical literature. Complete the following steps:
- Download the PDF and copy the text into a file named document.txt.
- In SageMaker Studio, choose the upload icon and upload the file to your SageMaker Studio instance.
Out of the box, the Falcon LLM provides support for text summarization.
- Let’s create a function that uses prompt engineering techniques to summarize document.txt:
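As a rough illustration of this step, the sketch below wraps the file contents in a summarization prompt and hands the resulting payload to an endpoint-calling function. The prompt wording, the `max_new_tokens` default, and the helper name `build_summarization_payload` are assumptions made for this example. `query_fn` stands for any callable that accepts a payload dictionary, such as the `query_endpoint()` function described earlier.

```python
def build_summarization_payload(text, max_new_tokens=200):
    """Wrap the document text in a summarization prompt (wording is illustrative)."""
    prompt = (
        "Summarize the following text in a few complete sentences.\n\n"
        f"Text: {text}\n\n"
        "Summary:"
    )
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "return_full_text": False},
    }


def summarize(path, query_fn):
    """Read a text file and pass a summarization payload to query_fn."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    return query_fn(build_summarization_payload(text))
```

A call such as `summarize("document.txt", query_endpoint)` would then send the whole document to the model in a single request.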
For longer documents, this call returns an error because Falcon, like other LLMs, limits the number of tokens it accepts as input. We can get around this limit using LangChain’s summarization capabilities, which allow a much larger input to be passed to the LLM.
Import and run a summarization chain
LangChain is an open-source software library that allows developers and data scientists to quickly build, tune, and deploy custom generative applications without managing complex ML interactions. It is commonly used to abstract many of the common use cases for generative AI language models into just a few lines of code. LangChain’s support for AWS services includes support for SageMaker endpoints.
LangChain provides an accessible interface to LLMs. Its features include tools for prompt templating and prompt chaining. These chains can be used to summarize text documents that are longer than what the language model supports in a single call. You can use a map-reduce strategy to summarize long documents by breaking them down into manageable chunks, summarizing each chunk, and combining the summaries (summarizing again, if needed).
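To make the map-reduce idea concrete before bringing in LangChain, here is a small library-free sketch. It is not how LangChain implements its chain; it only illustrates the strategy. The function names, the character-based chunking, and the 500-character default are assumptions made for this example, and `summarize_fn` stands for any callable that takes a string and returns a shorter one (for instance, a wrapper around your endpoint).

```python
def split_into_chunks(text, chunk_size=500):
    """Split text on whitespace into chunks of at most chunk_size characters.

    A single word longer than chunk_size becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > chunk_size and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def map_reduce_summarize(text, summarize_fn, chunk_size=500):
    """Summarize each chunk (map), then summarize the joined summaries (reduce)."""
    chunks = split_into_chunks(text, chunk_size)
    if len(chunks) <= 1:
        return summarize_fn(text)
    partial = [summarize_fn(chunk) for chunk in chunks]  # map step
    combined = "\n".join(partial)
    if len(combined) >= len(text):
        # The summaries did not shrink the text; stop instead of recursing forever.
        return summarize_fn(combined)
    # Reduce step: repeat the process in case the combined text is still too long.
    return map_reduce_summarize(combined, summarize_fn, chunk_size)
```

The recursion mirrors the "summarizing again, if needed" part of the strategy: the combined summaries are treated as a new, shorter document until they fit in a single call.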
- Let’s install LangChain to begin:
- Import the relevant modules and break down the long document into chunks:
- To make LangChain work effectively with Falcon, you need to define the default content handler classes for valid input and output:
- You can define custom prompts as PromptTemplate objects, the main vehicle for prompting with LangChain, for the map-reduce summarization approach. This is an optional step because map and combine prompts are provided by default if the corresponding parameters in the call to load the summarization chain (load_summarize_chain) are undefined.
- LangChain supports LLMs hosted on SageMaker inference endpoints, so instead of using the AWS Python SDK, you can initialize the connection through LangChain for greater accessibility:
- Finally, you can load in a summarization chain and run a summary on the input documents using the following code:
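The code for these steps is not reproduced in this text, so the following is a hedged sketch of how they might fit together rather than the original notebook cells. It assumes LangChain has been installed (for example with `%pip install langchain`) in a release where the import paths shown below exist; they match 2023-era releases, and newer releases have moved several of these classes, so treat the paths as assumptions to check against your installed version. The prompt wording, the chunk size of 500 with an overlap of 20, the `max_new_tokens` value, and the helper names `encode_request`, `decode_response`, `build_summary_chain`, and `summarize_file` are all illustrative. The request and response shapes are the same assumed formats used when querying the endpoint directly.

```python
import json

# Illustrative prompt wording; any template containing {text} works.
MAP_PROMPT = """Write a concise summary of this text in a few complete sentences:

{text}

Concise summary:"""

COMBINE_PROMPT = """Combine all of the following summaries into a final summary of a few complete sentences:

{text}

Final summary:"""


def encode_request(prompt, model_kwargs):
    """Serialize a prompt into the JSON body the endpoint is assumed to expect."""
    return json.dumps({"inputs": prompt, "parameters": model_kwargs}).encode("utf-8")


def decode_response(raw_bytes):
    """Extract the text from the assumed [{"generated_text": ...}] response."""
    return json.loads(raw_bytes.decode("utf-8"))[0]["generated_text"]


def build_summary_chain(endpoint_name, region_name):
    """Create a map-reduce summarization chain backed by a SageMaker endpoint."""
    # Local imports: the helpers above remain usable without LangChain installed.
    from langchain.chains.summarize import load_summarize_chain
    from langchain.llms import SagemakerEndpoint
    from langchain.llms.sagemaker_endpoint import LLMContentHandler
    from langchain.prompts import PromptTemplate

    class ContentHandlerTextSummarization(LLMContentHandler):
        content_type = "application/json"
        accepts = "application/json"

        def transform_input(self, prompt, model_kwargs):
            return encode_request(prompt, model_kwargs)

        def transform_output(self, output):
            return decode_response(output.read())

    llm = SagemakerEndpoint(
        endpoint_name=endpoint_name,
        region_name=region_name,
        model_kwargs={"max_new_tokens": 200},
        content_handler=ContentHandlerTextSummarization(),
    )
    return load_summarize_chain(
        llm=llm,
        chain_type="map_reduce",
        map_prompt=PromptTemplate(template=MAP_PROMPT, input_variables=["text"]),
        combine_prompt=PromptTemplate(template=COMBINE_PROMPT, input_variables=["text"]),
        verbose=True,
    )


def summarize_file(path, endpoint_name, region_name):
    """Split a text file into chunks and run the summarization chain on them."""
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    with open(path, encoding="utf-8") as f:
        text = f.read()
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
    documents = splitter.create_documents([text])
    chain = build_summary_chain(endpoint_name, region_name)
    return chain({"input_documents": documents})["output_text"]
```

Keeping the serialization logic in two small functions means that adapting the sketch to a model with a different payload format only requires changing `encode_request` and `decode_response`.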
Because the verbose parameter is set to True, you’ll see all of the intermediate outputs of the map-reduce approach. This is useful for following the sequence of events to arrive at a final summary. With this map-reduce approach, you can effectively summarize documents much longer than is normally allowed by the model’s maximum input token limit.
Clean up
After you’ve finished using the inference endpoint, it’s important to delete it to avoid incurring unnecessary costs. You can do so with the following lines of code:
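The cleanup code is not reproduced in this text. One possible sketch, using the Boto3 `sagemaker` client, is shown below. The function name `delete_endpoint_resources` and the optional `client` argument are assumptions made for this example. The sketch also deletes the endpoint configuration, which goes slightly beyond deleting the endpoint itself.

```python
def delete_endpoint_resources(endpoint_name, client=None):
    """Delete an endpoint and the endpoint configuration it was created from."""
    if client is None:
        import boto3  # imported lazily so a stub client can be passed instead
        client = boto3.client("sagemaker")
    # Look up the configuration name before the endpoint is removed.
    description = client.describe_endpoint(EndpointName=endpoint_name)
    config_name = description["EndpointConfigName"]
    client.delete_endpoint(EndpointName=endpoint_name)
    client.delete_endpoint_config(EndpointConfigName=config_name)
    return config_name
```

The model resource behind the endpoint is left in place by this sketch; remove it separately if you no longer need it.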
Using other foundation models in SageMaker JumpStart
Using other foundation models available in SageMaker JumpStart for document summarization requires minimal overhead to set up and deploy. LLMs vary in the structure of their input and output formats, and as new models and pre-made solutions are added to SageMaker JumpStart, depending on the task implementation, you may have to make the following code changes:
- If you are performing summarization via the summarize() method (the method without using LangChain), you may have to change the JSON structure of the payload parameter, as well as the handling of the response variable in the query_endpoint() function
- If you are performing summarization via LangChain’s load_summarize_chain() method, you may have to modify the ContentHandlerTextSummarization class, specifically the transform_input() and transform_output() functions, to correctly handle the payload that the LLM expects and the output the LLM returns
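One way to keep such changes contained is to isolate the request and response handling per model, as in the sketch below. The `falcon` entry uses the format assumed throughout this post. The `hypothetical-model` entry and its field names are invented purely for illustration and do not describe any particular model; check the model's own documentation for the real format.

```python
import json


def falcon_request(prompt, **params):
    """Request body in the format assumed for Falcon in this post."""
    return {"inputs": prompt, "parameters": params}


def falcon_response(raw):
    """Parse the assumed [{"generated_text": ...}] response."""
    return json.loads(raw)[0]["generated_text"]


def hypothetical_request(prompt, **params):
    """Invented format: prompt and parameters share one flat JSON object."""
    return {"prompt": prompt, **params}


def hypothetical_response(raw):
    """Invented format: the text sits under a top-level "output" key."""
    return json.loads(raw)["output"]


# Map a model label to its (build request, parse response) pair.
ADAPTERS = {
    "falcon": (falcon_request, falcon_response),
    "hypothetical-model": (hypothetical_request, hypothetical_response),
}


def build_and_parse(model, prompt, raw_response, **params):
    """Return the request body and parsed text for the chosen model format."""
    make_request, parse_response = ADAPTERS[model]
    return make_request(prompt, **params), parse_response(raw_response)
```

With this layout, switching models means adding one pair of functions and leaving the rest of the summarization code unchanged.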
Foundation models vary not only in factors such as inference speed and quality, but also input and output formats. Refer to the LLM’s relevant information page on expected input and output.
Conclusion
The Falcon 7B Instruct model is available on the SageMaker JumpStart model hub and performs well across a number of use cases. This post demonstrated how you can deploy your own Falcon LLM endpoint into your environment using SageMaker JumpStart and run your first experiments from SageMaker Studio, allowing you to rapidly prototype your models and seamlessly transition to a production environment. With Falcon and LangChain, you can effectively summarize long-form healthcare and life sciences documents at scale.
For more information on working with generative AI on AWS, refer to Announcing New Tools for Building with Generative AI on AWS. You can start experimenting and building document summarization proofs of concept for your healthcare and life science-oriented GenAI applications using the method outlined in this post. When Amazon Bedrock is generally available, we will publish a follow-up post showing how you can implement document summarization using Amazon Bedrock and LangChain.
About the Authors
John Kitaoka is a Solutions Architect at Amazon Web Services. John helps customers design and optimize AI/ML workloads on AWS to help them achieve their business goals.
Josh Famestad is a Solutions Architect at Amazon Web Services. Josh works with public sector customers to build and execute cloud based approaches to deliver on business priorities.