Automate caption creation and search for images at enterprise scale using generative AI and Amazon Kendra

Amazon Kendra is an intelligent search service powered by machine learning (ML). Amazon Kendra reimagines search for your websites and applications so your employees and customers can easily find the content they’re looking for, even when it’s scattered across multiple locations and content repositories within your organization.

Amazon Kendra supports a variety of document formats, such as Microsoft Word, PDF, and text from various data sources. In this post, we focus on extending the document support in Amazon Kendra to make images searchable by their displayed content. Images can often be searched using supplemented metadata such as keywords. However, it takes a lot of manual effort to add detailed metadata to potentially thousands of images. Generative AI (GenAI) can help by generating the metadata automatically. By generating textual captions, the GenAI caption predictions offer descriptive metadata for images. The Amazon Kendra index can then be enriched with the generated metadata during document ingestion to enable searching the images without any manual effort.

As an example, a GenAI model can be used to generate a textual description for the following image as “a dog laying on the ground under an umbrella” during document ingestion of the image.

Image of a dog laying under an umbrella as an example of what can be searched in this solution

An object recognition model can still detect keywords such as “dog” and “umbrella,” but a GenAI model offers a deeper understanding of what is represented in the image by identifying that the dog lies under the umbrella. This helps us build more refined searches in the image search process. The textual description is added as metadata to an Amazon Kendra search index via an automated custom document enrichment (CDE). Users searching for terms like “dog” or “umbrella” will then be able to find the image, as shown in the following screenshot.

Image of Kendra search tool

In this post, we show how to use CDE in Amazon Kendra using a GenAI model deployed on Amazon SageMaker. We demonstrate CDE using simple examples and provide a step-by-step guide for you to experience CDE in an Amazon Kendra index in your own AWS account. The solution allows users to quickly and easily find the images they need without having to manually tag or categorize them. It can also be customized and scaled to meet the needs of different applications and industries.

Image captioning with GenAI

Image description with GenAI involves using ML algorithms to generate textual descriptions of images. The process is also known as image captioning, and operates at the intersection of computer vision and natural language processing (NLP). It has applications in areas where data is multi-modal, such as ecommerce, where data contains text in the form of metadata as well as images, or healthcare, where data could contain MRIs or CT scans along with doctor’s notes and diagnoses, to name a few use cases.

GenAI models learn to recognize objects and features within the images, and then generate descriptions of those objects and features in natural language. The state-of-the-art models use an encoder-decoder architecture, where the image information is encoded in the intermediate layers of the neural network and decoded into textual descriptions. These can be considered as two distinct stages: feature extraction from images and textual caption generation. In the feature extraction stage (encoder), the GenAI model processes the image to extract relevant visual features, such as object shapes, colors, and textures. In the caption generation stage (decoder), the model generates a natural language description of the image based on the extracted visual features.

GenAI models are typically trained on vast amounts of data, which makes them suitable for various tasks without additional training. Adapting to custom datasets and new domains is also easily achievable through few-shot learning. Pre-training methods allow multi-modal applications to be easily trained using state-of-the-art language and image models. These pre-training methods also allow you to mix and match the vision model and language model that best fit your data.

The quality of the generated image descriptions depends on the quality and size of the training data, the architecture of the GenAI model, and the quality of the feature extraction and caption generation algorithms. Although image description with GenAI is an active area of research, it shows very good results in a wide range of applications, such as image search, visual storytelling, and accessibility for people with visual impairments.

Use cases

GenAI image captioning is useful in the following use cases:

  • Ecommerce – A common industry use case where images and text occur together is retail. Ecommerce in particular stores vast amounts of data as product images along with textual descriptions. The textual description or metadata is important to ensure that the best products are displayed to the user based on the search queries. Moreover, with the trend of ecommerce sites obtaining data from 3P vendors, the product descriptions are often incomplete, which leads to numerous manual hours and huge overhead from tagging the right information in the metadata columns. GenAI-based image captioning is particularly useful for automating this laborious process. Fine-tuning the model on custom fashion data, such as fashion images along with text describing the attributes of fashion products, can be used to generate metadata that then improves a user’s search experience.
  • Marketing – Another use case of image search is digital asset management. Marketing firms store vast amounts of digital data that needs to be centralized, easily searchable, and scalable, enabled by data catalogs. A centralized data lake with informative data catalogs would reduce duplication efforts and enable wider sharing of creative content and consistency between teams. For graphic design platforms popularly used for social media content generation, or for presentations in corporate settings, a faster search could improve the user experience by rendering the correct search results for the images that users want to look for and enabling users to search using natural language queries.
  • Manufacturing – The manufacturing industry stores a lot of image data like architecture blueprints of components, buildings, hardware, and equipment. The ability to search through such data enables product teams to easily recreate designs from a starting point that already exists and eliminates a lot of design overhead, thereby speeding up the process of design generation.
  • Healthcare – Doctors and medical researchers can catalog and search through MRIs and CT scans, specimen samples, images of ailments such as rashes and deformities, along with doctor’s notes, diagnoses, and clinical trial details.
  • Metaverse or augmented reality – Advertising a product is about creating a story that users can imagine and relate to. With AI-powered tools and analytics, it has become easier than ever to build not just one story but customized stories that appeal to end-users’ unique tastes and sensibilities. This is where image-to-text models can be a game changer. Visual storytelling can assist in creating characters, adapting them to different styles, and captioning them. It can also be used to power stimulating experiences in the metaverse or augmented reality and immersive content including video games. Image search enables developers, designers, and teams to search their content using natural language queries, which can maintain consistency of content between various teams.
  • Accessibility of digital content for blind and low vision – This is primarily enabled by assistive technologies such as screenreaders, Braille systems that allow touch reading and writing, and special keyboards for navigating websites and applications across the internet. Images, however, need to be delivered as textual content that can then be communicated as speech. Image captioning using GenAI algorithms is a crucial piece for redesigning the internet and making it more inclusive by providing everyone a chance to access, understand, and interact with online content.

Model details and model fine-tuning for custom datasets

In this solution, we take advantage of the vit-gpt2-image-captioning model available from Hugging Face, which is licensed under Apache 2.0, without performing any further fine-tuning. ViT is a foundational model for image data, and GPT-2 is a foundational model for language. The multi-modal combination of the two offers the capability of image captioning. Hugging Face hosts state-of-the-art image captioning models, which can be deployed in AWS in a few clicks and offer simple-to-deploy inference endpoints. Although we can use this pre-trained model directly, we can also customize the model to fit domain-specific datasets, more data types such as video or spatial data, and unique use cases. There are several GenAI models, and some perform best with certain datasets, or your team might already be using vision and language models. This solution offers the flexibility of choosing the best-performing vision and language model as the image captioning model by replacing the model we have used.
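To make the encoder-decoder pairing concrete, the following is a minimal sketch of captioning local images with the Hugging Face transformers library. It is not the inference code shipped with this solution. The Hub ID nlpconnect/vit-gpt2-image-captioning, the helper names, and the generation settings (max_length, num_beams) are assumptions for illustration.

```python
from typing import List

# Assumed Hub ID; the post names only the model, not the repository that hosts it.
MODEL_ID = "nlpconnect/vit-gpt2-image-captioning"


def clean_caption(text: str) -> str:
    """Collapse runs of whitespace and strip both ends of a decoded caption."""
    return " ".join(text.split())


def caption_images(image_paths: List[str], max_length: int = 16, num_beams: int = 4) -> List[str]:
    """Generate one caption per image with a ViT encoder and a GPT-2 decoder."""
    # Heavy dependencies are imported lazily so clean_caption works without them.
    from PIL import Image
    from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

    model = VisionEncoderDecoderModel.from_pretrained(MODEL_ID)
    processor = ViTImageProcessor.from_pretrained(MODEL_ID)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    images = [Image.open(path).convert("RGB") for path in image_paths]
    # Encoder input: pixel tensors produced by the image processor.
    pixel_values = processor(images=images, return_tensors="pt").pixel_values
    # Decoder: beam search over caption tokens, conditioned on the image features.
    output_ids = model.generate(pixel_values, max_length=max_length, num_beams=num_beams)
    captions = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    return [clean_caption(caption) for caption in captions]
```

Swapping in a different vision or language model would, under these assumptions, mostly mean changing MODEL_ID to another checkpoint that follows the same encoder-decoder interface.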

For customization of the models to unique industry applications, open-source models available on AWS through Hugging Face offer several possibilities. A pre-trained model can be tested on the unique dataset or trained on samples of the labeled data to fine-tune it. Novel research methods also allow any combination of vision and language models to be combined efficiently and trained on your dataset. This newly trained model can then be deployed in SageMaker for the image captioning described in this solution.

An example of a customized image search is Enterprise Resource Planning (ERP). In ERP, image data collected from different stages of logistics or supply chain management could include tax receipts, vendor orders, payslips, and more, which need to be automatically categorized for the purview of different teams within the organization. Another example is to use medical scans and doctor diagnoses to train a model that classifies new medical images automatically. The vision model extracts features from the MRI, CT, or X-ray images, and the text model captions them with the medical diagnoses.

Solution overview

The following diagram shows the architecture for image search with GenAI and Amazon Kendra.

Architecture of proposed solution

We ingest images from Amazon Simple Storage Service (Amazon S3) into Amazon Kendra. During ingestion to Amazon Kendra, the GenAI model hosted on SageMaker is invoked to generate an image description. Additionally, text visible in an image is extracted by Amazon Textract. The image description and the extracted text are stored as metadata and made available to the Amazon Kendra search index. After ingestion, images can be searched via the Amazon Kendra search console, API, or SDK.

We use the advanced operations of CDE in Amazon Kendra to call the GenAI model and Amazon Textract during the image ingestion step. However, we can use CDE for a wider range of use cases. With CDE, you can create, modify, or delete document attributes and content when you ingest your documents into Amazon Kendra. This means you can manipulate and ingest your data as needed. This can be achieved by invoking pre- and post-extraction AWS Lambda functions during ingestion, which allows for data enrichment or modification. For example, we can use Amazon Comprehend Medical when ingesting medical textual data to add ML-generated insights to the search metadata.
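As an illustration of what a pre-extraction Lambda function might look like, the following sketch builds a handler that reads the S3 location from the CDE event and returns metadata updates. The event and response field names (s3Bucket, s3ObjectKey, metadataUpdates, version) reflect our understanding of the CDE Lambda contract and should be checked against the Amazon Kendra documentation. The attribute names image_caption and image_text and the helper names are hypothetical, and the two enrichment callables stand in for the calls to SageMaker and Amazon Textract.

```python
from typing import Callable, Dict, List


def build_metadata_updates(caption: str, extracted_text: str) -> List[Dict]:
    """Format enrichment results as CDE document attribute updates."""
    # "image_caption" and "image_text" are hypothetical attribute names.
    return [
        {"name": "image_caption", "value": {"stringValue": caption}},
        {"name": "image_text", "value": {"stringValue": extracted_text}},
    ]


def make_handler(
    caption_fn: Callable[[str, str], str],
    text_fn: Callable[[str, str], str],
) -> Callable[[Dict, object], Dict]:
    """Create a Lambda handler from two callables that map (bucket, key) to text."""

    def handler(event: Dict, context: object) -> Dict:
        bucket = event["s3Bucket"]
        key = event["s3ObjectKey"]
        caption = caption_fn(bucket, key)
        text = text_fn(bucket, key)
        return {
            "version": "v0",  # assumed contract version
            "s3ObjectKey": key,  # the document itself is left unchanged
            "metadataUpdates": build_metadata_updates(caption, text),
        }

    return handler
```

Passing the enrichment steps in as callables keeps the handler easy to test locally with stub functions before wiring it to real AWS clients.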

You can use our solution to search images through Amazon Kendra by following these steps:

  1. Upload images to an image repository like an S3 bucket.
  2. The image repository is then indexed by Amazon Kendra, which is a search engine that can be used to search for structured and unstructured data. During indexing, the GenAI model as well as Amazon Textract are invoked to generate the image metadata. You can trigger the indexing manually or on a predefined schedule.
  3. You can then search for images using natural language queries, such as “Find images of purple roses” or “Show me pictures of dogs playing in the park,” through the Amazon Kendra console, SDK, or API. These queries are processed by Amazon Kendra, which uses ML algorithms to understand the meaning behind the queries and retrieve relevant images from the indexed repository.
  4. The search results are presented to you, along with their corresponding textual descriptions, allowing you to quickly and easily find the images you’re looking for.


Prerequisites

You must have the following prerequisites:

  • An AWS account
  • Permissions to provision and invoke the following services via AWS CloudFormation: Amazon S3, Amazon Kendra, Lambda, and Amazon Textract.

Cost estimate

The cost of deploying this solution as a proof of concept is projected in the following table. Because this is a proof of concept, we use the Amazon Kendra Developer Edition, which is not recommended for production workloads but provides a low-cost option for developers. We assume that the search functionality of Amazon Kendra is used for 20 working days for 3 hours each day, and therefore calculate associated costs for 60 monthly active hours.

Service | Time Consumed | Cost Estimate per Month
Amazon S3 | Storage of 10 GB with data transfer | 2.30 USD
Amazon Kendra | Developer Edition with 60 hours/month | 67.90 USD
Amazon Textract | 100% detect document text on 10,000 images | 15.00 USD
Amazon SageMaker | Real-time inference with ml.g4dn.xlarge for one model deployed on one endpoint for 3 hours every day for 20 days | 44.00 USD
Total | | 129.20 USD
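The total can be checked with a few lines of arithmetic. The following sketch only re-adds the per-service figures from the table and recomputes the monthly active hours from the stated assumption. It does not derive the individual estimates from AWS pricing, and the variable names are our own.

```python
from decimal import Decimal

# Per-service monthly estimates, copied from the table above (USD).
monthly_costs_usd = {
    "Amazon S3": Decimal("2.30"),
    "Amazon Kendra": Decimal("67.90"),
    "Amazon Textract": Decimal("15.00"),
    "Amazon SageMaker": Decimal("44.00"),
}

# 20 working days at 3 hours per day.
active_hours = 20 * 3

# Decimal avoids binary floating-point rounding when adding currency amounts.
total_usd = sum(monthly_costs_usd.values(), Decimal("0"))

print(f"Active hours per month: {active_hours}")
print(f"Estimated total per month: {total_usd} USD")
```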

Deploy resources with AWS CloudFormation

The CloudFormation stack deploys the following resources:

  • A Lambda function that downloads the image captioning model from Hugging Face hub and subsequently builds the model assets
  • A Lambda function that populates the inference code and zipped model artifacts to a destination S3 bucket
  • An S3 bucket for storing the zipped model artifacts and inference code
  • An S3 bucket for storing the uploaded images and Amazon Kendra documents
  • An Amazon Kendra index for searching through the generated image captions
  • A SageMaker real-time inference endpoint for deploying the Hugging Face image captioning model
  • A Lambda function that is triggered while enriching the Amazon Kendra index on demand. It invokes Amazon Textract and a SageMaker real-time inference endpoint.
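The two service calls made by the enrichment Lambda function can be sketched as follows. This is an assumption-laden illustration rather than the deployed code: the clients are expected to be boto3 clients for textract and sagemaker-runtime, and the endpoint's content type (application/x-image) and response shape ({"caption": ...}) are hypothetical because they depend on the inference code behind the endpoint.

```python
import json
from typing import Any, Dict, List


def extract_lines(textract_response: Dict[str, Any]) -> List[str]:
    """Collect the text of LINE blocks from a DetectDocumentText response."""
    return [
        block["Text"]
        for block in textract_response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


def detect_image_text(textract_client: Any, bucket: str, key: str) -> str:
    """Run Amazon Textract on an image stored in S3 and join the detected lines."""
    response = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return " ".join(extract_lines(response))


def caption_image(runtime_client: Any, endpoint_name: str, image_bytes: bytes) -> str:
    """Send raw image bytes to a SageMaker endpoint and read back a caption."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/x-image",  # assumed; set by the inference code
        Body=image_bytes,
    )
    payload = json.loads(response["Body"].read())
    # Assumed response shape: {"caption": "..."}.
    return payload["caption"]


# Example wiring (requires AWS credentials):
#   import boto3
#   text = detect_image_text(boto3.client("textract"), "my-bucket", "images/dog.jpg")
```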

Additionally, AWS CloudFormation deploys all the necessary AWS Identity and Access Management (IAM) roles and policies, a VPC along with subnets, a security group, and an internet gateway in which the custom resource Lambda function is run.

Complete the following steps to provision your resources:

  1. Choose Launch stack to launch the CloudFormation template in the us-east-1 Region:
  2. Choose Next.
  3. On the Specify stack details page, leave the template URL and S3 URI of the parameters file at their defaults, then choose Next.
  4. Continue to choose Next on the subsequent pages.
  5. Choose Create stack to deploy the stack.

Monitor the status of the stack. When the status shows as CREATE_COMPLETE, the deployment is complete.
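If you prefer to monitor the stack from a script instead of the console, a minimal polling sketch could look like the following. The function names and polling defaults are our own, and the client is assumed to be a boto3 CloudFormation client. boto3 also ships a built-in stack_create_complete waiter that covers the same need.

```python
import time
from typing import Any


def get_stack_status(cfn_client: Any, stack_name: str) -> str:
    """Return the current status string of a CloudFormation stack."""
    response = cfn_client.describe_stacks(StackName=stack_name)
    return response["Stacks"][0]["StackStatus"]


def wait_for_stack(
    cfn_client: Any,
    stack_name: str,
    target: str = "CREATE_COMPLETE",
    poll_seconds: float = 30.0,
    max_polls: int = 120,
) -> bool:
    """Poll until the stack reaches the target status; False on failure or timeout."""
    for _ in range(max_polls):
        status = get_stack_status(cfn_client, stack_name)
        if status == target:
            return True
        # Any failed or rollback state means the target will not be reached.
        if status.endswith("FAILED") or "ROLLBACK" in status:
            return False
        time.sleep(poll_seconds)
    return False


# Example wiring (requires AWS credentials):
#   import boto3
#   ok = wait_for_stack(boto3.client("cloudformation", region_name="us-east-1"),
#                       "kendra-genai-image-search")
```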

Ingest and search example images

Complete the following steps to ingest and search your images:

  1. On the Amazon S3 console, create a folder called images in the kendra-image-search-stack-imagecaptions S3 bucket in the us-east-1 Region.
  2. Upload the following images to the images folder.

Image of a beach to test with the kendra image search using automated text captioning
Image of a dog celebrating a birthday to test with the kendra image search using automated text captioning
Image of a dog under an umbrella to test with the kendra image search using automated text captioning
Image of a tablet, notebook and coffee on a desk to test with the kendra image search using automated text captioning

  1. Navigate to the Amazon Kendra console in the us-east-1 Region.
  2. In the navigation pane, choose Indexes, then choose your index (kendra-index).
  3. Choose Data sources, then choose generated_image_captions.
  4. Choose Sync now.

Wait for the synchronization to complete before continuing to the next steps.
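The upload and sync steps can also be scripted. The following sketch assumes boto3 clients for S3 and Amazon Kendra are passed in. The helper names are hypothetical, and the index ID and data source ID are values you would look up in your own account, because the console steps above refer to them only by name.

```python
import os
from typing import Any, List


def upload_images(s3_client: Any, bucket: str, local_paths: List[str], prefix: str = "images/") -> List[str]:
    """Upload local image files under the given prefix and return their S3 keys."""
    keys = []
    for path in local_paths:
        key = prefix + os.path.basename(path)
        s3_client.upload_file(path, bucket, key)
        keys.append(key)
    return keys


def start_sync(kendra_client: Any, index_id: str, data_source_id: str) -> str:
    """Start a data source sync job and return its execution ID."""
    response = kendra_client.start_data_source_sync_job(Id=data_source_id, IndexId=index_id)
    return response["ExecutionId"]


# Example wiring (requires AWS credentials; the IDs are placeholders):
#   import boto3
#   s3 = boto3.client("s3", region_name="us-east-1")
#   kendra = boto3.client("kendra", region_name="us-east-1")
#   upload_images(s3, "kendra-image-search-stack-imagecaptions", ["dog.jpg"])
#   start_sync(kendra, "<index-id>", "<data-source-id>")
```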

  1. In the navigation pane, choose Indexes, then choose kendra-index.
  2. Navigate to the search console.
  3. Try the following queries individually or combined: “dog,” “umbrella,” and “publication,” and find out which images are ranked high by Amazon Kendra.

Feel free to test your own queries that match the uploaded images.
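The same queries can be issued through the SDK. The following is a minimal sketch that assumes a boto3 Amazon Kendra client and an index ID from your account. The function name and the shape of the returned dictionaries are our own choices.

```python
from typing import Any, Dict, List


def search_images(kendra_client: Any, index_id: str, query_text: str, top_k: int = 5) -> List[Dict[str, str]]:
    """Query an Amazon Kendra index and return URI, title, and excerpt per result."""
    response = kendra_client.query(IndexId=index_id, QueryText=query_text, PageSize=top_k)
    results = []
    for item in response.get("ResultItems", []):
        results.append(
            {
                "uri": item.get("DocumentURI", ""),
                "title": item.get("DocumentTitle", {}).get("Text", ""),
                "excerpt": item.get("DocumentExcerpt", {}).get("Text", ""),
            }
        )
    return results


# Example wiring (requires AWS credentials; the index ID is a placeholder):
#   import boto3
#   kendra = boto3.client("kendra", region_name="us-east-1")
#   for hit in search_images(kendra, "<index-id>", "dog umbrella"):
#       print(hit["uri"], "-", hit["excerpt"])
```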

Clean up

To deprovision all the resources, complete the following steps:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Select the stack kendra-genai-image-search and choose Delete.

Wait until the stack status changes to DELETE_COMPLETE.
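For a scripted cleanup, the following sketch deletes the stack and blocks until deletion finishes, using the stack_delete_complete waiter that boto3 provides for CloudFormation. The function name is our own, and the client is assumed to be a boto3 CloudFormation client in the us-east-1 Region.

```python
from typing import Any


def delete_stack_and_wait(cfn_client: Any, stack_name: str) -> None:
    """Request stack deletion and block until CloudFormation reports completion."""
    cfn_client.delete_stack(StackName=stack_name)
    # The waiter polls the stack status and raises an error if deletion fails.
    waiter = cfn_client.get_waiter("stack_delete_complete")
    waiter.wait(StackName=stack_name)


# Example wiring (requires AWS credentials):
#   import boto3
#   delete_stack_and_wait(boto3.client("cloudformation", region_name="us-east-1"),
#                         "kendra-genai-image-search")
```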


Conclusion

In this post, we saw how Amazon Kendra and GenAI can be combined to automate the creation of meaningful metadata for images. State-of-the-art GenAI models are extremely useful for generating text captions describing the content of an image. This has several industry use cases, including healthcare and life sciences, retail and ecommerce, digital asset platforms, and media. Image captioning is also crucial for building a more inclusive digital world and redesigning the internet, metaverse, and immersive technologies to cater to the needs of visually challenged sections of society.

Image search enabled through captions makes digital content easily searchable without manual effort for these applications, and removes duplication efforts. The CloudFormation template we provided makes it straightforward to deploy this solution to enable image search using Amazon Kendra. A simple architecture of images stored in Amazon S3 and GenAI to create textual descriptions of the images can be used with CDE in Amazon Kendra to power this solution.

This is only one application of GenAI with Amazon Kendra. To dive deeper into how to build GenAI applications with Amazon Kendra, refer to Quickly build high-accuracy Generative AI applications on enterprise data using Amazon Kendra, LangChain, and large language models. For building and scaling GenAI applications, we recommend checking out Amazon Bedrock.

About the Authors

Charalampos Grouzakis is a Data Scientist within AWS Professional Services. He has over 11 years of experience in developing and leading data science, machine learning, and big data initiatives. Currently he is helping enterprise customers modernize their AI/ML workloads within the cloud using industry best practices. Prior to joining AWS, he was consulting customers in various industries such as Automotive, Manufacturing, Telecommunications, Media & Entertainment, Retail and Financial Services. He is passionate about enabling customers to accelerate their AI/ML journey in the cloud and to drive tangible business outcomes.

Bharathi Srinivasan is a Data Scientist at AWS Professional Services where she loves to build cool things on SageMaker. She is passionate about driving business value from machine learning applications, with a focus on ethical AI. Outside of building new AI experiences for customers, Bharathi loves to write science fiction and challenge herself with endurance sports.

Jean-Michel Lourier is a Senior Data Scientist within AWS Professional Services. He leads teams implementing data driven applications side by side with AWS customers to generate business value out of their data. He’s passionate about diving into tech and learning about AI, machine learning, and their business applications. He is also an enthusiastic cyclist, taking long bike-packing trips.

Tanvi Singhal is a Data Scientist within AWS Professional Services. Her skills and areas of expertise include data science, machine learning, and big data. She supports customers in developing machine learning models and MLOps solutions within the cloud. Prior to joining AWS, she was also a consultant in various industries such as Transportation Networking, Retail and Financial Services. She is passionate about enabling customers on their data/AI journey to the cloud.

Abhishek Maligehalli Shivalingaiah is a Senior AI Services Solution Architect at AWS with focus on Amazon Kendra. He is passionate about building applications using Amazon Kendra, Generative AI, and NLP. He has around 10 years of experience in building Data & AI solutions to create value for customers and enterprises. He has built a (personal) chatbot for fun to answer questions about his career and professional journey. Outside of work he enjoys making portraits of family & friends, and loves creating artworks.
