
RAG quickstart with Ray, LangChain, and Hugging Face


LLMs rely on their training data, which can quickly fall out of date and may not include data relevant to the application's domain. Re-training or fine-tuning an LLM to provide fresh, domain-specific data can be an expensive and complex process. RAG not only gives the LLM access to such data without training or fine-tuning, but can also guide the LLM toward factual responses, thereby reducing hallucinations and enabling applications to provide human-verifiable source material.

For more background on how RAG works, see our blog on context-aware code generation.
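To make the flow concrete, here is a minimal sketch of the RAG request path in Python, assuming a pgvector store that has already been populated with document embeddings and a Hugging Face TGI endpoint serving the model. The connection string, service URL, model names, and prompt are placeholder assumptions for illustration, not values from the actual solution:

```python
# Minimal RAG request path (sketch). All endpoints, credentials, and model
# names below are illustrative placeholders, not part of the quickstart itself.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFaceTextGenInference
from langchain_community.vectorstores import PGVector

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = PGVector(
    connection_string="postgresql+psycopg2://rag:rag@127.0.0.1:5432/rag",  # placeholder
    collection_name="docs",
    embedding_function=embeddings,
)
llm = HuggingFaceTextGenInference(inference_server_url="http://tgi:8080")  # placeholder

question = "What is our refund policy?"
# Retrieve the most relevant chunks, then ground the LLM's answer in them.
docs = store.similarity_search(question, k=4)
context = "\n\n".join(doc.page_content for doc in docs)
answer = llm.invoke(
    f"Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
print(answer)
```

Because the answer is assembled from retrieved chunks, the application can surface those chunks to the user as verifiable source material.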

AI Infrastructure for RAG

Prior to the rise of Generative AI, a typical application architecture might involve a database, a set of microservices, and a frontend. Even the most basic RAG applications introduce new requirements for serving LLMs, processing, and retrieving unstructured data. To meet these requirements, customers need infrastructure that is optimized specifically for AI workloads. 

Many customers choose to access AI infrastructure like TPUs and GPUs via a fully managed platform, such as Vertex AI. Others, however, prefer to manage their own infrastructure on top of GKE while leveraging open-source frameworks and open models. This blog post is for the latter group. 

Building an AI platform from scratch involves a number of key decisions, such as which frameworks to use for model serving, which machine shapes to use for inference, how to protect sensitive data, how to meet cost and performance requirements, and how to scale as traffic grows. Each decision involves many tradeoffs against a vast and fast-changing landscape of generative AI tools.

This is why we have developed a quickstart solution and reference architecture for RAG applications built on top of GKE, Cloud SQL, and open-source frameworks Ray, LangChain and Hugging Face. Our solution is designed to help you get started quickly and accelerate your journey to production with RAG best practices built-in from the start.

Benefits of RAG on GKE and Cloud SQL

GKE and Cloud SQL accelerate your journey to production in a variety of ways, from autoscaling AI workloads on GKE to managed, pgvector-enabled PostgreSQL on Cloud SQL.

Deploying RAG on GKE and Cloud SQL

Our end-to-end RAG application and reference architecture provide the following:

  1. Google Cloud project – configures your project with the needed prerequisites to run the RAG application, including a GKE cluster and a Cloud SQL for PostgreSQL instance with pgvector

  2. AI frameworks – deploys Ray, JupyterHub, and Hugging Face TGI to GKE

  3. RAG Embedding Pipeline – generates embeddings and populates the Cloud SQL for PostgreSQL instance with pgvector (see the sketch after this list)

  4. Example RAG Chatbot Application – deploys a web-based RAG chatbot to GKE
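Below is a hedged sketch of what an embedding pipeline like item 3 can look like, using Ray Data to parallelize embedding generation and pgvector to store the results. The bucket path, embedding model, table schema, and connection string are illustrative assumptions, not the quickstart's actual configuration:

```python
# Sketch of a Ray Data embedding pipeline writing to Cloud SQL + pgvector.
# Corpus path, model, schema, and DSN are assumptions for illustration only.
import ray
import psycopg2
from pgvector.psycopg2 import register_vector
from sentence_transformers import SentenceTransformer

class Embedder:
    """Stateful Ray worker that embeds batches of text with a local model."""
    def __init__(self):
        self.model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    def __call__(self, batch):
        batch["embedding"] = self.model.encode(list(batch["text"]))
        return batch

# Read the raw corpus and embed it in parallel across a pool of Ray actors.
ds = ray.data.read_text("gs://YOUR_BUCKET/docs/")  # hypothetical corpus location
ds = ds.map_batches(Embedder, concurrency=4, batch_size=64)

# Populate the pgvector table; all-MiniLM-L6-v2 produces 384-dim vectors.
conn = psycopg2.connect("dbname=rag user=rag host=127.0.0.1")  # placeholder DSN
conn.autocommit = True
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)  # lets psycopg2 pass NumPy arrays as vector values
cur.execute(
    "CREATE TABLE IF NOT EXISTS documents "
    "(id bigserial PRIMARY KEY, content text, embedding vector(384))"
)
for row in ds.iter_rows():
    cur.execute(
        "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
        (row["text"], row["embedding"]),
    )
```

The same table can then back similarity queries from the chatbot in item 4, for example by ordering results with pgvector's distance operator and taking the top k.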

