Context-aware code generation: RAG and Vertex AI Codey APIs

Retrieval-augmented generation, or RAG, uses external data or information to improve the accuracy of large language models (LLMs). Today, we’ll explore how to use RAG to improve the output quality of Google Cloud AI models for code completion and generation on Vertex AI using its Codey APIs, a suite of code generation models that help software developers complete coding tasks faster. There are three Codey APIs that help boost developer productivity:

Code completion: Get instant code suggestions based on your current context, making coding a seamless and efficient experience. This API is designed to be integrated into IDEs, editors, and other applications to provide low-latency code autocompletion suggestions as you write code.
Code generation: Generate code snippets for functions, classes, and more in seconds by describing the code you need in natural language. This API can be helpful when you need to write a lot of code quickly or when you’re not sure how to start. It can be integrated into IDEs, editors, and other applications including CI/CD workflows.
Code chat: Get help on your coding journey throughout the software development lifecycle, from debugging tricky issues to expanding your knowledge with insightful suggestions and answers. This multi-turn chat API can be integrated into IDEs and editors as a chat assistant. It can also be used in batch workflows.

These models also integrate Responsible AI capabilities, such as source citation and toxicity checking, which automatically cite or block code based on Responsible AI guidelines set by Google.

The Codey APIs deliver far more than generic code generation, allowing you to tailor code output to your organization’s specific style and securely access private code repositories based on your organization’s guidelines. The ability to customize these models helps you generate code that complies with established coding standards and conventions while leveraging custom endpoints and proprietary codebases for code generation tasks.

To achieve this level of customization, you can tune the models on specific datasets, such as your company’s codebase. Alternatively, you can use RAG to incorporate external knowledge sources into the code generation process, which we discuss in detail below.

What is RAG?

Traditional large language models are limited to the knowledge captured in their training data, which can lead to responses that are irrelevant or lack context. RAG addresses this issue by integrating an external retrieval system into LLMs, enabling them to access and utilize relevant information on the fly.

This technique allows LLMs to retrieve information from an authoritative external source, augment their input with relevant context, and generate more informed, accurate responses. Code generation models, for instance, can use RAG to fetch relevant information from existing code repositories and use it to create accurate code, documentation, or even fix code errors.

How does RAG work?

Implementing RAG requires a robust retrieval system capable of delivering relevant documents based on user queries.

Here’s a quick overview of how a RAG system works for code generation:

The retrieval mechanism fetches relevant information from a data source. This information can be in the form of code, text, or other types of data.
The generation mechanism (your code generation LLM) uses the retrieved information to generate its output.
Because the output is grounded in the retrieved context, the generated code is more relevant to the input query or question.
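
To make the three steps above concrete, here is a minimal Python sketch of the retrieve, augment, and generate flow. It is an illustration under stated assumptions, not the Codey API itself: the snippet list, the token-overlap retriever, the prompt wording, and the `llm` callable are all hypothetical stand-ins. In a real system the retriever would query your own code repositories and `llm` would wrap a call to a code generation model.

```python
import re
from typing import Callable, List, Set

# Hypothetical stand-in for a private code repository.
SNIPPETS = [
    "def get_user(user_id): return db.query('users', id=user_id)",
    "def send_email(to, subject, body): return mailer.send(to, subject, body)",
    "def parse_config(path): return yaml.safe_load(open(path))",
]


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, corpus: List[str], k: int = 1) -> List[str]:
    """Step 1: return the k snippets sharing the most tokens with the query.

    Token overlap is a deliberately naive scoring rule chosen for brevity.
    """
    query_tokens = tokenize(query)
    ranked = sorted(
        corpus,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, context: List[str]) -> str:
    """Step 2: augment the user's request with the retrieved snippets."""
    context_block = "\n".join(context)
    return (
        "Use the following code from our repository as context:\n"
        f"{context_block}\n\n"
        f"Task: {query}\n"
    )


def generate_with_rag(
    query: str, corpus: List[str], llm: Callable[[str], str], k: int = 1
) -> str:
    """Step 3: pass the augmented prompt to the model.

    `llm` is an assumed callable wrapping whichever model you use.
    """
    context = retrieve(query, corpus, k)
    return llm(build_prompt(query, context))


if __name__ == "__main__":
    # Placeholder "model" that echoes its prompt, so the script runs offline.
    print(generate_with_rag(
        "send an email with a subject and body", SNIPPETS, llm=lambda p: p
    ))
```

Because the model is passed in as a plain callable, the retrieval and prompt-building logic can be tested without any network access, and the placeholder can later be swapped for a real model client.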

While you can employ various approaches, the most common RAG pattern involves generating embeddings for chunks of source information and indexing them in a vector database, such as Vertex AI Vector Search.
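
The embed-and-index pattern described above can be sketched in a few lines of Python. Everything here is a simplified assumption for illustration: `toy_embed` is a hashed bag-of-words vector standing in for a real embedding model, `chunk_lines` is one arbitrary chunking rule, and `InMemoryIndex` is a brute-force stand-in for a managed vector database such as Vertex AI Vector Search. None of these names come from a Google Cloud SDK.

```python
import hashlib
import math
import re
from typing import List, Tuple

DIM = 64  # Toy dimensionality; real embedding models use far more dimensions.


def toy_embed(text: str) -> List[float]:
    """Hash each token into one of DIM buckets, then L2-normalize the counts."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def chunk_lines(source: str, lines_per_chunk: int = 2) -> List[str]:
    """Split source text into chunks of N non-empty lines each."""
    lines = [line for line in source.splitlines() if line.strip()]
    return [
        "\n".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]


class InMemoryIndex:
    """Brute-force vector index: stores (embedding, chunk) pairs in a list."""

    def __init__(self) -> None:
        self._items: List[Tuple[List[float], str]] = []

    def add(self, chunk: str) -> None:
        self._items.append((toy_embed(chunk), chunk))

    def search(self, query: str, k: int = 1) -> List[Tuple[float, str]]:
        """Return the k chunks with the highest cosine similarity to the query.

        Vectors are unit length, so the dot product equals cosine similarity.
        """
        q = toy_embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), chunk)
            for vec, chunk in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


if __name__ == "__main__":
    source = (
        "def add(a, b):\n    return a + b\n\n"
        "def greet(name):\n    return 'hello ' + name\n"
    )
    index = InMemoryIndex()
    for chunk in chunk_lines(source):
        index.add(chunk)
    for score, chunk in index.search("return hello name"):
        print(round(score, 3), repr(chunk))
```

At query time the query is embedded with the same function as the chunks, and the nearest chunks become the context passed to the code generation model. A production index would replace the linear scan with approximate nearest-neighbor search.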

The diagram below shows a high-level RAG pattern for code generation with Codey APIs.