Understanding the ScaNN index in AlloyDB

Over the past year, vector databases have skyrocketed in popularity and have become the backbone of new semantic search and generative AI experiences. Developers use vector search for everything from product recommendations to image search to enhancing LLM-powered chatbots with retrieval-augmented generation (RAG).

PostgreSQL is one of the most popular operational databases on the market, used by 49% of developers according to Stack Overflow’s 2023 survey, and growing. So, it’s no surprise that pgvector, the most popular PostgreSQL extension for vector search, has become one of the most-loved vector databases on the market. That’s why we launched support for pgvector in Cloud SQL for PostgreSQL and AlloyDB for PostgreSQL in July of last year, adding a few enhancements in AlloyDB AI to optimize performance.

The PostgreSQL community has come a long way since then, introducing support for HNSW, a state-of-the-art graph-based algorithm used in many popular databases. HNSW is supported in both AlloyDB and Cloud SQL. While HNSW offers good query performance for many vector workloads, we’ve heard from some customers that it doesn’t always fit their real-world use cases. Some customers with larger corpora experience issues with index build time and high memory usage; others need fast, real-time index updates or better vector query performance.

That’s why this week we announced the new ScaNN index for AlloyDB, bringing 12 years of Google research and innovation in approximate nearest neighbor algorithms to AlloyDB. This new index uses the same technology that powers some of Google’s most popular services to deliver up to 4x faster vector queries, up to 8x faster index build times and typically a 3-4x smaller memory footprint than the HNSW index in standard PostgreSQL. It also offers up to 10x higher write throughput than the HNSW index in standard PostgreSQL.

The new ScaNN index is available in technology preview in AlloyDB Omni, and will become available in the AlloyDB for PostgreSQL managed service in Google Cloud shortly thereafter.

Vector indexing using ANN algorithms

The most common use case for vectors is to find similar or relevant data. This is accomplished by querying the database for the k vectors that are closest to the query vector in terms of a distance metric such as inner product, cosine similarity, or Euclidean distance. This kind of query is referred to as a “k (exact) nearest neighbors” or “KNN” query.
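To make the KNN query concrete, here is a minimal sketch in plain Python. It is an illustration only, not how pgvector or AlloyDB execute queries; the function names and the toy vectors are assumptions made for this example. Because the helper returns the k smallest scores, inner product is negated and cosine similarity is turned into a distance so that "closest" always means "smallest".

```python
import heapq
import math


def euclidean(a, b):
    # Straight-line (L2) distance between two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    # 1 - cosine similarity, so that more similar vectors get smaller values.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def negative_inner_product(a, b):
    # Negated so that a larger inner product ranks as "closer".
    return -sum(x * y for x, y in zip(a, b))


def knn(query, vectors, k, distance=euclidean):
    # Exact KNN: score every stored vector against the query, keep the k best.
    scored = ((distance(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in heapq.nsmallest(k, scored)]


# Hypothetical toy data: five 2-dimensional vectors.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0], [3.0, 0.0]]
query = [1.0, 0.05]

print(knn(query, vectors, k=2))                                   # Euclidean
print(knn(query, vectors, k=2, distance=negative_inner_product))  # inner product
```

Note that the metric changes the answer: with Euclidean distance the two nearest vectors are indexes 0 and 1, while inner product favors the long vector at index 4.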

Unfortunately, KNN queries don’t scale: an exact answer requires comparing the query vector against every vector in the dataset, so the cost of each query grows linearly with the size of the corpus. This is where Approximate Nearest Neighbor (ANN) search comes in. ANN trades off some accuracy (specifically recall: the algorithm might miss some of the actual nearest neighbors) for big improvements in speed. For many use cases, this tradeoff is worthwhile. Consider, for example, what users expect from a search engine: they’ll happily accept 10 results that are approximately (if not perfectly) the most relevant, if it means they’ll get them in a fraction of a second rather than hours or days.
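Recall can be measured by comparing the IDs an ANN query returns with the IDs an exact KNN query returns for the same query vector. The sketch below shows one common way to compute it; the function name and the ID lists are made up for illustration and do not come from any benchmark.

```python
def recall_at_k(approx_ids, exact_ids):
    # Fraction of the true nearest neighbors that the approximate search found.
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approx_ids)) / len(exact)


# Hypothetical result IDs for a single query with k = 10.
exact_top10 = [3, 17, 42, 8, 99, 23, 5, 61, 70, 12]
approx_top10 = [3, 17, 42, 8, 99, 23, 5, 61, 14, 88]

print(recall_at_k(approx_top10, exact_top10))  # 0.8
```

Here the approximate search found 8 of the 10 true neighbors, for a recall of 0.8. In practice, recall is usually averaged over many query vectors.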

In the database, ANN search uses vector indexes. Although database performance depends on many factors, the underlying ANN index plays a large role in indexing time, query performance, and memory footprint, and determines the fundamental tradeoffs between recall (i.e., accuracy) and latency.

There are two popular types of ANN indexes: graph-based and tree-quantization-based. Graph-based algorithms construct a network of nodes, which are connected by edges based on similarity. pgvector’s HNSW index implements the state-of-the-art Hierarchical Navigable Small World (HNSW) graph algorithm used in many popular vector databases. It organizes the nodes into a hierarchy of layers so that a search can traverse the graph very efficiently to find nearest neighbors. These types of algorithms perform well, especially for small datasets, but have higher memory footprints and longer index build times than tree-quantization-based algorithms.
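The core idea behind graph-based search can be sketched with a single-layer greedy walk: start at an entry node and keep moving to whichever neighbor is closer to the query until no neighbor improves. This is a deliberately simplified illustration, not HNSW itself and not pgvector's implementation. HNSW adds a layer hierarchy and keeps a list of candidates so that it is less likely to get stuck in a local minimum. All names and data below are assumptions for the example.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_neighbor_graph(vectors, m):
    # Connect each node to its m closest nodes.
    # Brute force is acceptable here only because the example is tiny.
    graph = {}
    for i, v in enumerate(vectors):
        others = sorted(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: euclidean(v, vectors[j]),
        )
        graph[i] = others[:m]
    return graph


def greedy_search(query, vectors, graph, entry):
    # Follow edges toward the query; stop when no neighbor is closer.
    current = entry
    current_dist = euclidean(query, vectors[current])
    while True:
        best, best_dist = current, current_dist
        for neighbor in graph[current]:
            d = euclidean(query, vectors[neighbor])
            if d < best_dist:
                best, best_dist = neighbor, d
        if best == current:
            return current  # local minimum: an approximate nearest neighbor
        current, current_dist = best, best_dist


# Hypothetical toy data: ten points along a line.
vectors = [[float(i), 0.0] for i in range(10)]
graph = build_neighbor_graph(vectors, m=2)

print(greedy_search([7.2, 0.0], vectors, graph, entry=0))  # 7
```

The graph itself is part of the index: every node stores its edges in addition to its vector, which is one reason graph-based indexes tend to use more memory.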