
Efficient semantic search over unstructured text in Neo4j | by Tomaz Bratanic | Aug 2023


Integrate the newly added vector index into LangChain to enhance your RAG applications

Since the introduction of ChatGPT six months ago, the technology landscape has undergone a transformative shift. ChatGPT's exceptional capacity for generalization has reduced the need for specialized deep learning teams and extensive training datasets to create custom NLP models. This has democratized access to a range of NLP tasks, such as summarization and information extraction, making them more readily available than ever before. However, we soon realized the limitations of ChatGPT-like models, such as the knowledge cutoff date and the lack of access to private information. In my opinion, what followed was the second wave of generative AI transformation with the rise of Retrieval Augmented Generation (RAG) applications, where you feed relevant information to the model at query time to construct better and more accurate answers.

RAG application flow. Image by the author. Icons from https://www.flaticon.com/

As mentioned, RAG applications require a smart search tool that can retrieve additional information based on the user input, which allows the LLM to produce more accurate and up-to-date answers. At first, the focus was mostly on retrieving information from unstructured text using semantic search. However, it soon became evident that a combination of structured and unstructured data is the best approach for RAG applications if you want to move beyond "Chat with your PDF" applications.

Neo4j was and is an excellent fit for handling structured information, but it struggled a bit with semantic search due to its brute-force approach. However, the struggle is in the past, as Neo4j has introduced a new vector index in version 5.11 designed to efficiently perform semantic search over unstructured text or other embedded data modalities. The newly added vector index makes Neo4j a great fit for most RAG applications, as it now works well with both structured and unstructured data.

In this blog post, I will show you how to set up a vector index in Neo4j and integrate it into the LangChain ecosystem. The code is available on GitHub.

Neo4j environment setup

You need a Neo4j 5.11 or greater instance to follow along with the examples in this blog post. The easiest way is to start a free instance on Neo4j Aura, which offers cloud instances of the Neo4j database. Alternatively, you can also set up a local instance of the Neo4j database by downloading the Neo4j Desktop application and creating a local database instance.

After you have instantiated the Neo4j database, you can use the LangChain library to connect to it.

from langchain.graphs import Neo4jGraph

NEO4J_URI="neo4j+s://1234.databases.neo4j.io"
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD="-"

graph = Neo4jGraph(
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD
)
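
To confirm the connection works before moving on, you can run a trivial query. This sanity check is my own addition, not part of the original setup:

# Sanity check (optional): a trivial Cypher query to confirm connectivity
print(graph.query("RETURN 1 AS ok"))  # expected output: [{'ok': 1}]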

Setting up the vector index

The Neo4j vector index is powered by Lucene, which implements a Hierarchical Navigable Small World (HNSW) graph to perform an approximate nearest neighbor (ANN) search over the vector space.

Neo4j's implementation of the vector index is designed to index a single node property of a node label. For example, if you wanted to index nodes with the label Chunk on their node property embedding, you would use the following Cypher procedure.

CALL db.index.vector.createNodeIndex(
    'wikipedia',  // index name
    'Chunk',      // node label
    'embedding',  // node property
    1536,         // vector size
    'cosine'      // similarity metric
)

Along with the index name, node label, and property, you must specify the vector size (embedding dimension) and the similarity metric. We will be using OpenAI's text-embedding-ada-002 embedding model, which uses a vector size of 1536 to represent text in the embedding space. At the moment, only the cosine and Euclidean similarity metrics are available. OpenAI suggests using the cosine similarity metric with their embedding model.
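
If you want to verify that the index was created with the expected configuration, you can list it with SHOW INDEXES. This quick check is my own addition, and the exact columns may vary between Neo4j versions:

SHOW INDEXES
YIELD name, type, labelsOrTypes, properties
WHERE type = 'VECTOR'
RETURN name, type, labelsOrTypes, properties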

Populating the vector index

Neo4j is schema-less by design, which means it doesn't enforce any restrictions on what goes into a node property. For example, the embedding property of the Chunk node could store integers, a list of integers, or even strings. Let's try this out.

WITH [1, [1,2,3], ["2","5"], [x in range(0, 1535) | toFloat(x)]] AS exampleValues
UNWIND range(0, size(exampleValues) - 1) AS index
CREATE (:Chunk {embedding: exampleValues[index], index: index})

This query creates a Chunk node for each element in the list and uses the element as the embedding property value. For example, the first Chunk node will have the embedding property value 1, the second node [1,2,3], and so on. Neo4j doesn't enforce any rules on what you can store under node properties. However, the vector index has clear instructions about the type of values and the embedding dimension it should index.

We can test which values were indexed by performing a vector index search.

CALL db.index.vector.queryNodes(
    'wikipedia',  // index name
    3,            // topK neighbors to return
    [x in range(0,1535) | toFloat(x) / 2]  // input vector
)
YIELD node, score
RETURN node.index AS index, score

If you run this query, you will get only a single node returned, even though you asked for the top 3 neighbors. Why is that? The vector index only indexes property values where the value is a list of floats with the specified size. In this example, only one embedding property value was a list of floats with the chosen length of 1536.

A node is indexed by the vector index if all of the following are true (a quick spot-check follows the list):

  • The node contains the configured label.
  • The node contains the configured property key.
  • The respective property value is of type LIST<FLOAT>.
  • The size() of the respective value is the same as the configured dimensionality.
  • The value is a valid vector for the configured similarity function.
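
One way to spot-check these conditions on our example nodes is to inspect the stored type of each embedding value. The following sketch assumes the APOC library is installed; without it, you would have to inspect the values manually:

// Inspect what each Chunk node stores under embedding (assumes APOC is installed)
MATCH (c:Chunk)
RETURN c.index AS index,
       apoc.meta.cypher.type(c.embedding) AS storedType
ORDER BY index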

Integrating the vector index into the LangChain ecosystem

Now we will implement a simple custom LangChain class that uses the Neo4j vector index to retrieve relevant information and generate accurate and up-to-date answers. But first, we have to populate the vector index.

Data flow using the Neo4j vector index in RAG applications. Image by the author. Icons from flaticon.

The task will consist of the following steps:

  • Retrieve a Wikipedia article
  • Chunk the text
  • Store the text along with its vector representation in Neo4j
  • Implement a custom LangChain class to support RAG applications

In this example, we will fetch only a single Wikipedia article. I've decided to use the Baldur's Gate 3 page.

import wikipedia
bg3 = wikipedia.page(pageid=60979422)

Next, we need to chunk and embed the text. We will split the text by section using the double newline delimiter and then use OpenAI's embedding model to represent each section with an appropriate vector.

import os
from langchain.embeddings import OpenAIEmbeddings

os.environ["OPENAI_API_KEY"] = "API_KEY"

embeddings = OpenAIEmbeddings()

chunks = [{'text': el, 'embedding': embeddings.embed_query(el)}
          for el in bg3.content.split("\n\n") if len(el) > 50]

Before we move on to the LangChain class, we need to import the text chunks into Neo4j.

graph.query("""
UNWIND $data AS row
CREATE (c:Chunk {text: row.text})
WITH c, row
CALL db.create.setVectorProperty(c, 'embedding', row.embedding)
YIELD node
RETURN distinct 'done'
""", {'data': chunks})

One thing you may notice is that I used the db.create.setVectorProperty procedure to store the vectors in Neo4j. This procedure verifies that the property value is indeed a list of floats. Additionally, it has the added benefit of reducing the storage space of the vector property by approximately 50%. Therefore, it is recommended to always use this procedure to store vectors in Neo4j.
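
For illustration, here is the procedure called in isolation on a single node; the MATCH pattern is a hypothetical example of my own, not from the import query above:

// Sketch: store a 1536-dimensional vector on one node with the procedure
MATCH (c:Chunk {index: 3})
CALL db.create.setVectorProperty(c, 'embedding', [x in range(0, 1535) | toFloat(x)])
YIELD node
RETURN node.index AS index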

Now we can go ahead and implement the custom LangChain class used to retrieve information from the Neo4j vector index and generate answers with it. First, we will define the Cypher statement used to retrieve the information.

vector_search = """
WITH $embedding AS e
CALL db.index.vector.queryNodes('wikipedia', 3, e)
YIELD node, score
RETURN node.text AS result
ORDER BY score DESC
LIMIT 3
"""

As you can see, I have hardcoded the index name and k, the number of neighbors to retrieve. You can make this dynamic by adding appropriate parameters, as in the sketch below.
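
The following variant passes the index name, k, and the embedding in as query parameters; vector_search_dynamic is my own name for it, not something from the original code:

# Sketch: a parameterized version of the retrieval query
vector_search_dynamic = """
CALL db.index.vector.queryNodes($index, $k, $embedding)
YIELD node, score
RETURN node.text AS result
ORDER BY score DESC
"""

# Example call, reusing the embedding of a user question:
# graph.query(vector_search_dynamic,
#             {'index': 'wikipedia', 'k': 3, 'embedding': embedding})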

The custom LangChain class is quite simple to implement.

class Neo4jVectorChain(Chain):
    """Chain for question-answering against a Neo4j vector index."""

    graph: Neo4jGraph = Field(exclude=True)
    input_key: str = "query"  #: :meta private:
    output_key: str = "result"  #: :meta private:
    embeddings: OpenAIEmbeddings = OpenAIEmbeddings()
    qa_chain: LLMChain = LLMChain(llm=ChatOpenAI(temperature=0), prompt=CHAT_PROMPT)

    def _call(self, inputs: Dict[str, str], run_manager) -> Dict[str, Any]:
        """Embed a question and do vector search."""
        question = inputs[self.input_key]

        # Embed the question
        embedding = self.embeddings.embed_query(question)

        # Retrieve relevant information from the vector index
        context = self.graph.query(
            vector_search, {'embedding': embedding})
        context = [el['result'] for el in context]

        # Generate the answer
        result = self.qa_chain(
            {"question": question, "context": context},
        )
        final_result = result[self.qa_chain.output_key]
        return {self.output_key: final_result}

I have omitted some boilerplate code to make it more readable (a sketch of the omitted parts follows the list below). Essentially, when you call the Neo4jVectorChain, the following steps are executed:

  1. Embed the question using the relevant embedding model
  2. Use the text embedding value to retrieve the most similar content from the vector index
  3. Use the provided context from the similar content to generate the answer
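
For reference, the omitted boilerplate roughly amounts to the imports, the chat prompt, and the standard Chain key properties. The sketch below is my reconstruction under those assumptions; CHAT_PROMPT in particular is an assumed prompt, not the exact one from the post:

from typing import Any, Dict, List

from langchain.chains.base import Chain
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.graphs import Neo4jGraph
from langchain.prompts import ChatPromptTemplate
from pydantic import Field

# An assumed QA prompt that grounds the answer in the retrieved context
CHAT_PROMPT = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using only the following context:\n\n{context}"),
    ("human", "{question}"),
])

# The Chain base class also expects these key properties on Neo4jVectorChain:
#
#     @property
#     def input_keys(self) -> List[str]:
#         return [self.input_key]
#
#     @property
#     def output_keys(self) -> List[str]:
#         return [self.output_key]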

We can now test our implementation.

vector_qa = Neo4jVectorChain(graph=graph, embeddings=embeddings, verbose=True)
vector_qa.run("What's the gameplay of Baldur's Gate 3 like?")

Generated response. Image by the author.

By using the verbose option, you can also inspect the retrieved context from the vector index that was used to generate the answer.

Summary

Leveraging Neo4j's new vector indexing capabilities, you can create a unified data source that effectively powers Retrieval Augmented Generation applications. This allows you not only to implement "Chat with your PDF or documentation" features but also to conduct real-time analytics, all from a single, robust data source. This multi-purpose usage can streamline your operations and enhance data synergy, making Neo4j a great solution for managing both structured and unstructured data.

As always, the code is available on GitHub.

