
LLM Architectures — the Building Blocks


LLMs are the new black: they allow individual developers to easily build capabilities that, until a year or two ago, were the preserve of large ML teams at best.

In this series of posts, we will try to characterize, from the software engineering side, what the architecture of LLM systems means. That is: when we build systems that utilize LLMs, what are the common and logical structures of the software around them?

Let’s start with a basic example

At the most basic level, a system using LLM looks like this:

Our system sends a prompt (text) to the LLM. The model can be consumed as SaaS (such as OpenAI's models) or self-hosted, like various open-source models.

The LLM will process the prompt — and return a textual response.

Of course, since we didn't do anything special with the prompt, we have created yet another chat application in the style of ChatGPT or Gemini. We could add a nicer UI, or chimes at the right moments, but at its core this is just another chat app, not something new.

There are four principal ways to influence the LLM model — to introduce new capabilities (marked in the diagram in blue shades):

Training a new model and fine-tuning an existing model are efforts that require a lot of resources and expertise: from tens of thousands of dollars to fine-tune a small model, up to many millions to train a new large model.

It seems that about 99% of LLM-based solutions rely on the two simple and readily available practices: Prompt Engineering and RAG, which we will focus on in this post. These are very cheap techniques that can be applied with a small investment and without intimate LLM knowledge.

Agentic workflows are a set of patterns that combine Prompt Engineering and RAG to achieve better results. We will discuss them later.

We will start by describing a minimal LLM-based solution, one that a single developer can develop, in a short time. We will do this with the most basic tool — Prompt Engineering.

Pay attention to the following Prompt template:

prompt_template = """You are a teacher grading a question. 
You are given a question, the student's answer, and the true answer, 
and are asked to score the student answer as either Correct or Incorrect.

Grade the student answers based ONLY on their factual accuracy. 
Ignore differences in punctuation and phrasing between the student answer 
and true answer. 
It is OK if the student answer contains more information than the true answer,
as long as it does not contain any conflicting statements. 
If the student answers that there is no specific information provided in 
the context, then the answer is Incorrect. Begin! 

QUESTION: {question}
STUDENT ANSWER: {studentAnswer}
TRUE ANSWER: {trueAnswer}
GRADE:

Your response should be as follows:

GRADE: (Correct or Incorrect)
(line break)
JUSTIFICATION: (Without mentioning the student/teacher framing of this prompt, 
explain why the STUDENT ANSWER is Correct or Incorrect. Use up to five 
sentences maximum. Keep the answer as concise as possible.)
"""

Code source.

If we place values for the parameters question, studentAnswer, and trueAnswer in the correct places and send the prompt to the LLM, it will return an analysis of whether the answer is correct or not, and an explanation of why.

In practice, we created a focused and useful function that did not exist before: checking the correctness of an answer.
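To make this concrete, here is a minimal sketch of wiring the template to an LLM. It assumes the openai Python client (v1+) and an illustrative model name; any chat-completion API would work just as well:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade_answer(question: str, student_answer: str, true_answer: str) -> str:
    # Fill the template's placeholders and send the prompt to the model.
    prompt = prompt_template.format(
        question=question,
        studentAnswer=student_answer,
        trueAnswer=true_answer,
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any capable chat model will do
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # grading should be as deterministic as possible
    )
    return response.choices[0].message.content

print(grade_answer(
    "List the main causes of the outbreak of the First World War",
    "Alliances, militarism, imperialism, and the assassination of Franz Ferdinand.",
    "Militarism, alliances, imperialism, nationalism, and the assassination of Archduke Franz Ferdinand.",
))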

For a history question such as “List the main causes of the outbreak of the First World War”, we can solve with little effort a problem that would be very difficult to solve using imperative programming (e.g., Java/Python code). Analyzing the text, understanding its semantics, and matching it to the example text (the correct answer) is a hard problem for conventional programming, and challenging even with common NLP (Natural Language Processing) techniques.

With the help of LLM and a little prompt engineering — this becomes a simple task.

Prompt Engineering is the art of choosing the exact instructions and wording of the prompt to produce the desired effect on the LLM. Achieving a good result requires both knowledge and several trial-and-error iterations.

Please note that the above solution does not rely on the LLM's knowledge of history (knowledge that it happens to have) but on its ability to compare a student's answer with the answer we defined as correct, in a deep semantic manner. This lets us determine what the correct answer is from our point of view, and also lets us handle topics on which the LLM lacks sufficient knowledge (for example: the municipal bylaws of the Tel Aviv Municipality). The LLM is used here mainly as a language expert and less as a source of knowledge; we inject the specific relevant historical knowledge in the form of trueAnswer.

An important point: if we want to use the above prompt to check the answer to questions such as “In what year was the State of Israel established?”, we are probably misusing the tool. There is no need for an LLM to check such an answer. LLMs are expensive to use (compute, latency), are not deterministic, and require maintenance to guard against drift in results over time. It is better to do simple pattern matching to check such an answer, as in the sketch below.
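For illustration, a deterministic check in Python; the regex is a hypothetical stand-in for whatever patterns fit the question:

import re

def grade_simple_answer(student_answer: str) -> str:
    # A closed, factual question needs only pattern matching, not an LLM.
    return "Correct" if re.search(r"\b1948\b", student_answer) else "Incorrect"

print(grade_simple_answer("The State of Israel was established in 1948."))  # Correct

This version is cheap, instant, and gives the same result every time, exactly the properties the LLM-based grader lacks.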

In conclusion

We used an LLM to produce a useful function that can be integrated into an existing system, a function that, had we tried to develop it without an LLM, it is doubtful we would have matched in quality or development time. So far this is not an “LLM system” but a system that has a function that uses an LLM: a grading function.

LLM Agents

The ability to integrate LLM-enhanced functions into existing systems and provide new, additional value is exciting by itself, and should be widely used wherever it is useful!

LLMs open up greater opportunities than writing smarter functions, such as letting the LLM manage complete and complex processes. The world is now talking about Agents: smart agents that know how to perform a variety of functions, much like a professional specializing in a certain field. For example, an agent acting as a real estate attorney.

As a reminder: language models are trained on a large amount of public information from the Internet (such as Wikipedia), plus other third-party or private datasets available to the model trainers. They contain very broad knowledge but usually do not specialize in depth in specific topics. ChatGPT can answer general questions about law and real estate, but it is far from being the equivalent of an experienced attorney in the field. These gaps can be reduced with the help of certain tools, such as RAG.

RAG allows us to “inject” into the prompt relevant information that does not exist in the model (i.e., that the model was not trained on), in a dynamic way tailored to the specific request. RAG makes it possible to turn a generic LLM-based chat into one that answers in-depth questions in a certain field (e.g., real estate) well.

Answering questions is still a rather limited use compared to what human experts do. We would like our LLM agents (as an ideal) to handle any task that a human professional would handle. For example:

  • Provide a legal opinion on a certain case and suggest possible courses of action.
  • Respond to another attorney’s legal opinion, in order to argue for a certain position.
  • Contact the customer and collect additional details from her to prepare the legal opinion.
  • Suggest the next step in the management of the legal case.
  • Manage the legal case end-to-end until the case is won.

These functions are a step up from a “smart chat that knows a lot”, but with the help of additional tools, we can begin to see the way there.

The vision continues even further, to a “Multi-Agent architecture”: a system composed of intelligent LLM agents that communicate with one another to perform complex tasks, in a manner similar to the way human organizations operate.

In practice, these ideas have not yet proven themselves at scale. It is possible that these or other architectures will lead the way; it is still too early to say.

Retrieval Augmented Generation (RAG)

Let’s assume I’m building a chat for a website that organizes knowledge about restaurants, and I want to answer certain questions for customers. For example: “Is Burger Saloon open right now?”

The LLM, as a model, does not have enough knowledge to answer the question:

  • It does not know who the user is or where they are located. Which branch should it refer to?
  • It does not know about the specific restaurant chain “Burger Saloon”, as it was not included in its training data. Nevertheless, it was able to understand the context of the “Burger Saloon” name and give a logical answer, which is impressive!
  • It does not know the opening hours of the specific restaurant — information that can be updated all the time.
  • It doesn’t know the current date; that is, even if it knew the opening hours of the restaurant, it couldn’t provide a natural answer such as “Today they are open until 8:00 PM, but tomorrow from nine in the morning until midnight”.

We could add more context to the prompt, using prompt engineering, to enrich the model with more information. For example: the operating hours of all the restaurants we know of. This could improve the answer, but:

  • The prompt would be rather large and therefore expensive to process (costs, latency)
  • The bigger the prompt, the more the LLM’s attention is scattered, which can lead to inferior results, such as providing the opening hours of another restaurant.

Hence, the logical step is to provide a dynamic prompt — with information targeted to the specific query. This is exactly what the RAG process does:

  1. We receive the prompt from the user.
  2. We pass it to the RAG process to try to improve it (enrich the context).
  3. We analyze the prompt to understand what context is required. Example: search for keywords in the prompt, such as the restaurant name, “Burger Saloon”.
  4. We extract from the database (aka “Knowledge Base”) that we prepared in advance general information and the current opening hours of Burger Saloon.
  5. We enrich the prompt with relevant context:
    • Today’s date
    • Location of the user (we extracted geoLocation from the browser)
    • General information about the Burger Saloon restaurant and current opening hours
  6. We send the improved prompt to the LLM, in order to get a quality answer that we could not have gotten without the RAG process.

The improved prompt can look like this:

INFORMATION:
current time is {datetime}
the user current location is {address}
Information about the restaurant:
{restaurant_general_info}
{restaurant_opening_hours}

QUERY:
{user_query}

INSTRUCTIONS:
You are a helpful agent providing information about restaurants.
Answer the user's QUERY using the INFORMATION text above. Keep your answer 
grounded in the facts of the INFORMATION. If the INFORMATION doesn’t contain 
the facts to answer the QUERY, answer that you don't know.
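Here is a minimal sketch of steps 1-5 in Python. It assumes the improved prompt above is stored in a rag_prompt_template string and that the knowledge base is a simple in-memory dict; a real system would use a database or search index:

from datetime import datetime

# Step 4's "knowledge base", simplified to a dict for this sketch.
knowledge_base = {
    "Burger Saloon": {
        "general_info": "A burger chain with several branches.",
        "opening_hours": "Sun-Thu 12:00-23:00, Fri 12:00-16:00",
    },
}

def enrich_prompt(user_query: str, user_address: str) -> str:
    # Step 3: naive context analysis - look for a known restaurant name.
    restaurant = next(
        (name for name in knowledge_base if name.lower() in user_query.lower()),
        None,
    )
    info = knowledge_base.get(restaurant, {})
    # Step 5: fill the template with today's date, the user's location,
    # and the retrieved restaurant information.
    return rag_prompt_template.format(
        datetime=datetime.now().strftime("%Y-%m-%d %H:%M"),
        address=user_address,
        restaurant_general_info=info.get("general_info", "unknown"),
        restaurant_opening_hours=info.get("opening_hours", "unknown"),
        user_query=user_query,
    )

The enriched prompt is then sent to the LLM (step 6) exactly like any other prompt.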

RAG is a world in itself, with different techniques for choosing which information to retrieve, how much of it to retrieve, how to order it in the prompt, how to index the information, and how to monitor and improve the retrieval process. There is not enough room to cover all these topics here.

Tool Use

Let’s go back to the exam grading system we created at the beginning.

We’ll look at a situation where an LLM is not only more expensive, but also far less successful at providing good results. For example: checking the solution of a mathematical problem.

LLM models are very good at answering general and abstract questions, but much less successful when precision is required. LLM models don’t actually understand math; they are stochastic parrots that summarize and repeat billions of conversations they have been exposed to, with a rare talent for understanding context and generating a (seemingly) logical answer.

We can train LLM models on more mathematical examples to improve their math skills, but even then they hit a glass ceiling, especially in operations with large numbers, where the density of examples per specific number drops.

An article showing the significant improvement in a model trained on mathematical data — but also the decline in results as the numbers go larger (number of digits). Source: LLMs Bad at Math.

Today, if we ask ChatGPT to solve a math problem with large numbers, we will most likely receive an accurate answer:

The improvement we see is not an improvement in the GPT model, but an improvement in the ChatGPT application: the application provides the model with a mathematical calculation tool that it can use.

As soon as the model recognizes that it has to solve a mathematical problem, it invokes the external tool, and then integrates the tool’s result into its answer.

The tool itself does not use an LLM. It is better developed with plain imperative code, which performs accurate mathematical calculations easily and very efficiently. A tool compensates for the model at the points where the model is weak, and often this compensation works very well.

We see that an LLM is not a “silver bullet” for every problem, but its combination with classic tools (imperative code, i.e., a function in Java/Python/etc.) advances us towards the vision of a smart and versatile LLM agent.

The system can use multiple tools. Some can be based on imperative coding, some are based on calling third-party systems (e.g. a weather forecasting system) and others can be based on classical ML models that are more suitable for a given task (e.g. linear regression).

The selection of which tool to use (a somewhat abstract task, given the prompt) can also be done by the LLM. Here is an example prompt:

system_prompt = """
You are an assistant that has access to the following set of tools. 
Here are the names and descriptions for each tool:

{rendered_tools}

None - if the query require none of the tools above.

Given the user input, return the name of the most appropaite tool to 
respond to the query effectively."""

At {rendered_tools} we insert a mapping of tool names to textual descriptions of what each tool does and when it is most appropriate to use it. Prompt engineering is key here: choosing the best wording yields the best results.

In the next step, we check which arguments the selected tool requires, and we can ask the LLM to extract them from the conversation. For example: the number whose square root we want to calculate.

From here, we can run the tool (a math calculator function) and combine the result into the answer to the user. We have combined the LLM’s understanding of text and context with a precise calculation the LLM is not good at. The sketch below puts the pieces together.
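A minimal sketch of this loop, assuming the system_prompt above and a call_llm(prompt) helper wrapping whatever chat-completion API is in use; the tool registry and prompts are illustrative, not a specific framework's API:

import json
import math

# An illustrative tool registry: name, description, and an implementation.
TOOLS = {
    "square_root": {
        "description": "Computes the square root of a number.",
        "run": lambda args: math.sqrt(float(args["number"])),
    },
}

rendered_tools = "\n".join(
    f"{name}: {tool['description']}" for name, tool in TOOLS.items()
)

def answer_with_tools(user_input: str, call_llm) -> str:
    # First LLM call: pick the most appropriate tool (or "None").
    tool_name = call_llm(
        system_prompt.format(rendered_tools=rendered_tools)
        + f"\n\nUser input: {user_input}"
    ).strip()
    if tool_name not in TOOLS:
        return call_llm(user_input)  # no tool needed: answer directly
    # Second LLM call: extract the tool's arguments as JSON.
    args = json.loads(call_llm(
        f"Return ONLY a JSON object with the arguments for the tool "
        f"'{tool_name}', e.g. {{\"number\": 42}}, based on: {user_input}"
    ))
    # Run the tool, then weave its precise result into a natural answer.
    result = TOOLS[tool_name]["run"](args)
    return call_llm(
        f"The user asked: {user_input}\n"
        f"The tool '{tool_name}' returned: {result}\n"
        f"Answer the user using this result."
    )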

Tools can provide a variety of dynamic complements to the limitations of the LLM. For example: RAG. Instead of knowing in advance what information will be needed and adding it as context to the prompt, sometimes it is more effective to let the LLM deduce what information is missing and ask for it dynamically from the available tools. For example: a tool that provides the operating hours of a restaurant.

Other tools perform actions. For example: the LLM is a closed model and cannot call the restaurant’s APIs to submit an order. However, the LLM can conclude that the customer wants to place an order, and run a tool that calls the restaurant’s API and does so.

The use of tools expands and diversifies the capabilities of the LLM and brings us closer to the vision of the LLM Agent. We are still missing some tools, which we will talk about in the next post.

In conclusion

LLM is a “kind of magic”: a breakthrough in our ability to solve problems using software. Although the models are getting better and better, an LLM in its generic form solves very few problems. To solve diverse problems we need to complement the LLM with software engineering (prompt engineering, RAG, tools) and build LLM-based solutions.

The vision discussed today is LLM Agents that will replace human experts in their field, at least partially. The way there is step by step: building and improving capabilities one by one. The progress in this field is nothing short of amazing, both in terms of the models, which are getting better and better (accuracy, costs), and in terms of the engineering techniques that allow harnessing the models and turning them into solutions to business problems.

After working with the subject over the last few months, I recommend putting less emphasis on planning Multi-Agent systems and more on building business capabilities one by one, exploring how the system best evolves. Additionally, many problems can be solved with the help of an LLM, which is pretty cool, but not always worthwhile. LLM-based functions, especially complex ones, require ongoing monitoring and maintenance. The model is stochastic and not always predictable or logical; this should be monitored and mitigated all the time. Many changes that we did not expect to affect the LLM function’s results may do so, such as slightly updating the prompt, adding data to the knowledge base, or switching to a more advanced model.

As always in software engineering, it is important to choose the right tool for the task and not be blinded by the buzz (even though it is amazing this time). It is often better to use a classic ML model or a simple but good-enough algorithm than to maintain an LLM function that is easy to write but requires continuous maintenance.

Good luck!

This piece was originally published on Lior Bar-On’s blog, Softwarearchiblog.

