
Quantize Llama models with GGML and llama.cpp


GGML vs. GPTQ vs. NF4

Image by author

Due to the massive size of Large Language Models (LLMs), quantization has become an essential technique to run them efficiently. By reducing the precision of their weights, you can save memory and speed up inference while preserving most of the model's performance. Recently, 8-bit and 4-bit quantization unlocked the possibility of running LLMs on consumer hardware. Combined with the release of Llama models and parameter-efficient techniques to fine-tune them (LoRA, QLoRA), this created a rich ecosystem of local LLMs that are now competing with OpenAI's GPT-3.5 and GPT-4.

Besides the naive approach covered in this article, there are three main quantization techniques: NF4, GPTQ, and GGML. NF4 is a static method used by QLoRA to load a model in 4-bit precision to perform fine-tuning. In a previous article, we explored the GPTQ method and quantized our own model to run it on a consumer GPU. In this article, we will introduce the GGML technique, see how to quantize Llama models with it, and provide tips and tricks to achieve the best results.

You can find the code on Google Colab and GitHub.

GGML is a C library focused on machine learning. It was created by Georgi Gerganov, which is what the initials "GG" stand for. The library not only provides foundational elements for machine learning, such as tensors, but also a unique binary format to distribute LLMs.

This format recently changed to GGUF. The new format is designed to be extensible, so that new features don't break compatibility with existing models. It also centralizes all the metadata in one file, such as special tokens, RoPE scaling parameters, etc. In short, it addresses a few historical pain points and should be future-proof. For more information, you can read the specification at this address. In the rest of the article, we will call "GGML models" all models that either use GGUF or previous formats.

GGML was designed to be used in conjunction with the llama.cpp library, also created by Georgi Gerganov. The library is written in C/C++ for efficient inference of Llama models. It can load GGML models and run them on a CPU. Originally, this was the main difference with GPTQ models, which are loaded and run on a GPU. However, you can now offload some layers of your LLM to the GPU with llama.cpp. To give you an example, there are 35 layers for a 7B parameter model. This drastically speeds up inference and allows you to run LLMs that don't fit in your VRAM.

Image by author

If command-line tools aren't your thing, llama.cpp and GGUF support have been integrated into many GUIs and libraries, like oobabooga's text-generation-webui, koboldcpp, LM Studio, or ctransformers. You can simply load your GGML models with these tools and interact with them in a ChatGPT-like way. Fortunately, many quantized models are directly available on the Hugging Face Hub. You'll quickly notice that most of them are quantized by TheBloke, a popular figure in the LLM community.
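As an illustration, here is a minimal sketch of how a GGML model from the Hub could be loaded with the ctransformers Python library. The model_file name below is an assumption, so check the exact filename in the repo before running it.

from ctransformers import AutoModelForCausalLM

# Load a GGML model from the Hub (model_file is an example name, verify it in the repo)
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-chat-GGML",
    model_file="llama-2-13b-chat.ggmlv3.q4_K_M.bin",
    model_type="llama",
    gpu_layers=0  # increase this value to offload layers to the GPU
)
print(llm("Write a haiku about quantization:"))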

In the next section, we will see how to quantize our own models and run them on a consumer GPU.

Let's look at the files inside the TheBloke/Llama-2-13B-chat-GGML repo. We can see 14 different GGML models, corresponding to different types of quantization. They follow a particular naming convention: "q" + the number of bits used to store the weights (precision) + a particular variant. Here is a list of all the possible quant methods and their corresponding use cases, based on model cards made by TheBloke:

  • q2_k: Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.
  • q3_k_l: Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
  • q3_k_m: Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
  • q3_k_s: Uses Q3_K for all tensors
  • q4_0: Original quant method, 4-bit.
  • q4_1: Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.
  • q4_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K
  • q4_k_s: Uses Q4_K for all tensors
  • q5_0: Higher accuracy, higher resource usage and slower inference.
  • q5_1: Even higher accuracy, resource usage and slower inference.
  • q5_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K
  • q5_k_s: Uses Q5_K for all tensors
  • q6_k: Uses Q8_K for all tensors
  • q8_0: Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.
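To see this naming convention in practice, we can list the files in the repo with the huggingface_hub client. This is just a quick sketch, assuming huggingface_hub is installed:

from huggingface_hub import HfApi

# Print the quantized bin files in TheBloke's repo to inspect the naming convention
api = HfApi()
for file in api.list_repo_files("TheBloke/Llama-2-13B-chat-GGML"):
    if file.endswith(".bin"):
        print(file)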

As a rule of thumb, I recommend using Q5_K_M as it preserves most of the model's performance. Alternatively, you can use Q4_K_M if you want to save some memory. In general, K_M versions are better than K_S versions. I don't recommend Q2 or Q3 versions, as they drastically decrease model performance.

Now that we know more about the available quantization types, let's see how to use them on a real model. You can execute the following code on a free T4 GPU on Google Colab. The first step consists of compiling llama.cpp and installing the required libraries in our Python environment.

# Install llama.cpp
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
!pip install -r llama.cpp/requirements.txt

Now we can download our model. We will use the model we fine-tuned in the previous article, mlabonne/EvolCodeLlama-7b.

MODEL_ID = "mlabonne/EvolCodeLlama-7b"

# Download model
!git lfs install
!git clone https://huggingface.co/{MODEL_ID}

This step can take a while. Once it's done, we need to convert our weights to the GGML FP16 format.

MODEL_NAME = MODEL_ID.split('/')[-1]
GGML_VERSION = "gguf"

# Convert to fp16
fp16 = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{GGML_VERSION}.fp16.bin"
!python llama.cpp/convert.py {MODEL_NAME} --outtype f16 --outfile {fp16}

Finally, we can quantize the model using one or several methods. In this case, we will use the Q4_K_M and Q5_K_M methods I recommended earlier. This is the only step that actually requires a GPU.

QUANTIZATION_METHODS = ["q4_k_m", "q5_k_m"]

for method in QUANTIZATION_METHODS:
    qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{GGML_VERSION}.{method}.bin"
    !./llama.cpp/quantize {fp16} {qtype} {method}

Our two quantized models are now ready for inference. We can check the size of the bin files to see how much we compressed them. The FP16 model takes up 13.5 GB, while the Q4_K_M model takes up 4.08 GB (3.3 times smaller) and the Q5_K_M model takes up 4.78 GB (2.8 times smaller).
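You can reproduce this comparison with a few lines of Python, reusing the variables defined in the previous cells:

import os

# Compare the size of the FP16 file with the quantized versions
files = [fp16] + [f"{MODEL_NAME}/{MODEL_NAME.lower()}.{GGML_VERSION}.{method}.bin" for method in QUANTIZATION_METHODS]
for file in files:
    print(f"{file}: {os.path.getsize(file) / 1e9:.2f} GB")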

Let's use llama.cpp to run them efficiently. Since we're using a GPU with 16 GB of VRAM, we can offload every layer to the GPU. In this case, that represents 35 layers (7B parameter model), so we'll use the -ngl 35 parameter. In the following code block, we'll also enter a prompt and the quantization method we want to use.

import os

model_list = [file for file in os.listdir(MODEL_NAME) if GGML_VERSION in file]
prompt = input("Enter your prompt: ")
chosen_method = input("Please specify the quantization method to run the model (options: " + ", ".join(model_list) + "): ")

# Verify the chosen method is in the list
if chosen_method not in model_list:
    print("Invalid method chosen!")
else:
    qtype = f"{MODEL_NAME}/{chosen_method}"
    !./llama.cpp/main -m {qtype} -n 128 --color -ngl 35 -p "{prompt}"

Let's ask the model "Write a Python function to print the nth Fibonacci numbers" using the Q5_K_M method. If we look at the logs, we can confirm that we successfully offloaded our layers thanks to the line "llm_load_tensors: offloaded 35/35 layers to GPU". Here is the code the model generated:

def fib(n):
    if n == 0 or n == 1:
        return n
    return fib(n - 2) + fib(n - 1)

for i in range(1, 10):
    print(fib(i))

This wasn't a very complex prompt, but it successfully produced a working piece of code in no time. With this GGML model, you can use your local LLM as an assistant in a terminal using the interactive mode (-i flag). Note that this also works on MacBooks with Apple's Metal Performance Shaders (MPS), which is an excellent option to run LLMs.
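For example, a terminal session with the Q5_K_M model we just created could be started like this (a sketch reusing the flags from the previous cell):

# Chat with the quantized model in interactive mode from a terminal
./llama.cpp/main -m EvolCodeLlama-7b/evolcodellama-7b.gguf.q5_k_m.bin -ngl 35 -i --color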

Finally, we can push our quantized model to a new repo on the Hugging Face Hub with the "-GGML" suffix. First, let's log in and modify the following code block to match your username.

!pip install -q huggingface_hub

username = "mlabonne"

from huggingface_hub import notebook_login, create_repo, HfApi
notebook_login()

Now we can create the repo and upload our models. We use the allow_patterns parameter to filter which files to upload, so we don't push the entirety of the directory.

api = HfApi()

# Create repo
create_repo(
    repo_id=f"{username}/{MODEL_NAME}-GGML",
    repo_type="model",
    exist_ok=True
)

# Upload bin models
api.upload_folder(
    folder_path=MODEL_NAME,
    repo_id=f"{username}/{MODEL_NAME}-GGML",
    allow_patterns=f"*{GGML_VERSION}*",
)

We have successfully quantized, run, and pushed GGML models to the Hugging Face Hub! In the next section, we will explore how GGML actually quantizes these models.

The way GGML quantizes weights is not as sophisticated as GPTQ's. Basically, it groups blocks of values and rounds them to a lower precision. Some techniques, like Q4_K_M and Q5_K_M, implement a higher precision for important layers. In this case, every weight is stored in 4-bit precision, except for half of the attention.wv and feed_forward.w2 tensors. Experimentally, this mixed precision proves to be a good tradeoff between accuracy and resource usage.

If we look into the ggml.c file, we can see how the blocks are defined. For example, the block_q4_0 structure is defined as:

#define QK4_0 32
typedef struct {
    ggml_fp16_t d;          // delta
    uint8_t qs[QK4_0 / 2];  // nibbles / quants
} block_q4_0;

In GGML, weights are processed in blocks, each consisting of 32 values. For each block, a scale factor (delta) is derived from the largest weight value. All weights in the block are then scaled, quantized, and packed efficiently for storage (nibbles). This approach significantly reduces the storage requirements while allowing for a relatively simple and deterministic conversion between the original and quantized weights.
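To make this more concrete, here is a simplified NumPy sketch of Q4_0-style block quantization. It only illustrates the idea described above (scale, round, pack nibbles) and is not the exact code from ggml.c:

import numpy as np

def quantize_q4_0_block(block):
    """Quantize a block of 32 float weights to 4-bit values (simplified Q4_0)."""
    assert block.size == 32
    # Scale factor (delta) derived from the value with the largest magnitude
    max_val = block[np.argmax(np.abs(block))]
    d = max_val / -8 if max_val != 0 else 1.0
    # Map each weight to an integer in [0, 15]
    quants = np.clip(np.round(block / d) + 8, 0, 15).astype(np.uint8)
    # Pack two 4-bit values (nibbles) per byte, like block_q4_0.qs
    packed = quants[:16] | (quants[16:] << 4)
    return np.float16(d), packed

def dequantize_q4_0_block(d, packed):
    """Recover approximate weights from the scale factor and packed nibbles."""
    quants = np.concatenate([packed & 0x0F, packed >> 4]).astype(np.int8)
    return (quants - 8) * np.float32(d)

# Quantize a random block and measure the rounding error
block = np.random.randn(32).astype(np.float32)
d, packed = quantize_q4_0_block(block)
print("max abs error:", np.max(np.abs(block - dequantize_q4_0_block(d, packed))))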

Now that we know more about the quantization process, we can compare the results with NF4 and GPTQ.

Which technique is better for 4-bit quantization? To answer this question, we need to introduce the different backends that run these quantized LLMs. For GGML models, llama.cpp with Q4_K_M models is the way to go. For GPTQ models, we have two options: AutoGPTQ or ExLlama. Finally, NF4 models can directly be run in transformers with the --load-in-4bit flag.
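For reference, here is a minimal sketch of how an NF4 model can be loaded with transformers and bitsandbytes. The model ID is only an example:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization config (bitsandbytes)
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

# Example model ID, replace it with the model you want to load
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=nf4_config,
    device_map="auto"
)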

Oobabooga ran several experiments in an excellent blog post comparing different models in terms of perplexity (lower is better).

Based on these results, we can say that GGML models have a slight advantage in terms of perplexity. The difference is not particularly significant, which is why it is better to focus on generation speed in terms of tokens/second. The best technique depends on your GPU: if you have enough VRAM to fit the whole quantized model, GPTQ with ExLlama will be the fastest. If that's not the case, you can offload some layers and use GGML models with llama.cpp to run your LLM.

In this article, we introduced the GGML library and the new GGUF format to efficiently store these quantized models. We used it to quantize our own Llama model in different formats (Q4_K_M and Q5_K_M). We then ran the GGML model and pushed our bin files to the Hugging Face Hub. Finally, we delved deeper into GGML's code to understand how it actually quantizes the weights and compared it to NF4 and GPTQ.

Quantization is a formidable vector to democratize LLMs by lowering the cost of running them. In the future, mixed precision and other techniques will keep improving the performance we can achieve with quantized weights. Until then, I hope you enjoyed reading this article and learned something new.

If you're interested in more technical content around LLMs, follow me on Medium.
