
Model Quantization: GPTQ, GGUF, AWQ

28. 04. 2024 4 min read intermediate

Quantization is a key technique for making large language models accessible beyond large data centers. It dramatically reduces a model's memory requirements while largely preserving its performance, using methods such as GPTQ, GGUF, or AWQ.

What is Model Quantization

Quantization is a technique for reducing neural network size by lowering data type precision. Instead of standard 32-bit float values, we use 16-bit, 8-bit, or even 4-bit representations. The result is dramatic reduction in memory requirements and inference acceleration, often with minimal quality loss.

For large language models, quantization is crucial: a 70B-parameter model takes approximately 140 GB of memory in FP16. After 4-bit quantization, it drops to roughly 35 GB, enabling execution on consumer hardware.
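The arithmetic behind these numbers is simple: parameter count times bits per weight, divided by 8, gives bytes. A minimal sketch (ignoring activations, KV cache, and quantization metadata, which add real-world overhead):

```python
def model_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (weights only, no runtime overhead)."""
    return n_params * bits_per_weight / 8 / 1e9

n = 70e9  # 70B parameters
print(f"FP16:  {model_memory_gb(n, 16):.0f} GB")  # 140 GB
print(f"4-bit: {model_memory_gb(n, 4):.0f} GB")   # 35 GB
```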

GPTQ: Post-training Quantization with Calibration

GPTQ (GPT Quantization) is an advanced post-training quantization method focused on minimizing compression errors. It uses a small calibration dataset to optimize quantization parameters for each layer.

GPTQ Functioning Principle

GPTQ solves an optimization problem for each layer independently. For weight matrix W, it seeks a quantized version Q(W) that minimizes the Frobenius norm of the difference:

min ||WX - Q(W)X||²_F

The algorithm proceeds through the weight matrix block by block and uses second-order information (the Hessian of the layer's reconstruction error) to compensate for quantization error: the error introduced by each quantized weight is redistributed onto the weights not yet quantized.
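For intuition, here is the simple per-group round-to-nearest baseline that GPTQ improves upon, in pure Python. This sketch has no Hessian-based error compensation, and the weight and activation values are made up for the example:

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight group."""
    qmax = 2 ** (bits - 1) - 1                     # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [round(w / scale) for w in weights]        # integer codes in [-7, 7]
    return [qi * scale for qi in q]                # dequantized weights

# One weight group and one input vector (toy values)
W = [0.12, -0.53, 0.97, -0.08, 0.33, -0.71, 0.05, 0.44]
X = [1.0, -2.0, 0.5, 0.3, -1.0, 0.8, 2.0, -0.4]

Wq = quantize_group(W)
exact = sum(w * x for w, x in zip(W, X))    # W·X
approx = sum(w * x for w, x in zip(Wq, X))  # Q(W)·X
print(f"output error: {abs(exact - approx):.4f}")  # small vs. |W·X|
```

GPTQ's contribution is precisely to make this residual error smaller than naive rounding achieves, by choosing the rounding of each weight with the layer's inputs in mind.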

GPTQ Implementation

For quantization using GPTQ, we can use the AutoGPTQ library:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Load tokenizer of the original model
model_name = "microsoft/DialoGPT-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Quantization configuration
quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit quantization
    group_size=128,  # one scale per 128 weights
    desc_act=False,  # disable activation-order quantization (faster inference)
)

# Load the original (unquantized) model
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)

# Prepare calibration data (AutoGPTQ expects tokenized examples, not raw strings)
calibration_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is transforming technology.",
    # more calibration sentences...
]
examples = [tokenizer(text, return_tensors="pt") for text in calibration_texts]

# Run quantization
model.quantize(examples)

# Save quantized model
model.save_quantized("./quantized_model")

GPTQ Advantages and Disadvantages

Advantages:

  • High quantization quality due to calibration
  • Support for various bit-widths (2, 3, 4, 8 bit)
  • Optimized for GPU inference

Disadvantages:

  • Slow quantization process
  • Requires calibration data
  • Primarily for NVIDIA GPUs

GGUF: Universal Format for CPU Inference

GGUF is a binary format developed by the llama.cpp project. It replaced the older GGML format and is optimized for efficient loading and CPU inference.

GGUF Format Structure

GGUF contains metadata, tensor information, and the data itself in one file. It supports various quantization schemes labeled Q4_0, Q5_1, Q8_0, Q4_K_M, etc., where the number after Q indicates the bit count and the suffix the algorithm variant.
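The labels can be decoded mechanically. The helper below is hypothetical (not part of llama.cpp) and just illustrates the naming convention:

```python
import re

def parse_quant_type(name: str) -> dict:
    """Decode a GGUF quantization label like 'Q4_0' or 'Q4_K_M'."""
    m = re.fullmatch(r"Q(\d+)_(.+)", name)
    if not m:
        raise ValueError(f"unrecognized quantization type: {name}")
    return {"bits": int(m.group(1)), "variant": m.group(2)}

print(parse_quant_type("Q4_0"))    # {'bits': 4, 'variant': '0'}
print(parse_quant_type("Q4_K_M"))  # {'bits': 4, 'variant': 'K_M'}
```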

# Convert the model to GGUF (FP16) format
python convert.py --outfile model-f16.gguf \
                  --outtype f16 \
                  ./original_model/

# Quantize to 4-bit with the llama.cpp quantize tool
./quantize model-f16.gguf model-q4_0.gguf q4_0

# Run with llama.cpp
./main -m model-q4_0.gguf \
       -n 128 \
       -p "Explain quantum computing:" \
       -t 8  # number of threads

Working with GGUF in Python

For working with GGUF models, we can use the llama-cpp-python wrapper:

from llama_cpp import Llama

# Load GGUF model
llm = Llama(
    model_path="./model.gguf",
    n_ctx=2048,  # context window
    n_threads=8,  # CPU threads
    verbose=False
)

# Text generation
output = llm(
    "Explain the concept of quantization:",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9
)

print(output['choices'][0]['text'])

GGUF Quantization Variants

  • Q4_0: 4-bit quantization, fastest, slight quality loss
  • Q5_1: 5-bit, better quality than Q4_0
  • Q8_0: 8-bit, minimal quality loss
  • Q4_K_M: Mixed precision, optimal speed/quality ratio
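To estimate resulting file sizes, each variant's effective bits per weight can be used, including the per-block scale metadata. The values below are approximations derived from llama.cpp's block layouts and should be treated as rough assumptions:

```python
# Approximate effective bits per weight, including per-block scale metadata
# (assumed values based on llama.cpp block layouts, not exact for every model)
BITS_PER_WEIGHT = {"Q4_0": 4.5, "Q5_1": 6.0, "Q8_0": 8.5}

def gguf_size_gb(n_params: float, quant_type: str) -> float:
    """Rough GGUF file size in GB for a given parameter count and variant."""
    return n_params * BITS_PER_WEIGHT[quant_type] / 8 / 1e9

for qt in BITS_PER_WEIGHT:
    print(f"7B model as {qt}: ~{gguf_size_gb(7e9, qt):.1f} GB")
```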

AWQ: Activation-aware Weight Quantization

AWQ is a relatively new method that considers the importance of individual weights based on activation patterns during quantization. Instead of uniform quantization of all weights, it protects important channels from degradation.

Key AWQ Principles

AWQ identifies "salient channels" - channels with large activation magnitudes that have the greatest impact on model output. Rather than keeping these weights in higher precision (which would complicate inference), AWQ rescales them before quantization so that their relative quantization error shrinks.
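Conceptually, the scaling step can be sketched as follows. This is a toy illustration with made-up numbers, not the library implementation; the real AWQ searches for optimal scales via grid search over the exponent:

```python
def awq_style_scales(activations, alpha=0.5):
    """Per-channel scales from mean absolute activation magnitude.
    Salient channels (large activations) get scale > 1; multiplying their
    weights by the scale (and dividing activations by it) reduces the
    relative quantization error of the weights that matter most."""
    n_ch = len(activations[0])
    mags = [sum(abs(row[c]) for row in activations) / len(activations)
            for c in range(n_ch)]
    mean = sum(mags) / n_ch
    return [(m / mean) ** alpha for m in mags]

# Toy calibration batch: 3 samples x 4 channels; channel 2 is salient
acts = [[0.1, 0.2, 5.0, 0.3],
        [0.2, 0.1, 4.0, 0.2],
        [0.1, 0.3, 6.0, 0.1]]
scales = awq_style_scales(acts)
print(scales)  # channel 2 receives the largest scale
```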

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"
quant_path = "mistral-7b-awq"

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# AWQ quantization
model.quantize(tokenizer, quant_config={
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
})

# Save quantized model
model.save_quantized(quant_path)

# Load and use
quantized_model = AutoAWQForCausalLM.from_quantized(
    quant_path, 
    device="cuda:0"
)

Method Comparison

Method | Quantization Speed | Quality     | Ideal Use
GPTQ   | Slow               | High        | GPU inference, production
GGUF   | Fast               | Medium-High | CPU inference, edge devices
AWQ    | Medium             | Very High   | GPU inference, best quality

Practical Tips for Method Selection

When choosing a quantization method, consider the following factors:

  • Hardware: GGUF for CPU, GPTQ/AWQ for GPU
  • Latency: GGUF fastest startup, AWQ fastest GPU inference
  • Memory: All methods dramatically reduce memory requirements
  • Quality: AWQ > GPTQ > GGUF (generally)
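These rules of thumb can be condensed into a small, purely illustrative helper (a hypothetical sketch of the heuristics above, not a library API):

```python
def pick_method(hardware: str, priority: str) -> str:
    """Heuristic quantization-method picker based on the rules of thumb above."""
    if hardware == "cpu":
        return "GGUF"   # CPU and edge devices
    if priority == "quality":
        return "AWQ"    # best quality preservation on GPU
    return "GPTQ"       # mature GPU inference path

print(pick_method("cpu", "speed"))    # GGUF
print(pick_method("gpu", "quality"))  # AWQ
```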

For developers, I recommend experimenting with different bit-width combinations. 4-bit quantization usually provides the best compression/quality ratio, while 8-bit preserves nearly original performance.

Summary

Quantization is an essential technique for deploying large language models on regular hardware. GPTQ offers excellent quality for GPU applications, GGUF is ideal for CPU inference and experimentation, while AWQ represents current state-of-the-art in quality preservation. Choosing the right method depends on specific application requirements, available hardware, and tolerance for speed/quality trade-offs.


CORE SYSTEMS team

We build core systems and AI agents that keep operations running. 15 years of experience in enterprise IT.