# GGUF Calibration for Activation Encodings

This tutorial explains how to use the `GGUFCalibrator` to generate activation-calibrated encodings from a
GGUF model. Running calibration before building with the [GenAI Builder for GGUF](https://docs.qualcomm.com/doc/80-87189-2/topic/gguf_builder.html) improves
on-device inference performance by enabling tighter integer-kernel selection during compilation.

Warning

Activation calibration is a **performance-accuracy trade-off**. Enabling it typically improves throughput
and reduces latency on-device, but may introduce a slight degradation in model accuracy compared to the
default float-activation path. Evaluate the accuracy impact on your target task before deploying.

## When to use this workflow

GGUF files store **weight-only** block-quantized values (e.g. Q4\_0, Q4\_K). Activations are left in floating
point. Performing activation calibration allows the compiler to select tighter integer kernels that better
match the runtime distribution of the model, which typically improves throughput and reduces latency.
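
To make the idea of an activation encoding concrete, here is a minimal sketch of min/max calibration: observe activation values over a few batches, then derive a scale and offset that map the observed range onto integer levels. It is a generic illustration, not the algorithm used by `GGUFCalibrator` or AIMET, which may rely on more robust range estimators. The names `Encoding`, `calibrate_min_max`, `quantize`, and `dequantize` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass
class Encoding:
    """Illustrative asymmetric integer encoding (not a QAIRT or AIMET type)."""
    scale: float
    offset: int
    bitwidth: int


def calibrate_min_max(batches: Iterable[Sequence[float]], bitwidth: int = 8) -> Encoding:
    """Derive an encoding from the running min/max seen across all batches."""
    lo, hi = 0.0, 0.0  # keep 0.0 inside the range so it maps to an exact integer
    for batch in batches:
        lo = min(lo, min(batch))
        hi = max(hi, max(batch))
    levels = 2 ** bitwidth - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    offset = round(-lo / scale)
    return Encoding(scale=scale, offset=offset, bitwidth=bitwidth)


def quantize(x: float, enc: Encoding) -> int:
    """Map a float to the nearest integer level, clamped to the encoding range."""
    q = round(x / enc.scale) + enc.offset
    return max(0, min(2 ** enc.bitwidth - 1, q))


def dequantize(q: int, enc: Encoding) -> float:
    """Map an integer level back to its float value."""
    return (q - enc.offset) * enc.scale


# Hypothetical activation values observed over two calibration batches.
enc = calibrate_min_max([[-1.0, 0.5], [2.0, 0.0]])
print(enc, quantize(0.5, enc), dequantize(quantize(0.5, enc), enc))
```

Values outside the calibrated range are clamped, which is one source of the accuracy loss mentioned above: the encoding is only as good as the activations seen during calibration.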

Note

Activation quantization is only supported for GGUF models whose tensors are quantized using the
**Q4\_0** or **Q3\_K** data types. Running calibration on models with other quantization types will
not yield the expected performance gains.

Use this workflow when improved on-device performance is the primary goal and a slight degradation in
model accuracy is acceptable. If your task requires maximum accuracy, use the
[GenAI Builder for GGUF](https://docs.qualcomm.com/doc/80-87189-2/topic/gguf_builder.html) workflow without calibration.

## Configurations

- Host OS: Linux (x86\_64)
- Target Devices: Snapdragon Android Device
- Processor: Qualcomm NPU
- Backend: HTP

## Step 1: Setup

See [Setup instructions](https://docs.qualcomm.com/doc/80-87189-2/topic/setup.html) to configure the environment.
A machine with at least 128 GB of RAM is recommended. If you don’t have sufficient RAM, increase your swap
memory.
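
A quick preflight check can confirm whether the host meets the RAM recommendation before starting a long run. The sketch below uses only the Python standard library and assumes a Linux host; the helper name `total_ram_gb` is illustrative and is not part of QAIRT.

```python
import os

RECOMMENDED_GB = 128  # recommendation from this tutorial


def total_ram_gb():
    """Return physical RAM in GB, or None if it cannot be determined."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (ValueError, OSError, AttributeError):
        return None  # sysconf names are not available on this platform
    if pages <= 0 or page_size <= 0:
        return None
    return pages * page_size / 1e9


ram_gb = total_ram_gb()
if ram_gb is None:
    print("Could not determine physical RAM on this platform.")
elif ram_gb < RECOMMENDED_GB:
    print(f"{ram_gb:.0f} GB RAM detected; consider adding swap before calibrating.")
else:
    print(f"{ram_gb:.0f} GB RAM detected; meets the recommendation.")
```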

The calibration workflow requires two additional dependencies that aren’t part of the base QAIRT installation.
Install them before proceeding:

    pip install aimet_torch==2.21.0 datasets==3.0.0

Then import the calibrator classes and point to the GGUF model:

    import os
    import pathlib as pl

    from qairt.modules.gguf_module.calibration import GGUFCalibrator, GGUFCalibrationConfig

    # Path to the GGUF model file
    model_path = "./Llama-3.2-1B-Q4_0.gguf"

Tip

Set the environment variable **QAIRT\_TMP\_DIR** to define an alternative default temporary directory.
Calibration creates intermediate DLC graphs and torch artifacts that can consume several gigabytes of
temporary space for large models.

    os.environ["QAIRT_TMP_DIR"] = "./llm_scratch/"
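
Because the intermediate artifacts can take several gigabytes, it may help to check free space in the scratch directory first. The following is a minimal standard-library sketch; the `MIN_FREE_GB` threshold and the `free_space_gb` helper are assumptions for illustration, not QAIRT settings.

```python
import os
import shutil
import tempfile
from pathlib import Path

MIN_FREE_GB = 10.0  # hypothetical threshold; size it to your model


def free_space_gb(directory: str) -> float:
    """Create the directory if needed and return its free disk space in GB."""
    path = Path(directory)
    path.mkdir(parents=True, exist_ok=True)
    return shutil.disk_usage(path).free / 1e9


# Fall back to the system temp directory when QAIRT_TMP_DIR is not set.
scratch_dir = os.environ.get("QAIRT_TMP_DIR") or tempfile.gettempdir()
free_gb = free_space_gb(scratch_dir)
if free_gb < MIN_FREE_GB:
    print(f"Warning: only {free_gb:.1f} GB free in {scratch_dir}")
else:
    print(f"{free_gb:.1f} GB free in {scratch_dir}")
```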

Tip

Calibration can be time- and memory-intensive. The `GGUFCalibrator` supports a caching mechanism that
stores intermediate artifacts (torch models, calibration encodings, processed weights) so that
individual steps can be resumed without recomputation. Define a `CACHE_ROOT` directory and pass it
to the calibrator.

    CACHE_ROOT = "./llama3.2_calibration_cache/"
    SAVE_PATH = "./llama3.2_encodings/"

## Step 2: Configure the Calibrator

`GGUFCalibrationConfig` exposes the key knobs for the calibration run.

| Parameter | Default | Description |
| --- | --- | --- |
| `arn` | `73` | Auto-regressive length: the sequence length used during calibration. Increasing this captures longer-range activation patterns at the cost of more compute. |
| `context_length` | `4096` | Maximum context length of the model. Must match the value used when building the model. |
| `num_iterations` | `200` | Number of forward passes used to collect activation statistics. More iterations improve encoding quality at the cost of longer runtime. |
| `torch_model_name` | `"ConvertedModel"` | Internal name for the converted torch model artifacts. |
| `quantizer_class` | `ActivationQuantizer` | Quantizer used for activation calibration. Must be a subclass of `BaseQuantizer`. |

    config = GGUFCalibrationConfig(
        arn=73,
        context_length=4096,
        num_iterations=200,
    )

Note

All parameters are validated on construction. Passing a non-positive integer for `arn`,
`context_length`, or `num_iterations` will raise a `ValueError`.
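
The validation behavior described in this note can be pictured with a small stand-in class. The sketch below is not the QAIRT implementation: `CalibrationSettings` is a hypothetical dataclass that only mirrors the three integer parameters and the documented `ValueError` on non-positive values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalibrationSettings:
    """Hypothetical stand-in for illustration; not the QAIRT config class."""
    arn: int = 73
    context_length: int = 4096
    num_iterations: int = 200

    def __post_init__(self) -> None:
        # Validate on construction, as the note above describes.
        for name in ("arn", "context_length", "num_iterations"):
            value = getattr(self, name)
            if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                raise ValueError(f"{name} must be a positive integer, got {value!r}")


try:
    CalibrationSettings(num_iterations=0)
except ValueError as err:
    print(f"Rejected: {err}")
```

In your own scripts, wrapping the `GGUFCalibrationConfig` construction in a similar `try`/`except ValueError` block lets a bad command-line argument fail fast, before any model conversion starts.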

## Step 3: Run Calibration

Instantiate `GGUFCalibrator` and call `generate()`. The calibrator will run the full pipeline and
return the path to the saved `.encodings` file.

Note

Calibration is performed using the [wikitext-2-raw-v1](https://huggingface.co/datasets/Salesforce/wikitext)
subset of the WikiText dataset. The dataset used for calibration is fixed and is not currently user-configurable.
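
To give a feel for how `arn` and `num_iterations` relate to the calibration data, the sketch below slices a token stream into fixed-length windows, one per forward pass. This is an assumption about the general shape of such a loop, not a description of what `GGUFCalibrator` does internally. `load_wikitext_lines` and `calibration_windows` are hypothetical helpers, the choice of the `train` split is a guess, and tokenization is left out.

```python
from typing import Iterator, List, Sequence


def load_wikitext_lines(split: str = "train") -> List[str]:
    """Load non-empty text lines; needs the `datasets` package and network access."""
    from datasets import load_dataset  # imported lazily so the rest runs offline

    dataset = load_dataset("Salesforce/wikitext", "wikitext-2-raw-v1", split=split)
    return [row["text"] for row in dataset if row["text"].strip()]


def calibration_windows(
    token_ids: Sequence[int], arn: int, num_iterations: int
) -> Iterator[List[int]]:
    """Yield up to num_iterations non-overlapping windows of exactly arn tokens."""
    for i in range(num_iterations):
        start = i * arn
        window = list(token_ids[start:start + arn])
        if len(window) < arn:
            break  # not enough tokens left for a full window
        yield window


# Stand-in token ids; a real run would tokenize the dataset text first.
fake_tokens = list(range(1000))
windows = list(calibration_windows(fake_tokens, arn=73, num_iterations=200))
print(f"{len(windows)} windows of {len(windows[0])} tokens each")
```

Under this assumption, the default settings would consume at most 73 × 200 = 14,600 tokens of calibration text.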

    calibrator = GGUFCalibrator(model_path, config=config, cache_dir=CACHE_ROOT)
    encoding_path: pl.Path = calibrator.generate(save_path=SAVE_PATH, filename="gguf_processed")

    print(f"Encodings saved to: {encoding_path}")

The output should look like:

    Encodings saved to: ./llama3.2_encodings/gguf_processed_MHA_32_ar73.encodings
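
Before passing the file on, a quick sanity check can confirm it exists and is readable. The sketch below assumes the `.encodings` file is a JSON object, as AIMET-exported encodings typically are; that format, and key names such as `activation_encodings`, are assumptions not confirmed by this tutorial. `summarize_encodings` is a hypothetical helper, and the demo writes its own tiny file rather than reading a real one.

```python
import json
import tempfile
from pathlib import Path


def summarize_encodings(path) -> dict:
    """Return each top-level key with its entry count (or its scalar value)."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"No encodings file at {path}")
    with path.open() as handle:
        data = json.load(handle)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    return {
        key: len(value) if isinstance(value, (dict, list)) else value
        for key, value in data.items()
    }


# Demo on a tiny made-up file; the keys here are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.encodings"
    demo_path.write_text(json.dumps({
        "version": "1.0.0",
        "activation_encodings": {"layer0_out": {}, "layer1_out": {}},
        "param_encodings": {},
    }))
    demo_summary = summarize_encodings(demo_path)

print(demo_summary)
```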

## Step 4: Use the Encodings in the GenAI Builder for GGUF

Pass the `.encodings` file generated above into the builder workflow described in
[GGUF Inference on HTP](https://docs.qualcomm.com/doc/80-87189-2/topic/gguf_builder.html).

    from qairt.gen_ai_api.gen_ai_builder_factory import GenAIBuilderFactory
    from qairt.gen_ai_api.builders.gen_ai_builder_htp import GenAIBuilder
    from qairt.api.compiler.config import CompileConfig
    from qairt.api.common.backends.htp.config import HtpGraphConfig

    llama_builder: GenAIBuilder = GenAIBuilderFactory.create(
        model_path, "HTP", cache_root="./llama3.2_cache/"
    )

    graph_config = [HtpGraphConfig(name="model", fp16_relaxed_precision=1, optimization_type=3)]
    compile_config = CompileConfig(
        backend="HTP",
        soc_details="chipset:SM8750;dsp_arch:v79;soc_model:69",
        graph_custom_configs=graph_config,
    )
    llama_builder.set_compilation_options(compile_config)

    # Inject the calibrated encodings
    llama_builder.encodings_path = str(encoding_path)

    container = llama_builder.build()

Last Published: May 26, 2026
