# GGUF Inference on HTP

This tutorial walks step by step through building and deploying a pre-quantized large language model (LLM) in [GGUF format](https://github.com/ggml-org/ggml/blob/master/docs/gguf.md) on a Snapdragon device.

Configurations:

> - Host OS: Linux (x86\_64) with Android Debug Bridge (ADB) installed
> - Target device: Snapdragon Android device
> - Processor: Qualcomm NPU
> - Backend: HTP

## Step 1: Setup

Refer to the [Setup instructions](https://docs.qualcomm.com/doc/80-87189-2/topic/setup.html) to configure the environment.
It is recommended to use a machine with at least 128 GB of RAM for timely completion of the workflow. If you do not have sufficient RAM, increase
your swap memory.
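
Before starting, it can help to check how much RAM and swap the host actually has. The following is a minimal sketch for a Linux host, assuming the usual `/proc/meminfo` layout. The helper names `parse_meminfo` and `total_memory_gib` are hypothetical and are not part of the QAIRT SDK.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: integer value}.

    Size fields such as MemTotal and SwapTotal are reported in kB.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def total_memory_gib(meminfo: dict) -> float:
    """Return RAM plus swap in GiB, treating the kB values as KiB."""
    total_kb = meminfo.get("MemTotal", 0) + meminfo.get("SwapTotal", 0)
    return total_kb / (1024 ** 2)


if __name__ == "__main__":
    meminfo_path = Path("/proc/meminfo")  # present on Linux hosts only
    if meminfo_path.exists():
        info = parse_meminfo(meminfo_path.read_text())
        print(f"RAM + swap: {total_memory_gib(info):.1f} GiB")
```

If the reported total is well below the recommended 128 GB, adding swap before the build is likely to save time later.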

> - The [Hugging Face Hub](https://huggingface.co/models?library=gguf&sort=trending) offers a range of publicly available GGUF models.
> - At present, models quantized using the following schemes are supported:
>     - Q2\_K
>     - Q3\_K
>     - Q4\_0
>     - Q4\_1
>     - Q4\_K
>     - Q5\_0
>     - Q5\_1
>     - Q5\_K
>     - Q6\_K
>     - Q8\_0
> - Alternatively, you can use [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) or [llama.cpp](https://github.com/ggml-org/llama.cpp/tree/master) to create a quantized GGUF model.
> - This tutorial uses a GGUF model created from the [Meta Llama 3.2-1B-Instruct model](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) using llama.cpp. Refer to [this section](https://docs.qualcomm.com/doc/80-87189-2/topic/appendix.html#how-to-create-gguf-model) for steps to create a quantized GGUF model with llama.cpp.
> - The name of the GGUF file should follow the naming convention `<BaseName><SizeLabel><FineTune><Version><Encoding><Type>.gguf`, where the components that are present are delimited by `-`.
> - This tutorial uses a **Q4\_0** quantized model and a **Snapdragon 8 Elite (SM8750) Android device** to demonstrate the workflow.
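
The naming convention and the supported-scheme list above can be combined into a quick pre-flight check. This is a minimal sketch: the helper names are hypothetical, and treating size-suffixed names such as `Q4_K_M` as the base scheme `Q4_K` is an assumption rather than something this tutorial states.

```python
import re
from pathlib import Path
from typing import Optional

# Schemes listed as supported in this tutorial.
SUPPORTED_SCHEMES = {
    "Q2_K", "Q3_K", "Q4_0", "Q4_1", "Q4_K",
    "Q5_0", "Q5_1", "Q5_K", "Q6_K", "Q8_0",
}

# Matches an encoding component such as Q4_0 or Q4_K_M.
# Assumption: the size suffixes _S, _M and _L map to the base K scheme.
_ENCODING_RE = re.compile(r"^(Q\d_(?:0|1|K))(?:_[SML])?$", re.IGNORECASE)


def find_quant_scheme(filename: str) -> Optional[str]:
    """Return the base quantization scheme found in a GGUF file name, if any."""
    stem = Path(filename).stem  # drops the ".gguf" extension
    for component in stem.split("-"):
        match = _ENCODING_RE.match(component)
        if match:
            return match.group(1).upper()
    return None


def is_supported(filename: str) -> bool:
    """True if the file name carries one of the schemes listed above."""
    return find_quant_scheme(filename) in SUPPORTED_SCHEMES
```

For the file used in this tutorial, `find_quant_scheme("./Llama-3.2-1B-Q4_0.gguf")` returns `"Q4_0"`. A name without a recognizable encoding component returns `None`, and `is_supported` reports `False` for it.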

    import os
    import qairt

    from qairt.gen_ai_api.gen_ai_builder_factory import GenAIBuilderFactory
    from qairt.gen_ai_api.builders.gen_ai_builder_htp import GenAIBuilderHTP
    from qairt.gen_ai_api.builders.llama.builder import LlamaBuilderHTP
    from qairt.gen_ai_api.containers.llm_container import LLMContainer
    from qairt.gen_ai_api.containers.gen_ai_container import GenAIContainer
    from qairt.gen_ai_api.executors.t2t_executor import T2TExecutor
    from qairt.api.compiler.config import CompileConfig
    from qairt.api.common.backends.htp.config import HtpGraphConfig
    from qairt.api.transforms.model_transformer_config import (
        ARn_ContextLengthConfig,
        MhaConfig,
        ModelTransformerConfig,
        SplitModelConfig,
    )
    from qairt import Device, DevicePlatformType

    # Set the path of the GGUF model
    model_path = "./Llama-3.2-1B-Q4_0.gguf"
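
Before handing the file to the builder, a quick sanity check of the GGUF header can catch a truncated or mislabeled download. The sketch below assumes the little-endian header layout of GGUF version 2 and later (4-byte magic, `uint32` version, then two `uint64` counts). The helper names are hypothetical and independent of the QAIRT API.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def parse_gguf_header(header: bytes) -> dict:
    """Decode the fixed-size part of a GGUF header from raw bytes."""
    if len(header) < HEADER_SIZE or header[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF header")
    # Little-endian: version, tensor count, metadata key/value count
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:HEADER_SIZE])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


def read_gguf_header(path: str) -> dict:
    """Read and decode the header of the GGUF file at `path`."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(HEADER_SIZE))
```

Calling `read_gguf_header(model_path)` raises `ValueError` if the file does not start with the `GGUF` magic, which is usually a sign that the wrong file was downloaded.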

Tip

Set the environment variable **QAIRT\_TMP\_DIR** to override the default temporary directory path.
This is recommended because the build process below creates temporary artifacts that can fill the default temporary location entirely.

    os.environ["QAIRT_TMP_DIR"] = "./llm_scratch/"
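
Since the point of relocating the temporary directory is to avoid running out of space, it may be worth creating the directory and checking its free space up front. This is a minimal sketch using only the standard library; `prepare_scratch_dir` and its `min_free_gib` threshold are hypothetical, and the tutorial does not state how much space the build needs.

```python
import os
import shutil
import tempfile


def prepare_scratch_dir(path: str, min_free_gib: float) -> float:
    """Create `path` if needed and return its free space in GiB.

    Raises RuntimeError when less than `min_free_gib` is available.
    """
    os.makedirs(path, exist_ok=True)
    free_gib = shutil.disk_usage(path).free / (1024 ** 3)
    if free_gib < min_free_gib:
        raise RuntimeError(
            f"{path} has {free_gib:.1f} GiB free, need at least {min_free_gib:.1f} GiB"
        )
    return free_gib


if __name__ == "__main__":
    # Demo against a throwaway directory; in this tutorial the path
    # would be "./llm_scratch/".
    with tempfile.TemporaryDirectory() as tmp:
        scratch = os.path.join(tmp, "llm_scratch")
        print(f"{prepare_scratch_dir(scratch, min_free_gib=0.0):.1f} GiB free")
```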

Tip

Building can be time- and memory-intensive, especially for large models, so it may help to stop and resume the build between steps.
To support that workflow, the GenAI Builder API provides a caching mechanism that stores intermediate artifacts. Define your cache root here;
it is passed into the factory in Step 2.

    CACHE_ROOT = "./llama3.2_cache/"

Tip

For models with a large number of parameters (such as 7 billion), chat/base models, or reasoning models, execution on the device may fail because it exceeds the default ADB timeout (300 seconds).
To avoid this, set the environment variable **ADB\_DEFAULT\_TIMEOUT** to a higher value.

    os.environ["ADB_DEFAULT_TIMEOUT"] = "1000"
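
The assignment above overwrites whatever timeout is already set. If a larger value may already be exported in your shell, a variant that only ever raises the value could be used instead. This is a sketch; `ensure_min_env_int` is a hypothetical helper, not part of the SDK.

```python
import os


def ensure_min_env_int(name: str, minimum: int) -> int:
    """Raise an integer environment variable to at least `minimum`.

    An existing larger value is kept; a missing or non-numeric value
    is replaced by `minimum`.
    """
    current = os.environ.get(name, "").strip()
    value = int(current) if current.isdigit() else 0
    value = max(value, minimum)
    os.environ[name] = str(value)
    return value
```

For example, `ensure_min_env_int("ADB_DEFAULT_TIMEOUT", 1000)` leaves an existing value of 2000 untouched and sets 1000 when nothing is configured.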

Note

For better on-device performance, you can run activation calibration on the GGUF model before building.
Calibration enables tighter integer-kernel selection during compilation, which typically improves
throughput and reduces latency. Note that this is a performance–accuracy trade-off and may introduce
a slight degradation in model accuracy. See [GGUF Calibration](https://docs.qualcomm.com/doc/80-87189-2/topic/gguf_calibration.html) for the full
workflow and how to pass the resulting encodings into the builder.

## Step 2: Obtain a GenAIBuilder instance

Create a builder instance using the optional cache root to store intermediate artifacts.

    llama_builder: GenAIBuilderHTP = GenAIBuilderFactory.create(model_path, "HTP", cache_root=CACHE_ROOT)

The factory inspects the `config.json` file for the model and determines which builder is appropriate.
For this example, it selects `LlamaBuilderHTP` and returns a constructed instance of that subclass.
The assertion below confirms that the builder is of the expected type.

    assert isinstance(llama_builder, LlamaBuilderHTP)

## Step 3: Customize the GenAIBuilder

To enable ahead-of-time (AOT) compilation tailored for the target device, the builder's compilation configuration must be customized.
This example configures the compilation for a Snapdragon 8 Elite (SM8750) device by specifying the appropriate backend and SoC details.
GGUF models contain only block-quantized weights, which correspond to the W4Afp16 quantization format. To compile such models,
the graph must be configured to enable FP16 kernel execution.

    # Configure compilation options for W4Afp16
    graph_config = [HtpGraphConfig(name="model", fp16_relaxed_precision=1, optimization_type=3)]
    compile_config = CompileConfig(
        backend="HTP",
        soc_details="chipset:SM8750;dsp_arch:v79;soc_model:69",
        graph_custom_configs=graph_config,
    )
    llama_builder.set_compilation_options(compile_config)

    # Optional steps
    # Set the number of splits (user-configurable; choose based on your requirements)
    split_config = SplitModelConfig(split_embedding=False, num_splits=4)

    # Set autoregressive token counts: 1 for the token generation phase,
    # 64 for the prefill phase (for better performance)
    ar_config = ARn_ContextLengthConfig(auto_regression_number=[1, 64])
    mha_config = MhaConfig()
    transformation_config = ModelTransformerConfig(
        split_model=split_config, arn_cl_options=ar_config, mha_config=mha_config
    )
    llama_builder.set_transformation_options(transformation_config)
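
To make "block-quantized weights" concrete, the sketch below expands a single Q4\_0 block into floating-point weights. It assumes the Q4\_0 layout used by ggml: each block holds 32 weights as one fp16 scale followed by 16 bytes of packed 4-bit values, and each weight is `(q - 8) * scale`. It illustrates the storage format only and says nothing about how the HTP backend executes these weights.

```python
import struct

QK4_0 = 32                    # weights per Q4_0 block
BLOCK_BYTES = 2 + QK4_0 // 2  # fp16 scale + 16 bytes of packed 4-bit values


def dequantize_q4_0_block(block: bytes) -> list:
    """Expand one 18-byte Q4_0 block into 32 float weights."""
    if len(block) != BLOCK_BYTES:
        raise ValueError(f"expected {BLOCK_BYTES} bytes, got {len(block)}")
    # "<e16B": little-endian fp16 scale followed by 16 unsigned bytes
    scale, *packed = struct.unpack("<e16B", block)
    half = QK4_0 // 2
    weights = [0.0] * QK4_0
    for j, byte in enumerate(packed):
        # Low nibble holds element j, high nibble holds element j + 16
        weights[j] = ((byte & 0x0F) - 8) * scale
        weights[j + half] = ((byte >> 4) - 8) * scale
    return weights
```

Because only the weights are stored this way, activations stay in floating point, which is why the graph configuration above enables FP16 kernels.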

## Step 4: Build the GenAIContainer

Once the target is configured, trigger the build to turn the Gen AI model into an LLM container object.

    llama_container: GenAIContainer = llama_builder.build()

Note

The container contains everything that is needed to execute on the prepared target.  It can be saved to disk and copied to
another location, where it can be loaded to resume operation.  The save/load functionality is demonstrated in Step 7.

## Step 5: Set up an Android device

Connect your Android device via ADB and set the `ANDROID_SERIAL` environment variable.

Obtain the ADB device ID by running:

    adb devices

The output lists connected devices:

    List of devices attached
    abcd1234   device

Set `ANDROID_SERIAL` to the device ID shown:

    export ANDROID_SERIAL=abcd1234
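
The same lookup can be scripted from Python. This is a minimal sketch that parses the `adb devices` output format shown above and keeps only entries in the `device` state; the helper names are hypothetical.

```python
import subprocess


def parse_adb_devices(output: str) -> list:
    """Return the serials of devices reported in the 'device' (ready) state."""
    serials = []
    for line in output.splitlines():
        parts = line.split()
        # Skips the "List of devices attached" header as well as
        # entries in other states such as "unauthorized" or "offline"
        if len(parts) >= 2 and parts[1] == "device":
            serials.append(parts[0])
    return serials


def list_ready_devices() -> list:
    """Run `adb devices` and parse its output (requires adb on PATH)."""
    result = subprocess.run(
        ["adb", "devices"], capture_output=True, text=True, check=True
    )
    return parse_adb_devices(result.stdout)
```

With exactly one ready device, `os.environ["ANDROID_SERIAL"] = list_ready_devices()[0]` has the same effect for the current Python process as the `export` above.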

If your device is connected to a remote machine, see the
[remote device troubleshooting](https://docs.qualcomm.com/doc/80-87189-2/topic/genai_builder.html#remote-device-troubleshooting) section in
the [LLM Inference on HTP](https://docs.qualcomm.com/doc/80-87189-2/topic/genai_builder.html#genai-builder) tutorial.

    android_serial = os.getenv("ANDROID_SERIAL")
    android_hostname = os.getenv("ANDROID_HOSTNAME")

    device_id = f"{android_serial}@{android_hostname}" if android_hostname else android_serial
    android_device = Device(identifier=device_id, type=DevicePlatformType.ANDROID)

## Step 6: Generate text

The container generated in Step 4 is used to create an executor, which interfaces with the device
and performs inference. Each executor is specific to a target and an inference mode; in this example,
the executor is a *text-to-text executor*.

    from qairt.gen_ai_api.executors.gen_ai_executor import GenAIExecutor, GenerationExecutionResult

    llm: GenAIExecutor = llama_container.get_executor(android_device, clean_up=False)

    # Ensure the appropriate executor is returned
    assert isinstance(llm, T2TExecutor)

The prompt template and prompt below follow the specification for
the Llama 3.2-1B Instruct variant; the prompt is then used to generate text.

    prompt_template = (
        "<|begin_of_text|>"
        "<|start_header_id|>{system}<|end_header_id|>{system_prompt}<|eot_id|>"
        "<|start_header_id|>{user}<|end_header_id|>{user_prompt}<|eot_id|>"
        "<|start_header_id|>{assistant}<|end_header_id|>"
    )
    prompt = prompt_template.format(
        system="system",
        system_prompt="Be helpful but try to limit answers to 40 words.",
        user="user",
        user_prompt="What can I do with a glass jar?",
        assistant="assistant",
    )

    # Generate text
    result: GenerationExecutionResult = llm.generate(prompt)
    print(result.generated_text)

The code above prints output similar to the following:

    There are many creative and practical uses for a glass jar! Here are some ideas:
    * Store small items like spices, beads, or candies
    * Use as a vase for flowers or branches
    * Create a centerpiece or decorative accent
    * Make homemade candles or potpourri
    * Display decorative objects like seashells, pebbles, or marbles
    * Store food or snacks like nuts, coffee, or tea
    * Use as a planter for small plants or herbs
    * Craft jewelry or decorations
    * Create a self-watering container for seedlings or small plants

Metrics can be inspected and printed to the console:

    print(result.metrics)

    Timing (microseconds):
    Init = 1280212 us
    Prompt Processing Time = 831133 us
    Token Generation Time = 4872124 us

    Tokens per second (toks/sec):

    Prompt Processing Rate = 24.445096969604492 toks/sec
    Token Generation Rate = 24.424795150756836 toks/sec
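
The timings are reported in microseconds and the rates in tokens per second. The sketch below shows how the two relate, assuming a rate is simply the token count divided by the elapsed time. That relationship is an assumption about how the executor derives its numbers, and the helper names are hypothetical.

```python
def tokens_per_second(num_tokens: float, elapsed_us: float) -> float:
    """Convert a token count and an elapsed time in microseconds to toks/sec."""
    return num_tokens / (elapsed_us / 1_000_000)


def estimated_tokens(rate_toks_per_sec: float, elapsed_us: float) -> float:
    """Invert the relationship: estimate the token count from rate and time."""
    return rate_toks_per_sec * elapsed_us / 1_000_000
```

Under that assumption, the token generation figures above (about 24.42 toks/sec over 4872124 us) correspond to roughly 119 generated tokens.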

The executor pushes artifacts to the device. Remove them explicitly with the following call:

    llm.clean_environment()

## Step 7 (Optional): Save and load the LLM container

You can save the container and its associated artifacts, and subsequently load the container from a different environment.

    llama_container.save("./llama3.2_container", exist_ok=True)

    from qairt.gen_ai_api.containers.llm_container import LLMContainer

    # Load a container
    container = LLMContainer.load("./llama3.2_container")
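
To move a saved container to another machine, one option is to bundle it into a single archive first. The sketch below assumes that `save()` writes a directory at the given path, which this tutorial does not state explicitly. The helper names are hypothetical and use only the standard library.

```python
import os
import shutil
import tempfile


def pack_container(container_dir: str, archive_base: str) -> str:
    """Create <archive_base>.tar.gz from a saved container directory.

    Keep `archive_base` outside `container_dir` so the archive does not
    try to include itself. Returns the path of the archive.
    """
    if not os.path.isdir(container_dir):
        raise NotADirectoryError(container_dir)
    return shutil.make_archive(archive_base, "gztar", root_dir=container_dir)


def unpack_container(archive_path: str, target_dir: str) -> None:
    """Extract a packed container into `target_dir`."""
    shutil.unpack_archive(archive_path, target_dir)


if __name__ == "__main__":
    # Round-trip demo with a stand-in directory and a dummy artifact.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "llama3.2_container")
        os.makedirs(src)
        with open(os.path.join(src, "artifact.bin"), "wb") as f:
            f.write(b"\x00\x01")
        archive = pack_container(src, os.path.join(tmp, "container_bundle"))
        unpack_container(archive, os.path.join(tmp, "restored"))
        print(sorted(os.listdir(os.path.join(tmp, "restored"))))
```

After unpacking on the destination machine, the restored directory can be passed to `LLMContainer.load` as shown above.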

Last Published: May 26, 2026
