# LLM Inference on HTP

This tutorial shows how to build and deploy a Large Language Model (LLM) on a Snapdragon device
using the Gen AI Builder API. For an overview of what the builder does and how it works, see
[Gen AI Builder Overview](https://docs.qualcomm.com/doc/80-87189-2/topic/genai_overview.html#genai-overview).

Note

A simplified version of this tutorial is available in the QAIRT SDK at `examples/QAIRT/python/llm_on_device_inference.py`.

## Configurations

- Host OS: Linux (x86\_64) with ADB (Android Debug Bridge) installed.
- Target Devices: Snapdragon Android Device
- Processor: Qualcomm NPU
- Backend: HTP

## Step 1: Setup

We recommend a machine with at least 64 GB of RAM for timely completion of the workflow. If you do not have sufficient RAM, we recommend increasing
your swap memory. This workflow may take **at least 40 minutes** on a machine with less than 64 GB of RAM.
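
Before starting, it can help to confirm how much physical RAM the host has. The following is a minimal sketch for a Linux host that uses only the Python standard library. The helper name `total_ram_gb` and the constant `RECOMMENDED_RAM_GB` are illustrative and are not part of the QAIRT SDK.

```python
import os

RECOMMENDED_RAM_GB = 64  # threshold taken from the recommendation above


def total_ram_gb() -> float:
    """Return total physical RAM in GiB (relies on POSIX sysconf, e.g. Linux)."""
    page_size = os.sysconf("SC_PAGE_SIZE")    # bytes per memory page
    page_count = os.sysconf("SC_PHYS_PAGES")  # number of physical pages
    return page_size * page_count / (1024 ** 3)


ram_gb = total_ram_gb()
if ram_gb < RECOMMENDED_RAM_GB:
    print(f"{ram_gb:.1f} GiB RAM detected; consider adding swap before building.")
else:
    print(f"{ram_gb:.1f} GiB RAM detected.")
```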

### Prerequisites

- This tutorial uses the Meta Llama 3-8b-instruct model. You can download the model from Hugging Face: [Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
using a valid license.
- The guide assumes you have obtained the quantized ONNX model and associated model artifacts which have been generated using
Step 1 of this notebook: [Llama 3-8b](https://qpm.qualcomm.com/#/main/tools/details/Tutorial_for_Llama3)
- The guide uses a **Snapdragon 8 Elite (SM8750) Android device** to demonstrate the workflow.

### Input Directory Structure

The builder expects the following directory layout. This structure is produced by the Step 1 export notebook:

    <model_exports>/
        onnx/
            <model>.onnx           # Quantized ONNX model
            <model>.encodings      # Quantization encodings (JSON)
            <model>.data           # External weight data (optional)
        config.json                # Model architecture config (Hugging Face format)
        tokenizer.json             # Tokenizer definition
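
A quick pre-flight check of this layout can catch missing files before a long build starts. The sketch below uses only the Python standard library; the function name `check_export_layout` is hypothetical and is not provided by the SDK.

```python
from pathlib import Path


def check_export_layout(export_dir: str) -> list[str]:
    """Return a list of problems found in the export directory (empty if none)."""
    root = Path(export_dir)
    problems = []

    # config.json and tokenizer.json sit at the top level of the export.
    for name in ("config.json", "tokenizer.json"):
        if not (root / name).is_file():
            problems.append(f"missing {name}")

    # The quantized model and its encodings live under onnx/.
    onnx_dir = root / "onnx"
    if not list(onnx_dir.glob("*.onnx")):
        problems.append("missing onnx/<model>.onnx")
    if not list(onnx_dir.glob("*.encodings")):
        problems.append("missing onnx/<model>.encodings")

    # <model>.data is optional, so it is not checked.
    return problems
```

An empty list means every required file from the layout above was found.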

Note

You can also obtain [config.json](https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/config.json) and
[tokenizer.json](https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/tokenizer.json)
from Hugging Face directly.

Note

The `pretrained_model_path` passed to the factory can be the top-level directory or the path to the
`.onnx` file directly. If a directory is provided, the factory locates the ONNX model automatically.
You can also provide explicit `tokenizer_path` and `config_path` arguments if your files are in
non-standard locations.

    import os

    import qairt
    from qairt import Device, DevicePlatformType
    from qairt.gen_ai_api.builders.gen_ai_builder_htp import GenAIBuilderHTP
    from qairt.gen_ai_api.builders.llama.builder import LlamaBuilderHTP
    from qairt.gen_ai_api.containers.gen_ai_container import GenAIContainer
    from qairt.gen_ai_api.containers.llm_container import LLMContainer
    from qairt.gen_ai_api.executors.t2t_executor import T2TExecutor
    from qairt.gen_ai_api.gen_ai_builder_factory import GenAIBuilderFactory

    ############################################################
    # Define the path containing onnx model exports
    llama3_exports = "./llama_3_8b/<your_path>"

Tip

Set the environment variable **QAIRT\_TMP\_DIR** to define an alternative default temporary directory path.
This is recommended because the build process below creates temporary artifacts that can fill the default temporary directory.

    os.environ["QAIRT_TMP_DIR"] = "./llm_scratch/"
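
It may be worth confirming that the chosen directory exists and has room before building. The sketch below is one way to do that with the standard library. The helper name `free_space_gb` is hypothetical, and the 50 GB figure is the guideline given in the Troubleshooting section of this tutorial, not a limit enforced by the SDK.

```python
import os
import shutil

MIN_FREE_GB = 50  # guideline from the Troubleshooting section, not an enforced limit


def free_space_gb(path: str) -> float:
    """Create `path` if needed and return the free space on its volume in GiB."""
    os.makedirs(path, exist_ok=True)
    return shutil.disk_usage(path).free / (1024 ** 3)
```

For example, compare `free_space_gb(os.environ["QAIRT_TMP_DIR"])` against `MIN_FREE_GB` and pick a larger volume if the result is lower.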

## Step 2: Obtain a GenAIBuilder instance

Create a builder instance, using the optional cache root to store intermediate artifacts.

    llama_builder = GenAIBuilderFactory.create(llama3_exports, "HTP", cache_root="./llama3_cache")

The factory inspects `config.json` for the model and determines which builder subclass is appropriate.
For Llama models, it returns a `LlamaBuilderHTP` instance:

    assert isinstance(llama_builder, LlamaBuilderHTP)

## Step 3: Customize the GenAIBuilder

The builder’s compilation configuration must be customized for the intended device so that
Ahead-of-Time (AOT) compilation is specialized for that target.

### Setting the Target Device

Use `set_targets()` with a chipset specification string:

    llama_builder.set_targets(["chipset:SM8750"])

The chipset string format is `chipset:<CHIPSET_ID>`. Common chipsets:

| Chipset String | Common Name | Platform |
| --- | --- | --- |
| `chipset:SM8750` | Snapdragon 8 Elite | Android |
| `chipset:SM8775` | Snapdragon 8s Elite | Android |
| `chipset:SC8380XP` | Snapdragon X Elite | Windows |
| `chipset:QCS8550` | Qualcomm QCS8550 | Linux IoT |

For a complete list of supported chipset identifiers, see the
[Supported Snapdragon Devices](https://docs.qualcomm.com/doc/80-63442-10/topic/QNN_general_overview.html#supported-snapdragon-devices)
table in the QNN documentation. The chipset ID corresponds to the SoC identifier in that table
(e.g., `SM8750`, `SC8380XP`).
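
To catch malformed target strings early, the `chipset:<CHIPSET_ID>` format can be checked before calling `set_targets()`. This is a minimal sketch: the helper `chipset_id` is hypothetical, it assumes chipset IDs are purely alphanumeric (as in the table above), and it only validates the format, not whether the SDK supports the chipset.

```python
import re

# Assumption: chipset IDs consist of letters and digits only.
_CHIPSET_PATTERN = re.compile(r"chipset:([A-Za-z0-9]+)")


def chipset_id(target: str) -> str:
    """Extract the chipset ID from a 'chipset:<CHIPSET_ID>' target string."""
    match = _CHIPSET_PATTERN.fullmatch(target)
    if match is None:
        raise ValueError(f"expected 'chipset:<CHIPSET_ID>', got {target!r}")
    return match.group(1)


print(chipset_id("chipset:SM8750"))  # SM8750
```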

Note

`set_targets()` configures SoC-specific defaults (DSP architecture, optimization level, performance profile).
You can further customize these with `set_compilation_options()`. See [Configuring the Gen AI Builder](https://docs.qualcomm.com/doc/80-87189-2/topic/genai_builder_configuration.html#genai-builder-configuration) for details.

### Optional Configuration

Before building, you can optionally customize the build:

    # Enable native KV cache format (recommended for performance)
    llama_builder.native_kv = True

    # Enable multiple context lengths
    llama_builder.multi_graph = True

    # Override transformation options (e.g., set custom AR numbers and split count)
    llama_builder.set_transformation_options(options={
        "arn": [32, 128],
        "context_length": [2048, 4096],
        "split.num_splits": 4,
    })

    # Override compilation options (e.g., HTP performance tuning)
    llama_builder.set_compilation_options(options={
        "graphs.vtcm_size_in_mb": 8,
        "devices.cores.perf_profile": "burst",
    })

    # Enable parallel build to fan out the convert, compile, and AR/CL phases
    # across multiple CPU cores (requires cache_root to be set; see Step 2)
    llama_builder.parallel_build = True

Note

When `parallel_build=True`, worker output is redirected to per-task log files under
`<cache_root>/logs/<phase>/` rather than the console. Resource profiling is automatically
disabled for the duration of the build.

To limit the number of parallel workers, set the `QAIRT_MAX_BUILD_WORKERS` environment
variable, or pass `max_workers=N` when constructing the builder directly instead of using
the factory.
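
As an illustration, the worker limit could be derived from the host's core count. The sketch below assumes the variable is read as a positive integer when the build starts, so it is set before `build()` is called; the half-the-cores policy is an arbitrary choice for the example.

```python
import os

# Leave roughly half of the logical cores free for other work.
# This policy is an arbitrary choice for illustration.
worker_limit = max(1, (os.cpu_count() or 1) // 2)

# Assumption: the variable is read as a positive integer when the build starts.
os.environ["QAIRT_MAX_BUILD_WORKERS"] = str(worker_limit)
print(f"QAIRT_MAX_BUILD_WORKERS={os.environ['QAIRT_MAX_BUILD_WORKERS']}")
```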

See also

[Configuring the Gen AI Builder](https://docs.qualcomm.com/doc/80-87189-2/topic/genai_builder_configuration.html#genai-builder-configuration) for the full list of configuration options, including transformation
options, compilation options, advanced features like LoRA and speculative decoding, and
[migration guidance](https://docs.qualcomm.com/doc/80-87189-2/topic/genai_migration.html#migration-from-notebooks) from notebook workflows.

The builder also uses a content-addressed cache to skip redundant work when `cache_root` is set
(as shown in Step 2). The cache persists intermediate artifacts across runs so builds can be
stopped and resumed. See [Understanding the Build Cache](https://docs.qualcomm.com/doc/80-87189-2/topic/genai_overview.html#genai-builder-cache) for details on cache layout, invalidation,
and debugging partial builds.
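
For readers unfamiliar with the term, the general idea behind content addressing is that a cache key is derived from the inputs themselves, so identical inputs map to the same entry and any change produces a new one. The toy sketch below illustrates only that idea. The function `cache_key`, the choice of SHA-256, and the way the key is composed are assumptions for illustration and do not describe the builder's actual cache layout.

```python
import hashlib
import json


def cache_key(stage: str, options: dict, input_bytes: bytes) -> str:
    """Derive a deterministic key from a stage name, its options, and its input."""
    digest = hashlib.sha256()
    digest.update(stage.encode("utf-8"))
    # sort_keys makes the key independent of dict insertion order.
    digest.update(json.dumps(options, sort_keys=True).encode("utf-8"))
    digest.update(input_bytes)
    return digest.hexdigest()


same_a = cache_key("compile", {"arn": [32, 128]}, b"model-bytes")
same_b = cache_key("compile", {"arn": [32, 128]}, b"model-bytes")
changed = cache_key("compile", {"arn": [32, 64]}, b"model-bytes")
print(same_a == same_b)   # identical inputs map to the same entry
print(same_a == changed)  # a changed option maps to a different entry
```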

## Step 4: Build the GenAIContainer

Once a target is set, call `build()` to build the Gen AI model into an LLM container object.

    llama_container: GenAIContainer = llama_builder.build()

Tip

To profile memory usage and execution time during the build process, see [Resource Profiler](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-resource-profiler.html#qairt-resource-profiler).

Note

The container holds everything needed to execute on the prepared target. It can be saved to disk and copied to
another location, where it can be loaded to resume operation. The save/load functionality is demonstrated in Step 7.

## Step 5: Set up an Android device

Connect your Android device via ADB and set the `ANDROID_SERIAL` environment variable.

Obtain the ADB device ID by running:

    adb devices

The output lists connected devices:

    List of devices attached
    abcd1234   device

Set `ANDROID_SERIAL` to the device ID shown:

    export ANDROID_SERIAL=abcd1234
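
If you prefer to pick the serial from a script, the `adb devices` output shown above can be parsed. This is a minimal sketch; the helper name `connected_serials` is hypothetical, and it only keeps rows whose state is `device`.

```python
def connected_serials(adb_output: str) -> list[str]:
    """Return serials of devices reported in the 'device' (ready) state."""
    serials = []
    for line in adb_output.splitlines():
        parts = line.split()
        # Device rows look like "<serial> <state>"; skip the header and blanks.
        if len(parts) >= 2 and parts[1] == "device":
            serials.append(parts[0])
    return serials


sample = "List of devices attached\nabcd1234   device\n"
print(connected_serials(sample))  # ['abcd1234']
```

In practice, the input could come from `subprocess.run(["adb", "devices"], capture_output=True, text=True).stdout`.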

If your device is connected to a remote machine, see
[Remote device not detected](https://docs.qualcomm.com/doc/80-87189-2/topic/genai_builder.html#remote-device-troubleshooting)
in the Troubleshooting section of this tutorial.

    android_serial = os.getenv("ANDROID_SERIAL")
    android_hostname = os.getenv("ANDROID_HOSTNAME")

    device_id = f"{android_serial}@{android_hostname}" if android_hostname else android_serial
    android_device = Device(identifier=device_id, type=DevicePlatformType.ANDROID)

## Step 6: Generate Text with Llama 3

The container generated in Step 4 is used to create an executor, which interfaces with the device
and performs inference. Each executor is customized to a target and a specific inference mode. In this example,
the executor is a *text-to-text executor*.

    from qairt.gen_ai_api.executors.gen_ai_executor import GenAIExecutor, GenerationExecutionResult

    llm: GenAIExecutor = llama_container.get_executor(android_device, clean_up=False)

    # Ensure the appropriate executor is returned
    assert isinstance(llm, T2TExecutor)

Below we define the prompt template and prompt according to the specification for
the Llama 3-8b chat variant.

    prompt_template = (
        "<|begin_of_text|>"
        "<|start_header_id|>{system}<|end_header_id|>{system_prompt}<|eot_id|>"
        "<|start_header_id|>{user}<|end_header_id|>{user_prompt}<|eot_id|>"
        "<|start_header_id|>{assistant}<|end_header_id|>"
    )
    prompt = prompt_template.format(
        system="system",
        system_prompt="Be helpful but try to limit answers to 40 words.",
        user="user",
        user_prompt="What can I do with a glass jar?",
        assistant="assistant",
    )

    # Generate text
    result: GenerationExecutionResult = llm.generate(prompt)
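
For prompts with a variable number of turns, the same layout can be produced by a small helper. The sketch below is illustrative only: `build_llama3_prompt` is a hypothetical function, and it reproduces the template used in this tutorial exactly rather than any other variant of the Llama 3 chat format.

```python
def build_llama3_prompt(messages: list[tuple[str, str]]) -> str:
    """Format (role, text) pairs using the same layout as this tutorial's template."""
    parts = ["<|begin_of_text|>"]
    for role, text in messages:
        parts.append(f"<|start_header_id|>{role}<|end_header_id|>{text}<|eot_id|>")
    # End with an open assistant header so the model writes the reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>")
    return "".join(parts)


prompt = build_llama3_prompt([
    ("system", "Be helpful but try to limit answers to 40 words."),
    ("user", "What can I do with a glass jar?"),
])
```

With these two messages, the result is the same string as the `prompt` built from `prompt_template` in this step.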

Print the generated text:

    print(result.generated_text)

The output should resemble the following:

    A glass jar can be used for various purposes:
    * Storage: for dry goods like flour, sugar, or coffee beans.
    * DIY projects: as a vase for flowers, a candle holder, or a planter.
    * Science experiments: as a homemade lava lamp, a density column, or a homemade thermometer.
    These are just a few examples of the many creative and practical uses of a glass jar.

Inspect the metrics by printing them to the console:

    print(result.metrics)

Example output:

    Timing (microseconds):

    Init = 1107749 us
    Prompt Processing Time = 6029437 us
    Token Generation Time = 13682340 us

    Tokens per second (toks/sec):

    Prompt Processing Rate = 177.33641052246094 toks/sec
    Token Generation Rate = 6.0662312507629395 toks/sec
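
As a sanity check on these numbers, the generation rate and time can be combined to estimate how many tokens were generated. The sketch below assumes the rate is simply tokens divided by elapsed time, which this tutorial does not state explicitly; the values are copied from the sample metrics.

```python
US_PER_SECOND = 1_000_000

token_generation_time_us = 13682340         # from the sample metrics
token_generation_rate = 6.0662312507629395  # toks/sec, from the sample metrics

# Assumption: rate = tokens / elapsed time, so tokens = rate * time.
estimated_tokens = token_generation_rate * token_generation_time_us / US_PER_SECOND
print(round(estimated_tokens))  # 83
```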

The executor pushes artifacts to the device. To remove them explicitly, call:

    llm.clean_environment()

## Step 7 (Optional): Save and Load the LLM Container

You can save the container and its associated artifacts, and subsequently load the container from a different environment.

    llama_container.save("./llama3_container", exist_ok=True)

To load the container later, for example from a different environment:

    from qairt.gen_ai_api.containers.llm_container import LLMContainer

    # Load a container
    container = LLMContainer.load("./llama3_container")
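
To move a saved container to another machine, it can be convenient to pack it into a single archive first. The sketch below assumes that `save()` writes a plain directory at the given path, which this tutorial does not confirm; `archive_container` is a hypothetical helper built on the standard library.

```python
import shutil
from pathlib import Path


def archive_container(container_dir: str, archive_base: str) -> str:
    """Pack a saved container directory into <archive_base>.tar and return its path."""
    src = Path(container_dir).resolve()
    if not src.is_dir():
        raise FileNotFoundError(f"{src} is not a directory")
    # root_dir/base_dir keep the top-level folder name inside the archive.
    return shutil.make_archive(
        archive_base, "tar", root_dir=str(src.parent), base_dir=src.name
    )
```

On the destination machine, `shutil.unpack_archive()` restores the directory, which can then be passed to `LLMContainer.load()`.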

## Troubleshooting

### Build Errors

- **“No space left on device” or build runs out of disk space**
    - Set `QAIRT_TMP_DIR` to a volume with at least 50 GB of free space.
      Temporary artifacts created during the build can consume significant disk space.

          export QAIRT_TMP_DIR=/path/to/large/volume/tmp

- **“Pretrained model path does not exist”**
    - Verify the path and ensure the required artifacts are present:
      `config.json`, `tokenizer.json`, and the ONNX model with its `.encodings` file.

- **Build runs out of memory**
    - Models with more than 7 billion parameters typically need 64+ GB RAM.
      If insufficient RAM is available, increase swap space. Use `cache_root`
      to enable stop/resume if the build is interrupted.

### Configuration Errors

- **Remote device not detected**
    - If your device is connected to a remote machine, `adb devices` on your local machine
      will not list it. Ensure an ADB connection is established on the remote machine first:

          # On the remote machine:
          adb -a nodaemon server start

          # On your local machine:
          adb -H <remote_machine_hostname> devices

      Once the device is listed, set the hostname environment variable:

          export ANDROID_HOSTNAME=<remote_machine_hostname>

- **“chipset:UNKNOWN” or unknown chipset warning**
    - The chipset string was not recognized. Use one of the supported chipset identifiers
      listed in the table above or refer to the full
      [Supported Snapdragon Devices](https://docs.qualcomm.com/doc/80-63442-10/topic/QNN_general_overview.html#supported-snapdragon-devices)
      list.

- **Options have no effect / “Cannot apply options: no compilation config exists”**
    - `set_compilation_options(options={...})` applies overrides to an existing compilation config.
      You must call `set_targets()` first to create the base config, then call
      `set_compilation_options()` to override specific fields.

- **“Native KV only supported for AR32, AR64, AR128 and AR256”**
    - When `native_kv=True`, auto-regression numbers must be a subset of `{32, 64, 128, 256}`.
      The default AR values with weight sharing enabled are `[1, 128]`, which includes AR=1.
      Set explicit AR values:

          builder.set_transformation_options(options={"arn": [32, 128]})
          builder.native_kv = True
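
The native KV constraint on AR values can also be checked in your own script before the build starts. This is a minimal sketch based only on the allowed set stated above; `check_native_kv_arn` is a hypothetical helper and not an SDK function.

```python
NATIVE_KV_ARNS = {32, 64, 128, 256}  # allowed set stated in the error message


def check_native_kv_arn(arn: list[int]) -> None:
    """Raise ValueError if any AR value is outside the set allowed with native KV."""
    unsupported = sorted(set(arn) - NATIVE_KV_ARNS)
    if unsupported:
        raise ValueError(f"AR values not supported with native KV: {unsupported}")


check_native_kv_arn([32, 128])     # passes silently
try:
    check_native_kv_arn([1, 128])  # the default values mentioned above
except ValueError as err:
    print(err)
```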

## Additional Tutorials

- [GGUF Inference on HTP](https://docs.qualcomm.com/doc/80-87189-2/topic/gguf_builder.html) – Build and deploy pre-quantized GGUF models.
- [Low-Rank Adaptation (LoRA) Tutorial](https://docs.qualcomm.com/doc/80-87189-2/topic/lora_tutorial.html) – Deploy models with LoRA adapters.
- [Speculative Decoding Tutorial](https://docs.qualcomm.com/doc/80-87189-2/topic/speculative_decoding_tutorial.html) – Enable LADE, SSD, or Eaglet speculative decoding.

## Guides

- [Configuring the Gen AI Builder](https://docs.qualcomm.com/doc/80-87189-2/topic/genai_builder_configuration.html#genai-builder-configuration) – Full configuration reference including transformation options,
compilation options, and advanced features.
- [HTP Backend Extensions](https://docs.qualcomm.com/doc/80-87189-2/topic/genai_backend_extensions.html#genai-backend-extensions) – HTP backend extensions JSON structure, loading, and serialization.
- [Migrating from Notebook Workflows](https://docs.qualcomm.com/doc/80-87189-2/topic/genai_migration.html#migration-from-notebooks) – Migration guidance from notebook workflows.
- [GenAIBuilderHTP](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-gen-ai-api-builders-htp.html#htp-builders) – API reference for the `GenAIBuilderHTP` class and model-specific builders.

Last Published: Jul 08, 2026
