# Gen AI Builder Overview

## What is the Gen AI Builder?

The Gen AI Builder is a Python API that automates step 2 of the typical three-step LLM deployment
workflow (quantize → **compile & package** → deploy). It takes a quantized ONNX model (produced
by step 1 of the [QNN model preparation notebooks](https://qpm.qualcomm.com/#/main/tools/details/Tutorial_for_Llama3)) and, with a single
`build()` call, produces a `GenAIContainer` that is ready for on-device inference.

    NOTEBOOK / CLI PIPELINE (manual)
    ============================================================
    
      AR/CL        Split       MHA2SHA      Convert      Quantize     LoRA       Context
      Convert  --> ONNX    --> Transform --> to DLC   --> DLC      --> Import --> Binary
                                                                                  Gen
    
      7 stages x (AR x CL x splits) = hundreds of CLI invocations

    GEN AI BUILDER API (automated)
    ============================================================
    
      Factory        Configure         builder.build()
      .create()  --> set_targets    --> +---------------------------+
                     native_kv         | All 7 stages automated    |
      Auto-detects   multi_graph       | in a single call          |
      model arch     lora_config       +---------------------------+
                                                 |
                                                 v
                                           GenAIContainer
                                       (ready for deployment)
    
      3 API calls replace the entire notebook
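To make the invocation count in the diagram concrete, the following sketch multiplies the seven stages by every AR x CL x split combination. The AR values, context lengths, and split count are illustrative assumptions rather than builder defaults, and `estimate_cli_invocations` is a hypothetical helper that applies the diagram's formula as an upper-bound estimate.

```python
from itertools import product

STAGES = 7  # stages shown in the manual pipeline diagram


def estimate_cli_invocations(ar_values, context_lengths, num_splits):
    """Upper-bound estimate: one invocation per stage per (AR, CL, split)."""
    combos = list(product(ar_values, context_lengths, range(1, num_splits + 1)))
    return STAGES * len(combos)


# Illustrative values only (assumed, not builder defaults).
print(estimate_cli_invocations([1, 128], [2048, 4096], 4))
```

Even this small configuration lands above one hundred invocations, which is the manual work a single `build()` call is meant to replace.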

### Supported Models

The factory auto-detects model architecture from `config.json`. Preconfigured builders exist for all
architectures listed in [`SupportedLLMs`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-gen-ai-api-gen-ai-builder-factory.html#qairt.gen_ai_api.gen_ai_builder_factory.SupportedLLMs), including
Llama, Qwen, Phi, Mistral, Baichuan, and others. Unsupported architectures fall back to a default
`GenAIBuilderHTP` with a warning.

**Note:** See the [Appendix](https://docs.qualcomm.com/doc/80-87189-2/topic/appendix.html#appendix) for verified model/platform combinations and architecture-specific notes.
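The detect-then-fall-back behavior described above can be pictured with a small sketch. It assumes a Hugging Face style `config.json` with an `architectures` list. The `KNOWN_PRESETS` mapping and `pick_preset` function are hypothetical stand-ins for illustration only; they are not the factory's real registry or API.

```python
import json
import warnings
from pathlib import Path

# Hypothetical registry: architecture name -> preset label. The real list
# lives in SupportedLLMs; these two entries are placeholders.
KNOWN_PRESETS = {
    "LlamaForCausalLM": "llama",
    "Qwen2ForCausalLM": "qwen2",
}
DEFAULT_PRESET = "GenAIBuilderHTP"  # default builder named in the text


def pick_preset(model_dir):
    """Choose a preset from the `architectures` field of config.json."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    architectures = config.get("architectures") or []
    for arch in architectures:
        if arch in KNOWN_PRESETS:
            return KNOWN_PRESETS[arch]
    # Unknown architecture: warn and fall back to the default builder.
    warnings.warn(f"Unsupported architecture {architectures}; using {DEFAULT_PRESET}")
    return DEFAULT_PRESET
```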

### What the Builder Does Automatically

**Note:** The builder expects a quantized ONNX model and its associated encodings file as input.
These are produced by step 1 of the QNN model preparation notebooks (the AIMET quantization
step). See the [notebook tutorials](https://qpm.qualcomm.com/#/main/tools/details/Tutorial_for_Llama3)
if you have not yet quantized your model.

When you call `build()`, the builder automates the following stages, which are manual steps in the notebooks:

1. **AR/CL conversion** – generates ONNX models for each AR x CL combination
2. **ONNX splitting** – partitions the model into N splits
3. **MHA2SHA transformation** – converts multi-head to single-head attention per split
4. **ONNX to DLC conversion** – converts each split to DLC, applying quantization overrides from the encodings file
5. **DLC quantization** – quantizes each DLC with `act_bitwidth=16` and `bias_bitwidth=32`
6. **LoRA graph building and import** – runs only when `lora_config` is set
7. **Context binary generation** – generates context binaries with weight sharing and the native KV format configuration

### Understanding the Build Cache

The builder uses a content-addressed cache to avoid redundant work across runs.

#### How caching works

Cache keys are SHA-256 hashes of the input configuration and model content. Each stage stores the
intermediate artifacts for that exact configuration in its own subdirectory. On subsequent runs, any
stage whose inputs are unchanged is skipped. Changing any option (targets, split count, AR numbers,
context lengths, and so on) produces a new hash and triggers a rebuild of the affected stages.
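As a rough illustration of content-addressed keys, the sketch below hashes a configuration dictionary together with the model bytes. The exact fields and serialization the builder feeds into SHA-256 are not documented here, so `stage_cache_key`, the option names, and the JSON serialization are assumptions.

```python
import hashlib
import json


def stage_cache_key(source_bytes, config):
    """Hypothetical key: SHA-256 over the serialized config plus the input bytes."""
    digest = hashlib.sha256()
    # Sorting keys makes the hash independent of dict insertion order.
    digest.update(json.dumps(config, sort_keys=True, default=str).encode("utf-8"))
    digest.update(source_bytes)
    return digest.hexdigest()


model = b"fake-onnx-bytes"  # stand-in for the model file content
base = stage_cache_key(model, {"num_splits": 3, "context_length": 4096})
same = stage_cache_key(model, {"context_length": 4096, "num_splits": 3})
changed = stage_cache_key(model, {"num_splits": 4, "context_length": 4096})

print(base == same)     # unchanged inputs map to the same key (cache hit)
print(base == changed)  # a changed option yields a different key (cache miss)
```

Because the key depends only on content, a rerun with identical inputs can locate the earlier artifacts without any timestamp or bookkeeping file.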

#### Cache directory layout

Artifacts are organized two levels deep: a human-readable top-level directory per model (slugified from
the model file name), then one subdirectory per build stage named `<operation>_<hash>`, where
`<operation>` is one of `arcl`, `transform`, `convert`, or `compile` and `<hash>` is the
first 32 hex characters of the stage’s SHA-256 content hash. The following example shows the layout for a
`llama32` model compiled into 3 splits with AR 128 and context length 4096:

    <cache_root>/
    └── llama32/                                  # human-readable model group (slugified model stem)
        ├── arcl_14a468361bda46c6ab0338f1b6.../    # AR/CL converted ONNX (when enabled)
        │   ├── ar1_cl4096.onnx
        │   ├── ar128_cl4096.onnx
        │   └── builder_cache_info.json
        ├── transform_19d471ebbd72475187007c.../   # per-split transformed ONNX (one subdir per split)
        │   ├── ar128_cl4096_3_of_3/
        │   │   ├── ar128_cl4096_3_of_3.onnx
        │   │   ├── ar128_cl4096_3_of_3.data
        │   │   └── ar128_cl4096_3_of_3.encodings
        │   ├── ...
        │   └── builder_cache_info.json
        ├── convert_cc7d4fcc8533e37b374e1ae3.../   # converted DLC for a split
        │   ├── ar128_cl4096_3_of_3.dlc
        │   └── builder_cache_info.json
        └── compile_1ce356f11209360ab98a88c1.../   # context binary for a split
            ├── ar1_cl4096_3_of_3.bin
            ├── ar1_cl4096_3_of_3_cache_info.json  # QNN context binary metadata
            └── builder_cache_info.json
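A layout like this is straightforward to walk from a script. The sketch below lists the stage directories of one model group and splits each name into its operation and hash prefix. It assumes the names follow the `<operation>_<hash>` pattern exactly, with 32 lowercase hex characters; `list_stage_dirs` is a hypothetical helper, not part of the builder API.

```python
import re
from pathlib import Path

# Assumed pattern: one of the four operations, an underscore, 32 hex characters.
STAGE_DIR = re.compile(r"^(arcl|transform|convert|compile)_([0-9a-f]{32})$")


def list_stage_dirs(model_group):
    """Return (operation, hash_prefix, path) for each stage directory."""
    stages = []
    for child in sorted(Path(model_group).iterdir()):
        match = STAGE_DIR.match(child.name)
        if child.is_dir() and match:
            stages.append((match.group(1), match.group(2), child))
    return stages
```

Calling `list_stage_dirs("<cache_root>/llama32")` on a real cache would return one tuple per stage directory; entries that do not match the pattern are ignored.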

Each stage directory contains a `builder_cache_info.json` sidecar for offline inspection of what
produced that cache entry. It has the following fields:

| Field | Description |
| --- | --- |
| `operation` | The build stage: `arcl`, `transform`, `convert`, or `compile`. |
| `source` | Path to the input artifact that was hashed for this stage. |
| `config` | The configuration options that were hashed, as a dictionary of stringified values (for inspection only; see the note below). |
| `hash` | The full 64-character SHA-256 hash (the directory name uses the first 32 characters). |
| `created_at` | ISO-8601 UTC timestamp of when the entry was written. |

For example, a `convert` stage sidecar records the converter options used to produce the DLC:

    {
      "operation": "convert",
      "source": ".../transform_19d471.../ar128_cl4096_3_of_3/ar128_cl4096_3_of_3.onnx",
      "config": {
        "export_format": "DLC_DEFAULT",
        "act_precision": "16",
        "weights_precision": "8",
        "per_channel_quantization": "True"
      },
      "hash": "cc7d4fcc8533e37b374e1ae38423b42e741168d613bb3602137466ae4f63ed2f",
      "created_at": "2026-06-10T04:47:10.673060+00:00"
    }
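A sidecar can be read and sanity-checked with the standard library alone. The sketch below loads `builder_cache_info.json`, confirms that the directory name matches the recorded operation and the first 32 characters of the hash, and parses the timestamp. `read_sidecar` is a hypothetical helper written against the fields in the table above.

```python
import json
from datetime import datetime
from pathlib import Path


def read_sidecar(stage_dir):
    """Load builder_cache_info.json and check it against the directory name."""
    stage_dir = Path(stage_dir)
    info = json.loads((stage_dir / "builder_cache_info.json").read_text())
    expected_name = f"{info['operation']}_{info['hash'][:32]}"
    if stage_dir.name != expected_name:
        raise ValueError(f"{stage_dir.name} does not match sidecar ({expected_name})")
    # created_at is an ISO-8601 string with a UTC offset.
    info["created_at"] = datetime.fromisoformat(info["created_at"])
    return info
```

A mismatch between the directory name and the recorded hash usually means the directory was renamed or copied by hand, so the helper raises rather than returning partial data.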

**Note:** The `config` values are stringified for human inspection; they are **not** a serialized object you
can load directly back into a config class. To reuse settings programmatically, read the values you
need and reconstruct the corresponding config object (see [Replaying a stage from the cache](https://docs.qualcomm.com/doc/80-87189-2/topic/genai_overview.html#genai-builder-cache-recompile)).
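One way to read values back is a best-effort conversion from strings to Python types before building the config object by hand. The sketch below assumes the values were produced with plain `str()`, which the source does not confirm; `coerce_value` is a hypothetical helper, and enum-like strings are deliberately left untouched.

```python
def coerce_value(text):
    """Best-effort conversion of a stringified sidecar value to a Python type."""
    if text in ("True", "False"):
        return text == "True"
    if text == "None":
        return None
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text  # enums, paths, and other strings stay as they are


# Values copied from the convert sidecar example above.
config = {
    "export_format": "DLC_DEFAULT",
    "act_precision": "16",
    "weights_precision": "8",
    "per_channel_quantization": "True",
}
typed = {key: coerce_value(value) for key, value in config.items()}
print(typed)
```

The typed values can then be passed as keyword arguments to whichever config class the stage expects, after checking each one against that class's documented fields.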

#### Invalidating the cache

To force a full rebuild (for example, after updating the SDK):

- Delete the stage directory (or the whole model group) for the affected configuration, or
- Point `cache_root` to a new location, or
- Modify any input that changes the content hash (model file, config options, etc.)
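The first option can be scripted. The sketch below deletes every cached directory for one operation inside a model group, relying only on the `<operation>_<hash>` naming described earlier. `invalidate_stage` is a hypothetical helper, not a builder API, and it removes files permanently.

```python
import shutil
from pathlib import Path


def invalidate_stage(model_group, operation):
    """Delete all cached directories for one operation; return the removed names."""
    removed = []
    for child in sorted(Path(model_group).glob(f"{operation}_*")):
        if child.is_dir():
            shutil.rmtree(child)  # permanent: the stage is rebuilt on the next run
            removed.append(child.name)
    return removed
```

For example, `invalidate_stage("<cache_root>/llama32", "compile")` would drop only the context binaries and leave the converted DLCs in place for reuse.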

#### Inspecting a partial build

When a build fails midway, the cache directory contains whatever artifacts were completed
before the failure. To see which splits compiled successfully:

    find <cache_root> -name '*.bin'
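The same check can be done in Python when you also want to know which splits are missing. The sketch below assumes binaries are named `ar<N>_cl<M>_<i>_of_<n>.bin`, as in the example layout; `missing_splits` is a hypothetical helper. A graph with no `.bin` file at all does not appear in the result, because its split count cannot be inferred.

```python
import re
from pathlib import Path

# Assumed file name pattern, taken from the example cache layout.
BIN_NAME = re.compile(r"^ar(\d+)_cl(\d+)_(\d+)_of_(\d+)\.bin$")


def missing_splits(cache_root):
    """Map (ar, cl) to the sorted split indices that have no .bin file."""
    found, totals = {}, {}
    for path in Path(cache_root).rglob("*.bin"):
        match = BIN_NAME.match(path.name)
        if not match:
            continue
        ar, cl, index, total = map(int, match.groups())
        found.setdefault((ar, cl), set()).add(index)
        totals[(ar, cl)] = total
    return {
        key: sorted(set(range(1, totals[key] + 1)) - found[key])
        for key in found
    }
```

An empty list for a key means every split for that graph has a context binary; a non-empty list names the splits to investigate.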

#### Replaying a stage from the cache

Each stage’s `builder_cache_info.json` records the configuration used to produce its output (as
stringified values), so a stage can be replayed independently without rerunning the whole builder.
The example below re-runs the compile stage for a single converted split using the recorded
settings. The same approach applies to the convert, transform, and arcl stages.

    import json
    import qairt
    from qairt.api.compiler.config import CompileConfig
    
    cache_root = "<cache_root>/llama32"
    
    # 1. Inspect the compile settings recorded for a previous build.
    with open(f"{cache_root}/compile_<hash>/builder_cache_info.json") as f:
        compile_info = json.load(f)
    print(compile_info["config"]["backend"])      # e.g. "BackendType.HTP"
    print(compile_info["config"]["soc_details"])  # e.g. "chipset:SM8850"
    
    # 2. Load a converted split (the .dlc produced by a convert stage).
    split = qairt.load(f"{cache_root}/convert_<hash>/my_split.dlc")
    
    # 3. Reconstruct a CompileConfig from the inspected settings and re-compile.
    config = CompileConfig(backend="HTP", soc_details="chipset:SM8850")
    compiled = qairt.compile(split, config=config)
    compiled.save("my_split_recompiled.bin")

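The `config` dictionary in the sidecar lists every option that was applied. To load it as a
pydantic object rather than constructing the config by hand, you can pass the dict to the
config class, provided the stringified values are ones the class can validate. The sidecar may
contain extra keys, so set `extra="allow"` on the model config before calling `model_validate`.
In pydantic v2, a change to `model_config` made after the class is defined takes effect only
once the model is rebuilt:

    CompileConfig.model_config["extra"] = "allow"
    CompileConfig.model_rebuild(force=True)  # rebuild the validator with the new setting
    config = CompileConfig.model_validate(compile_info["config"])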

Last Published: Jun 19, 2026
