# Gen AI Builder Overview

## What is the Gen AI Builder?

The Gen AI Builder is a Python API that automates step 2 of the typical three-step LLM deployment
workflow (quantize → **compile & package** → deploy). It takes a quantized ONNX model, produced
by step 1 of the [QNN model preparation notebooks](https://qpm.qualcomm.com/#/main/tools/details/Tutorial_for_Llama3), and with a single
`build()` call produces a `GenAIContainer` that is ready for on-device inference.

    NOTEBOOK / CLI PIPELINE (manual)
    ============================================================
    
      AR/CL        Split       MHA2SHA      Convert      Quantize     LoRA       Context
      Convert  --> ONNX    --> Transform --> to DLC   --> DLC      --> Import --> Binary
                                                                                  Gen
    
      7 stages x (AR x CL x splits) = hundreds of CLI invocations

    GEN AI BUILDER API (automated)
    ============================================================
    
      Factory        Configure         builder.build()
      .create()  --> set_targets    --> +---------------------------+
                     native_kv         | All 7 stages automated    |
      Auto-detects   multi_graph       | in a single call          |
      model arch     lora_config       +---------------------------+
                                                 |
                                                 v
                                           GenAIContainer
                                       (ready for deployment)
    
      3 API calls replace the entire notebook
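
To get a feel for the invocation count quoted in the diagram, the sketch below evaluates its formula, 7 stages x (AR x CL x splits), for one example configuration. The AR values, context lengths, and split count are illustrative assumptions, and the result is best read as an upper-bound estimate, since not every stage necessarily runs once per split.

```python
# Rough estimate of manual CLI invocations, using the formula from the diagram:
# 7 stages x (AR x CL x splits). All configuration values are illustrative.
STAGES = 7


def estimate_invocations(ar_values, context_lengths, num_splits):
    """Return stages x (number of ARs x number of CLs x number of splits)."""
    return STAGES * len(ar_values) * len(context_lengths) * num_splits


if __name__ == "__main__":
    # Hypothetical configuration: 2 AR values, 3 context lengths, 4 splits.
    print(estimate_invocations([1, 128], [1024, 2048, 4096], 4))  # 7 * 2 * 3 * 4 = 168
```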

### Supported Models

The factory auto-detects model architecture from `config.json`. Preconfigured builders exist for all
architectures listed in `SupportedLLMs`, including
Llama, Qwen, Phi, Mistral, Baichuan, and others. Unsupported architectures fall back to a default
`GenAIBuilderHTP` with a warning.
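
The detect-then-fall-back behavior described above can be pictured with a minimal sketch. It is not the factory's actual implementation: the registry contents, the builder names other than `GenAIBuilderHTP`, and the use of the `model_type` field (a Hugging Face style `config.json` convention) are all assumptions made for illustration.

```python
import json
import warnings
from pathlib import Path

# Hypothetical registry mapping the "model_type" field of config.json to a
# builder name. The real SupportedLLMs contents and builder names may differ.
BUILDER_REGISTRY = {
    "llama": "LlamaBuilder",
    "qwen2": "QwenBuilder",
    "phi3": "PhiBuilder",
    "mistral": "MistralBuilder",
}
DEFAULT_BUILDER = "GenAIBuilderHTP"  # default named in the text above


def select_builder(model_dir):
    """Pick a builder name from config.json, falling back to the default."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    model_type = config.get("model_type", "")
    if model_type in BUILDER_REGISTRY:
        return BUILDER_REGISTRY[model_type]
    # Unknown architecture: warn and use the default builder.
    warnings.warn(f"Unsupported architecture {model_type!r}; using {DEFAULT_BUILDER}")
    return DEFAULT_BUILDER
```

Returning a name rather than an object keeps the sketch independent of the SDK; a real factory would return a configured builder instance.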

**Note:** See the Appendix for verified model/platform combinations and architecture-specific notes.

### What the Builder Does Automatically

**Note:** The builder expects a quantized ONNX model and its associated encodings file as input.
These are produced by step 1 of the QNN model preparation notebooks (the AIMET quantization
step). See the [notebook tutorials](https://qpm.qualcomm.com/#/main/tools/details/Tutorial_for_Llama3)
if you have not yet quantized your model.

When you call `build()`, the builder automates the following stages that are manual in notebooks:

1. **AR/CL conversion** – generates ONNX models for each AR x CL combination
2. **ONNX splitting** – partitions the model into N splits
3. **MHA2SHA transformation** – converts multi-head to single-head attention per split
4. **ONNX to DLC conversion** – with quantization overrides from encodings
5. **DLC quantization** – `act_bitwidth=16`, `bias_bitwidth=32`
6. **LoRA graph building and import** – when `lora_config` is set
7. **Context binary generation** – with weight sharing and native KV format config
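
The stage order above, with LoRA import as the only conditional stage, can be sketched as a simple plan. The stage identifiers and the `plan_stages` helper are hypothetical names chosen for this illustration; they do not come from the builder's API.

```python
# Ordered stage names taken from the list above; the identifiers are illustrative.
STAGES = [
    "ar_cl_conversion",
    "onnx_splitting",
    "mha2sha_transformation",
    "onnx_to_dlc_conversion",
    "dlc_quantization",
    "lora_import",
    "context_binary_generation",
]


def plan_stages(lora_config=None):
    """Return the stages to run, skipping LoRA import when no lora_config is set."""
    return [s for s in STAGES if s != "lora_import" or lora_config is not None]
```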

### Understanding the Build Cache

The builder uses a content-addressed cache to avoid redundant work across runs.

#### How caching works

Cache keys are SHA-256 hashes of the input configuration and model content. Each hash directory
stores the intermediate artifacts for that exact configuration. On subsequent runs, any stage
whose inputs are unchanged is skipped. Changing any option (targets, split count, AR numbers,
context lengths, and so on) produces a new hash and triggers a rebuild of the affected stages.
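
A content-addressed key of this kind can be sketched with the standard `hashlib` module. This is a minimal illustration under stated assumptions, not the builder's actual key derivation: the option names in the example configuration are hypothetical, and the way configuration and model bytes are combined is one plausible choice.

```python
import hashlib
import json


def cache_key(config, model_bytes):
    """Return a SHA-256 hex digest over a canonical config encoding and model content."""
    digest = hashlib.sha256()
    # sort_keys gives a stable encoding, so equal configs hash identically.
    digest.update(json.dumps(config, sort_keys=True).encode("utf-8"))
    digest.update(b"\0")  # separator between config and model content
    digest.update(model_bytes)
    return digest.hexdigest()


if __name__ == "__main__":
    model = b"placeholder model content"
    # Hypothetical option names, used only to show how a change alters the key.
    base = {"num_splits": 3, "ar": [1, 128], "context_lengths": [4096]}
    print(cache_key(base, model) == cache_key(dict(base), model))  # True
    print(cache_key(base, model) == cache_key({**base, "num_splits": 4}, model))  # False
```

Because the key depends only on the inputs, an unchanged configuration maps to the same directory on every run, which is what lets a stage be skipped.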

#### Cache directory layout

Each build stage stores its outputs in a separate hash-keyed subdirectory. The following example
shows the layout for a model compiled into 3 splits with AR 128 and context length 4096:

    qwen3_5_cache/
    ├── <ar_hash>/
    │   ├── ar1_cl4096.onnx             # AR/CL converted models (when enabled)
    │   └── ar128_cl4096.onnx
    ├── <onnx_hash>/
    │   ├── ar128_cl4096_1_of_3.onnx   # Per-split ONNX after transformation
    │   ├── ar128_cl4096_2_of_3.onnx
    │   └── ar128_cl4096_3_of_3.onnx
    ├── <dlc_hash>/
    │   ├── ar128_cl4096_1_of_3.dlc    # Converted DLC for this split
    │   ├── ar128_cl4096_2_of_3.dlc
    │   └── ar128_cl4096_3_of_3.dlc
    └── <bin_hash>/
        ├── ar128_cl4096_1_of_3.bin    # Context binary for this split
        ├── ar128_cl4096_2_of_3.bin
        └── ar128_cl4096_3_of_3.bin
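
The per-split file names in this layout follow a regular pattern, which the following sketch reproduces. The pattern is inferred from the example listing only, and `split_artifact_names` is a hypothetical helper, so treat it as a reading aid rather than a guaranteed naming contract.

```python
def split_artifact_names(ar, context_length, num_splits, extension):
    """Return per-split file names such as ar128_cl4096_1_of_3.bin."""
    return [
        f"ar{ar}_cl{context_length}_{index}_of_{num_splits}.{extension}"
        for index in range(1, num_splits + 1)
    ]
```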

#### Invalidating the cache

To force a full rebuild (for example, after updating the SDK):

- Delete the hash directory for the affected configuration, or
- Point `cache_root` to a new location, or
- Modify any input that changes the content hash (model file, config options, etc.)
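
The first option, deleting a hash directory, can be scripted with the standard library. The sketch below assumes the layout shown earlier, where each hash directory is a direct child of the cache root; `invalidate` is a hypothetical helper, not part of the builder.

```python
import shutil
from pathlib import Path


def invalidate(cache_root, hash_dir_name):
    """Delete one hash directory under cache_root; return True if it existed."""
    root = Path(cache_root).resolve()
    target = (root / hash_dir_name).resolve()
    # Refuse to delete anything that is not a direct child of the cache root.
    if target.parent != root:
        raise ValueError(f"{target} is not directly inside {root}")
    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True
```

The parent check guards against accidentally removing a path outside the cache when the directory name comes from user input.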

#### Inspecting a partial build

When a build fails mid-way, the cache directory will contain whatever artifacts were completed
before the failure. To see which splits compiled successfully:

    ls cache_dir/*/*.bin
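
For builds with many splits, the same check can be done in Python so that missing splits are reported directly. This sketch assumes the `ar<AR>_cl<CL>_<i>_of_<N>.bin` naming seen in the cache layout example; `missing_splits` is a hypothetical helper, and it can only report on AR/CL combinations that produced at least one binary.

```python
import re
from pathlib import Path

# Matches names like ar128_cl4096_2_of_3.bin (pattern inferred from the layout example).
BIN_PATTERN = re.compile(r"^(ar\d+_cl\d+)_(\d+)_of_(\d+)\.bin$")


def missing_splits(cache_dir):
    """Map each AR/CL prefix to the sorted split indices that have no .bin file."""
    found, totals = {}, {}
    for path in Path(cache_dir).glob("*/*.bin"):
        match = BIN_PATTERN.match(path.name)
        if match:
            prefix, index, total = match.group(1), int(match.group(2)), int(match.group(3))
            found.setdefault(prefix, set()).add(index)
            totals[prefix] = total
    return {
        prefix: sorted(set(range(1, totals[prefix] + 1)) - done)
        for prefix, done in found.items()
    }
```

An empty list for a prefix means every split of that AR/CL combination has a context binary.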

Last Published: May 26, 2026
