# Gen AI Builder Overview

## What is the Gen AI Builder?

The Gen AI Builder is a Python API that automates step 2 of the typical three-step LLM deployment
workflow (quantize → **compile & package** → deploy). It takes a quantized ONNX model (produced
by step 1 of the [QNN model preparation notebooks](https://qpm.qualcomm.com/#/main/tools/details/Tutorial_for_Llama3)) and produces a
`GenAIContainer` ready for on-device inference with a single `build()` call.

```text
NOTEBOOK / CLI PIPELINE (manual)
============================================================

  AR/CL        Split       MHA2SHA       Convert      Quantize     LoRA       Context
  Convert  --> ONNX    --> Transform --> to DLC   --> DLC      --> Import --> Binary
                                                                              Gen

  7 stages x (AR x CL x splits) = hundreds of CLI invocations

GEN AI BUILDER API (automated)
============================================================

  Factory        Configure         builder.build()
  .create()  --> set_targets    --> +---------------------------+
                 native_kv          | All 7 stages automated    |
  Auto-detects   multi_graph        | in a single call          |
  model arch     lora_config        +---------------------------+
                                                  |
                                                  v
                                           GenAIContainer
                                       (ready for deployment)

  3 API calls replace the entire notebook
```
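In code, the three calls on the right-hand side of the diagram might look like the following. This is an illustrative sketch only: the factory, `set_targets`, and `build()` names come from the diagram, but the class name, argument shapes, and exact signatures here are assumptions, not the documented API (see "Configuring the Gen AI Builder" for the real interface).

```python
# Sketch only -- names and signatures are assumptions, not the documented API.
builder = GenAIBuilderFactory.create("path/to/quantized_model")  # 1. auto-detects architecture
builder.set_targets(["<target_platform>"])                       # 2. configure targets, native_kv, etc.
container = builder.build()                                      # 3. runs all 7 stages; returns the container
```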

The builder handles:

- **AR/CL conversion** – generates models for each auto-regression and context-length combination
- **ONNX splitting** – partitions the model into device-optimized splits
- **MHA2SHA transformation** – converts multi-head to single-head attention
- **ONNX to DLC conversion** – with quantization encodings applied
- **DLC quantization** – 16-bit activations, 32-bit biases
- **Context binary generation** – with weight sharing across AR/CL variants
- **LoRA import and speculative decoding** – when configured
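To see why the manual pipeline balloons, the `7 stages x (AR x CL x splits)` arithmetic from the diagram can be checked directly. The AR and context-length values below are illustrative, not defaults:

```python
from itertools import product

stages = 7                      # AR/CL convert, split, MHA2SHA, ..., context binary gen
ar_values = [1, 128]            # illustrative auto-regression lengths
context_lengths = [2048, 4096]  # illustrative context lengths
splits = 3                      # device-optimized model splits

# Every (AR, CL, split) combination needs its own pass through each stage.
variants = list(product(ar_values, context_lengths, range(splits)))
invocations = stages * len(variants)
print(invocations)  # 7 * (2 * 2 * 3) = 84 manual CLI invocations
```

Even this small configuration already requires 84 invocations; the builder collapses all of them into one `build()` call.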

### Supported Models

The factory auto-detects model architecture from `config.json`. Preconfigured builders exist for all
architectures listed in `SupportedLLMs`, including
Llama, Qwen, Phi, Mistral, Baichuan, and others. Unsupported architectures fall back to a default
`GenAIBuilderHTP` with a warning.
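A minimal sketch of the detection logic, assuming the factory reads the `architectures` field of `config.json`. The mapping below is hypothetical (the real lookup table lives in `SupportedLLMs`); only the fallback name `GenAIBuilderHTP` comes from the text above:

```python
import json
import tempfile
import warnings
from pathlib import Path

# Hypothetical mapping -- the real table lives in SupportedLLMs.
PRECONFIGURED = {
    "LlamaForCausalLM": "LlamaBuilder",
    "Qwen2ForCausalLM": "QwenBuilder",
}

def detect_builder(model_dir: str) -> str:
    """Pick a builder from config.json's `architectures` field, else fall back."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    arch = config.get("architectures", ["<unknown>"])[0]
    if arch in PRECONFIGURED:
        return PRECONFIGURED[arch]
    warnings.warn(f"{arch} is not preconfigured; falling back to GenAIBuilderHTP")
    return "GenAIBuilderHTP"

# Demo with a mock model directory.
model_dir = tempfile.mkdtemp()
(Path(model_dir) / "config.json").write_text(
    json.dumps({"architectures": ["LlamaForCausalLM"]})
)
print(detect_builder(model_dir))  # LlamaBuilder
```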

> **Note:** See the Appendix for verified model/platform combinations and architecture-specific notes.

## Understanding the Build Cache

The builder uses a content-addressed cache to avoid redundant work across runs.

### How caching works

Cache keys are SHA-256 hashes of the input configuration and model content. Each hash directory
stores the intermediate artifacts for that exact configuration. On subsequent runs, any stage
whose inputs are unchanged is skipped. Changing any option (targets, split count, AR numbers,
context lengths, and so on) produces a new hash, and the affected stages are rebuilt.
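The keying scheme can be sketched as follows: hash the canonicalized configuration together with the model bytes, so any change to either yields a new cache directory. This is a minimal illustration of content addressing, not the builder's exact hash recipe:

```python
import hashlib
import json

def cache_key(config: dict, model_bytes: bytes) -> str:
    """Content-addressed key: SHA-256 over canonical config JSON plus model content."""
    h = hashlib.sha256()
    h.update(json.dumps(config, sort_keys=True).encode())  # canonical key ordering
    h.update(model_bytes)
    return h.hexdigest()

base = {"targets": ["htp"], "splits": 3, "ar": [1, 128], "cl": [4096]}
key_a = cache_key(base, b"model-bytes")
key_b = cache_key({**base, "splits": 4}, b"model-bytes")  # one option changed
print(key_a != key_b)  # True -- a different hash directory, so affected stages rebuild
```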

### Cache directory layout

Each build stage stores its outputs in a separate hash-keyed subdirectory. The following example
shows the layout for a model compiled into 3 splits with AR 128 and context length 4096:

```text
qwen3_5_cache/
├── <ar_hash>/
│   ├── ar1_cl4096.onnx            # AR/CL converted models (when enabled)
│   └── ar128_cl4096.onnx
├── <onnx_hash>/
│   ├── ar128_cl4096_1_of_3.onnx   # Per-split ONNX after transformation
│   ├── ar128_cl4096_2_of_3.onnx
│   └── ar128_cl4096_3_of_3.onnx
├── <dlc_hash>/
│   ├── ar128_cl4096_1_of_3.dlc    # Converted DLC for this split
│   ├── ar128_cl4096_2_of_3.dlc
│   └── ar128_cl4096_3_of_3.dlc
└── <bin_hash>/
    ├── ar128_cl4096_1_of_3.bin    # Context binary for this split
    ├── ar128_cl4096_2_of_3.bin
    └── ar128_cl4096_3_of_3.bin
```

### Invalidating the cache

To force a full rebuild (for example, after updating the SDK):

- Delete the hash directory for the affected configuration, or
- Point `cache_root` to a new location, or
- Modify any input that changes the content hash (model file, config options, etc.)

### Inspecting a partial build

When a build fails midway, the cache directory contains whatever artifacts completed before
the failure. To see which splits compiled successfully:

```shell
ls cache_dir/*/*.bin
```
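Equivalently, from Python, a glob over the hash-keyed subdirectories finds the completed context binaries. The helper name and mock layout below are illustrative:

```python
import tempfile
from pathlib import Path

def completed_binaries(cache_dir):
    """List context binaries present in any hash-keyed subdirectory."""
    return sorted(Path(cache_dir).glob("*/*.bin"))

# Demo against a mock cache layout (the hash directory name is a placeholder).
root = Path(tempfile.mkdtemp())
(root / "abc123").mkdir()
(root / "abc123" / "ar128_cl4096_1_of_3.bin").touch()
(root / "abc123" / "ar128_cl4096_2_of_3.bin").touch()

print([p.name for p in completed_binaries(root)])
# → ['ar128_cl4096_1_of_3.bin', 'ar128_cl4096_2_of_3.bin']
```

Here only two of the three expected splits are present, so the build failed before the third context binary was generated.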

Last Published: May 08, 2026
