# Configuring the Gen AI Builder

This page documents all configuration options for the Gen AI Builder API. For a step-by-step
tutorial, see LLM Inference on HTP.

## Configuration Overview

The builder follows a layered configuration model. Only `create()` and `set_targets()` are
required – everything else has sensible defaults.

## Quick-Reference Table

| Method / Property | Default | Purpose |
| --- | --- | --- |
| `set_targets(["chipset:..."])` | *(required)* | Set target SoC for AOT compilation |
| `weight_sharing` | `True` | Share weights across AR/CL variants per split |
| `multi_graph` | `False` | Enable multiple context lengths [512, 1024, 2048, 3072, 4096] |
| `native_kv` | `False` | Enable native KV cache format (requires AR in {32, 64, 128, 256}) |
| `skip_ar_conversion` | `False` | Skip AR conversion; only vary context length |
| `encodings_path` | Auto-discovered | Override path to quantization encodings |
| `set_transformation_options()` | Auto-configured | Override model transformation settings |
| `set_compilation_options()` | From `set_targets()` | Override HTP compilation settings |
| `set_conversion_options()` | Auto-configured | Override ONNX-to-DLC conversion settings |
| `lora_config` | `None` | Enable LoRA adapter support |
| `speculative_config` | `None` | Enable speculative decoding (LADE, SSD, or Eaglet) |
| `attach_model_for_arn()` | Not set | Pin a specific ONNX model to an AR value |
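A typical configuration touches only a few of these knobs. The sketch below combines the required calls with some common convenience properties; the builder class name, the `create()` argument, and the chipset placeholder are illustrative assumptions, not exact API (see LLM Inference on HTP for a complete walkthrough):

```python
# Hedged sketch of a typical configuration. Only the method and property
# names come from this page; the create() call shape and model path are
# hypothetical placeholders.
builder = GenAiBuilder.create("path/to/model")  # hypothetical class/signature
builder.set_targets(["chipset:..."])            # fill in your target SoC
builder.multi_graph = True                      # context lengths [512, 1024, 2048, 3072, 4096]
builder.set_transformation_options(options={"arn": [32, 128]})
builder.native_kv = True                        # requires AR in {32, 64, 128, 256}
```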

## Convenience Properties

These properties control the most common configuration choices.

### weight\_sharing

When `True` (default), the builder generates multiple AR variants (default `[1, 128]`)
and shares weights across them in each compiled split. This reduces binary size at the cost
of slightly more complex compilation.

```python
builder.weight_sharing = True   # Default: ARN = [1, 128]
```

To disable weight sharing, first restrict AR to a single value; weight sharing only applies when there are multiple AR variants to share weights across:

```python
builder.set_transformation_options(options={"arn": [128]})
builder.weight_sharing = False
```

### multi\_graph

When `True`, the builder generates context binaries for multiple context lengths:
`[512, 1024, 2048, 3072, 4096]`. When `False` (default), only `[4096]` is used.

```python
builder.multi_graph = True  # Context lengths: [512, 1024, 2048, 3072, 4096]
```

Note

For finer control over context lengths, use `set_transformation_options(options={"context_length": [...]})`
instead.
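For example, a sketch that restricts the build to two context lengths (the values here are illustrative):

```python
# Override the context lengths directly instead of using multi_graph.
builder.set_transformation_options(options={"context_length": [2048, 4096]})
```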

### native\_kv

When `True`, enables native KV cache format optimization. This also automatically sets
`permute_kv_cache_io=True` in the MHA2SHA transformation.

```python
builder.set_transformation_options(options={"arn": [32, 128]})
builder.native_kv = True
```

## Transformation Options

Transformation options control model structure (splitting, AR/CL variants, attention conversion).

### Usage

```python
# Option A: Individual overrides via options dict (recommended)
builder.set_transformation_options(options={
    "arn": [32, 128],
    "context_length": [2048, 4096, 6144, 8192],
    "split.num_splits": 4,
    "split.split_embedding": True,
})

# Option B: Full config object (advanced)
from qairt.api.transforms.model_transformer_config import (
    SplitModelConfig,
    ModelTransformerConfig,
)

split_config = SplitModelConfig(split_embedding=True, num_splits=4)
config = ModelTransformerConfig(split_model=split_config)
builder.set_transformation_options(config=config)
```

### Transformation Keys and Defaults

The table below lists all supported `options` keys with their defaults. For full type
information and descriptions of each key, see
`set_transformation_options()`.

| Key | Default |
| --- | --- |
| `arn` | `[1, 128]` (with weight\_sharing) |
| `context_length` | `[4096]` (or multi\_graph defaults) |
| `split.num_splits` | Auto-calculated |
| `split.split_embedding` | `True` |
| `split.split_lm_head` | Varies by builder |
| `mha2sha.permute_kv_cache_io` | `False` (auto when native\_kv) |
| `mha2sha.m2s_additional_start_points` | `[]` |
| `amoe.overridden_subselection` | Not set (MoE models only) |
| `amoe.remove_op_predicate` | Not set (MoE models only) |

#### How Split Count is Determined

When `split.num_splits` is not set explicitly, the builder auto-calculates it from the total
model parameter count. With `split_embedding` at its default of `True` and `split_lm_head`
enabled (the default for most builders), the embedding layer and LM head are each placed in
their own split. The remaining decoder layers are then divided into splits of approximately
**2 GB** each:

```
num_splits = 3 + model_params // 2 GB
```

The base count of **3** accounts for:

1. **Embedding split** – the token-embedding layer
2. **LM-head split** – the language-model head layer
3. **First decoder split** – at least one split for decoder layers

Each additional 2 GB of parameters adds one more decoder split. For example, a 7B-parameter
model yields `3 + 7 // 2 = 6` splits.
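As a plain-Python illustration of this heuristic (a sketch of the rule described above, not the builder's actual implementation):

```python
# Sketch of the auto-split heuristic; rounding and size accounting in the
# real builder may differ.
TWO_GB = 2 * 1024**3

def estimate_num_splits(total_params: int) -> int:
    # 3 = embedding split + LM-head split + first decoder split
    return 3 + total_params // TWO_GB

print(estimate_num_splits(7_000_000_000))  # -> 6 splits for a 7B model
```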

#### Skipping Transformations

By default, the builder applies all relevant transformations (AR/CL conversion, model splitting,
MHA2SHA, and MoE adaptation for expert models). You can selectively disable individual
transformations for pre-transformed models, debugging, or custom workflows.

**Skip AR/CL Conversion**

Use the `skip_ar_conversion` property to keep the model at its original sequence length
without generating AR variants:

```python
builder.skip_ar_conversion = True
```

This inserts a sentinel value `0` into the AR list, which is resolved at build time from the
model’s sequence length. Equivalently:

```python
builder.set_transformation_options(options={"arn": [0]})
```

**Skip Model Splitting**

To produce a single unsplit model, set `num_splits=1` and disable embedding/LM-head extraction:

```python
builder.set_transformation_options(options={
    "split.num_splits": 1,
    "split.split_embedding": False,
    "split.split_lm_head": False,
})
```

**Skip MHA2SHA**

To disable the Multi-Head Attention to Single-Head Attention conversion, pass a full config with
`mha_config=None`. This requires the `config=` parameter since the options dict does not
support disabling an entire transformation:

```python
from qairt.api.transforms.model_transformer_config import (
    ModelTransformerConfig,
    ARn_ContextLengthConfig,
    SplitModelConfig,
    MhaConfig,
)

builder.set_transformation_options(config=ModelTransformerConfig(
    arn_cl_options=ARn_ContextLengthConfig(),
    split_model=SplitModelConfig(num_splits=4, split_embedding=True, split_lm_head=True),
    mha_config=None,  # disable MHA2SHA
))
```

**Skip MoE Adaptation**

MoE adaptation is only auto-enabled for models with an expert configuration (detected from
`config.json`). To disable it on an MoE model, pass a full config with `adapt_moe=None`:

```python
# Imports as in the previous example.
builder.set_transformation_options(config=ModelTransformerConfig(
    arn_cl_options=ARn_ContextLengthConfig(),
    split_model=SplitModelConfig(num_splits=4, split_embedding=True, split_lm_head=True),
    mha_config=MhaConfig(),
    adapt_moe=None,  # disable MoE adaptation
))
```

## Compilation Options

Compilation options control the HTP backend settings used during Ahead-of-Time compilation.

The builder exposes **two paths** for setting compilation options:

### Path A: Convenience Dict (Common Settings)

For the most commonly adjusted fields, use the `options` dict. This applies overrides
on top of the config created by `set_targets()`.

```python
builder.set_compilation_options(options={
    "graphs.vtcm_size_in_mb": 8,
    "graphs.hvx_threads": 4,
    "graphs.optimization_type": 3,
    "devices.cores.perf_profile": "burst",
    "context.extended_udma": True,
})
```

Important

`set_targets()` must be called **before** `set_compilation_options(options={...})`.
The options dict modifies the config that `set_targets()` creates.

The table below lists all supported convenience keys with their defaults. For full type
information and descriptions, see
`set_compilation_options()`.

| Key | Default |
| --- | --- |
| `graphs.vtcm_size_in_mb` | `0` (device max) |
| `graphs.vtcm_size` | `0` (device max, in bytes) |
| `graphs.hvx_threads` | `0` (backend default) |
| `graphs.optimization_type` | `3` (from set\_targets) |
| `devices.cores.perf_profile` | `"burst"` (from set\_targets) |
| `context.extended_udma` | `False` |

### Path B: Full CompileConfig or Backend Extensions JSON

For settings **not covered** by the convenience dict (such as `fp16_relaxed_precision`,
`rpc_control_latency`, `pd_session`, `mem_type`, or `share_resources`), you need a full
`CompileConfig` object. There are two ways to obtain one:

1. **Load an existing backend extensions JSON file** using
`CompileConfig.from_backend_extensions()` or `populate_from_backend_extensions()`.
2. **Build one from Python** using the HTP config classes directly.

See HTP Backend Extensions for the JSON structure, Python construction examples,
the full list of HTP configuration classes, and round-trip serialization.
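As a sketch, loading an existing backend-extensions file might look like the following. The file name is a placeholder, the import for `CompileConfig` is omitted here (see HTP Backend Extensions), and passing the resulting object via a `config=` parameter is assumed by analogy with `set_transformation_options()`:

```python
# Sketch: build a CompileConfig from a backend-extensions JSON file.
# "htp_extensions.json" is a placeholder path.
config = CompileConfig.from_backend_extensions("htp_extensions.json")
builder.set_compilation_options(config=config)  # config= assumed, by analogy
```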

## Conversion Options

Conversion options control the ONNX-to-DLC conversion step. The builder auto-configures sensible
defaults (`act_precision=16`, `bias_precision=32`), so most builds do not need to call
`set_conversion_options()` at all.
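If you do need to override these defaults, a hedged sketch follows; the options-dict form and key names are assumptions inferred from the defaults quoted above:

```python
# Sketch: override conversion precision defaults. The options-dict form is
# assumed to mirror set_transformation_options().
builder.set_conversion_options(options={
    "act_precision": 16,   # activation precision in bits (default)
    "bias_precision": 32,  # bias precision in bits (default)
})
```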

See also

Advanced Features – LoRA adapters, speculative decoding, and attaching custom
models for specific AR values.
