# ONNX Optimizer Overview

The QAIRT ONNX Optimizer is a Python framework for transforming and optimizing
ONNX models before passing them to the QAIRT converter. It accepts both
floating-point and quantized ONNX models as input (including
ONNX+encodings and QDQ ONNX, the latter in beta) and always produces an ONNX
model as output.

For the API reference see [Functions](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-api.html#qairt-optimizer-api) (high-level functions —
recommended starting point) and [Classes & Passes](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt-optimizer-classes) (low-level
[`GraphContext`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt.optimizer.onnx.GraphContext) and individual passes — for
custom pipelines). Worked examples live in [ONNX Optimizer Examples](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-examples.html#qairt-optimizer-examples).

## Terminology

- AR
    - **Auto-Regressive length** (also called *sequence length*): the number of input
      tokens processed in a single forward pass. `AR=1` is the decode phase
      (one new token per step); `AR=N` (N > 1) is the prefill phase (processing
      the initial prompt of N tokens).

- CL
    - **Context Length**: the maximum total number of tokens that can be stored in the
      KV-cache. This determines how long the model’s “memory” is during generation.

- MHA
    - **Multi-Head Attention**: the standard attention block used in
      Transformer models, where queries, keys, and values are projected and
      reshaped into `H` parallel heads computed as a single batched
      operation. Efficient on GPUs but generally slower on Qualcomm NPUs
      compared to the equivalent SHA form.

- SHA
    - **Single-Head Attention**: an equivalent rewrite of an [MHA](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-overview.html#term-MHA)
      block in which each head is materialised as its own per-head sub-graph
      (queries/keys/values sliced along the head dimension). Mathematically
      identical to MHA but typically much faster to execute on Qualcomm NPUs.
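The claim that SHA is mathematically identical to MHA can be checked on a small example. The sketch below is plain Python with no QAIRT or ONNX dependency; the helper names (`attention`, `mha`, `sha`) are illustrative only, and the code starts from already-projected queries, keys, and values. It shows the arithmetic equivalence, not how `convert_mha_to_sha()` rewrites a graph.

```python
import math
import random


def matmul(a, b):
    """Multiply two row-major matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for one head; inputs are [tokens][head_dim]."""
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose keys to [head_dim][tokens]
    weights = [softmax([s * scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)


def mha(q, k, v, num_heads):
    """MHA form: reshape to [heads][tokens][head_dim], then map over the head axis."""
    d = len(q[0]) // num_heads

    def split(x):
        return [[row[h * d:(h + 1) * d] for row in x] for h in range(num_heads)]

    batched = [attention(qh, kh, vh) for qh, kh, vh in zip(split(q), split(k), split(v))]
    # Merge heads back into [tokens][heads * head_dim].
    return [sum((head[t] for head in batched), []) for t in range(len(q))]


def sha(q, k, v, num_heads):
    """SHA form: each head is computed from its own slice, then concatenated."""
    d = len(q[0]) // num_heads
    out = [[] for _ in q]
    for h in range(num_heads):
        cols = slice(h * d, (h + 1) * d)
        head_out = attention([r[cols] for r in q], [r[cols] for r in k], [r[cols] for r in v])
        for t, row in enumerate(head_out):
            out[t].extend(row)
    return out


random.seed(0)
TOKENS, HEADS, HEAD_DIM = 4, 2, 3
q, k, v = (
    [[random.uniform(-1, 1) for _ in range(HEADS * HEAD_DIM)] for _ in range(TOKENS)]
    for _ in range(3)
)
mha_out = mha(q, k, v, HEADS)
sha_out = sha(q, k, v, HEADS)
max_diff = max(abs(a - b) for ra, rb in zip(mha_out, sha_out) for a, b in zip(ra, rb))
```

Both forms perform the same per-head computation; they differ only in how the head dimension is laid out, which is why the rewrite can change execution speed without changing results.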

## When to Use the Optimizer

The optimizer is usually required for LLM/LVM models. Running it on other
models is also recommended when any of the situations listed below applies:

Important

MHA→SHA conversion is strongly recommended for any model that contains
attention layers (Transformer-based LLMs, vision transformers, etc.).
Running attention models without this transformation on Qualcomm NPUs will
typically produce significantly lower performance.

- The model contains attention layers (MHA) → apply MHA→SHA conversion via
[`convert_mha_to_sha()`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-api.html#qairt.optimizer.onnx.convert_mha_to_sha).
**Highly recommended for all attention-based models.**
- The model was quantized at a specific sequence/context length and you need to deploy
at different lengths (for example, longer context window) → apply AR/CL rewriting via
[`change_seq_length()`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-api.html#qairt.optimizer.onnx.change_seq_length),
[`change_context_length()`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-api.html#qairt.optimizer.onnx.change_context_length), or
[`change_seq_and_context_length()`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-api.html#qairt.optimizer.onnx.change_seq_and_context_length).
- The model is an MoE (Mixture-of-Experts) architecture → apply MoE adaptation via
[`adapt_moe()`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-api.html#qairt.optimizer.onnx.adapt_moe).
- The LLM produces a compiled binary too large for the device’s execution
partitions → apply model splitting via
[`split_llm()`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-api.html#qairt.optimizer.onnx.split_llm). Each split is compiled into its
own binary, and all splits are executed as a pipeline on the same device.

Skipping the optimizer when one of the above applies will typically result in
sub-optimal performance or conversion failures.
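The checklist above can be written down as a small lookup, which may help when scripting a preparation flow. This is only a sketch: `ModelTraits` and `recommended_functions` are hypothetical names that are not part of the QAIRT API, and the function returns the *names* of the high-level functions listed above rather than calling them. The returned order mirrors the list and is not a prescribed execution order.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModelTraits:
    """Hypothetical checklist of model properties (not a QAIRT class)."""
    has_attention: bool = False
    needs_new_seq_length: bool = False
    needs_new_context_length: bool = False
    is_moe: bool = False
    binary_too_large: bool = False


def recommended_functions(traits: ModelTraits) -> List[str]:
    """Map each situation in the list above to the function it points to."""
    names: List[str] = []
    if traits.has_attention:
        names.append("convert_mha_to_sha")
    # The combined function covers the case where both lengths must change.
    if traits.needs_new_seq_length and traits.needs_new_context_length:
        names.append("change_seq_and_context_length")
    elif traits.needs_new_seq_length:
        names.append("change_seq_length")
    elif traits.needs_new_context_length:
        names.append("change_context_length")
    if traits.is_moe:
        names.append("adapt_moe")
    if traits.binary_too_large:
        names.append("split_llm")
    return names


# Example: an attention-based LLM that must be deployed at a longer context.
llm = ModelTraits(has_attention=True, needs_new_context_length=True)
plan = recommended_functions(llm)
```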

Apply the optimizer before [`qairt.convert()`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-core-api.html#qairt.convert).  A minimal end-to-end flow looks like:

    # 1. Optimize
    from qairt.optimizer.onnx import GraphContext, convert_mha_to_sha
    
    ctx = GraphContext.from_files("model.onnx", "model.encodings")
    convert_mha_to_sha(ctx)
    exported = ctx.export("./optimized", prefix="model_sha")
    
    # 2. Convert — Python API
    import qairt
    
    model = qairt.convert(
        exported.onnx_path,
        encodings=exported.encodings_path,
    )
    
    # Equivalent CLI:
    # qairt-converter -i ./optimized/model_sha.onnx \
    #                 --quantization_overrides ./optimized/model_sha.encodings

## Custom Pass Framework

The custom pass framework is aimed at users who need finer control over the
optimization pipeline, want to author their own passes and compose them with
(or without) the built-in passes, or want to understand the optimizer’s
internals. It is built around two concepts:

1. [`GraphContext`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt.optimizer.onnx.GraphContext) — the central object that
holds the ONNX IR model together with all of its associated metadata
(quantization encodings, LoRA safetensors, updatable tensor names,
naming policy, axis denotations). Every pass receives and modifies a
single `GraphContext`; metadata is never threaded through call
signatures.
2. **Passes** — modular, composable transformation units. Each pass derives
from [`BasePass`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt.optimizer.onnx.passes.BasePass) (or
[`BasePredicatePass`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt.optimizer.onnx.passes.BasePredicatePass)), implements
`apply(ctx)`, and returns the number of modifications it performed.
Each pass is parameterised by its own typed `Config` dataclass, giving
every transformation a self-contained, discoverable contract.

This shape — small typed passes around a single shared context — is what
makes the framework easy to extend, compose, and verify.
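The shape described above (one shared context, a pass with its own typed `Config`, and `apply(ctx)` returning a modification count) can be illustrated with stand-in classes. Everything in this sketch is hypothetical: `ToyContext` and `RenameOpPass` are not QAIRT classes, the nodes are plain dictionaries rather than ONNX IR objects, and the op names are placeholders. The real base classes and their signatures are documented in the Classes & Passes reference.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyContext:
    """Stand-in for a graph context: nodes plus metadata that travels with them."""
    nodes: List[Dict[str, str]]
    encodings: Dict[str, float] = field(default_factory=dict)


class RenameOpPass:
    """Toy pass that rewrites every node of one op type to another."""

    @dataclass
    class Config:
        """Typed configuration owned by this pass."""
        source_op: str
        target_op: str

    def __init__(self, config: "RenameOpPass.Config") -> None:
        self.config = config

    def apply(self, ctx: ToyContext) -> int:
        """Modify ctx in place and return how many nodes were changed."""
        changed = 0
        for node in ctx.nodes:
            if node["op_type"] == self.config.source_op:
                node["op_type"] = self.config.target_op
                changed += 1
        return changed


ctx = ToyContext(
    nodes=[
        {"name": "n0", "op_type": "OldOp"},
        {"name": "n1", "op_type": "Gemm"},
        {"name": "n2", "op_type": "OldOp"},
    ],
    encodings={"n0": 0.05},
)
rename = RenameOpPass(RenameOpPass.Config(source_op="OldOp", target_op="NewOp"))
first_run = rename.apply(ctx)   # two nodes match on the first run
second_run = rename.apply(ctx)  # nothing left to rewrite on the second run
```

Because the metadata lives on the context object, the pass signature stays the same no matter how much metadata accompanies the graph.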

### Extensibility

- **Small authoring contract.** Implement
[`BasePass`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt.optimizer.onnx.passes.BasePass) (one `apply(ctx)` method)
for a free-form transformation, or
[`BasePredicatePass`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt.optimizer.onnx.passes.BasePredicatePass) (`match(graph,
node)` + `rewrite(graph, node, match_info)`) for pattern-based
rewrites. Iteration, rewrite counting, and pre/post hooks are inherited.
See [Writing and Integrating a Custom Pass](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-examples.html#example-custom-pass) for a complete worked example.
- **Metadata propagation done for you.** Encodings, safetensors, LoRA
tensor names, and axis denotations all live on
[`GraphContext`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt.optimizer.onnx.GraphContext). When a pass replaces tensors,
[`mark_value_as_copy()`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt.optimizer.onnx.passes.BasePass.mark_value_as_copy) and
[`mark_value_as_slice()`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt.optimizer.onnx.passes.BasePass.mark_value_as_slice) carry
the relevant encodings forward — no manual bookkeeping.
- **Operates on standard ONNX IR.** Passes work directly with the ONNX IR
primitives a customer is already familiar with, not on a wrapper or
custom representation.
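The division of labour in a predicate-style pass, where the base class owns iteration and counting while the subclass supplies only `match` and `rewrite`, can be sketched as follows. `ToyPredicatePass` and `RemoveIdentity` are hypothetical stand-ins written for this illustration, the graph is a plain list of dictionaries rather than ONNX IR, and the toy `apply` takes the graph directly instead of a `GraphContext`.

```python
from typing import Any, Dict, List, Optional

Node = Dict[str, Any]


class ToyPredicatePass:
    """Stand-in base class: owns the loop and the rewrite counter."""

    def match(self, graph: List[Node], node: Node) -> Optional[dict]:
        raise NotImplementedError

    def rewrite(self, graph: List[Node], node: Node, match_info: dict) -> None:
        raise NotImplementedError

    def apply(self, graph: List[Node]) -> int:
        rewrites = 0
        for node in list(graph):  # iterate over a snapshot; rewrite may edit graph
            match_info = self.match(graph, node)
            if match_info is not None:
                self.rewrite(graph, node, match_info)
                rewrites += 1
        return rewrites


class RemoveIdentity(ToyPredicatePass):
    """Subclass supplies only the pattern and the rewrite."""

    def match(self, graph, node):
        if node["op_type"] == "Identity":
            return {"src": node["inputs"][0], "dst": node["outputs"][0]}
        return None

    def rewrite(self, graph, node, match_info):
        # Point every consumer of the Identity output at the Identity input.
        for other in graph:
            other["inputs"] = [
                match_info["src"] if name == match_info["dst"] else name
                for name in other["inputs"]
            ]
        graph.remove(node)


graph = [
    {"name": "id0", "op_type": "Identity", "inputs": ["x"], "outputs": ["y"]},
    {"name": "id1", "op_type": "Identity", "inputs": ["y"], "outputs": ["z"]},
    {"name": "relu0", "op_type": "Relu", "inputs": ["z"], "outputs": ["out"]},
]
removed = RemoveIdentity().apply(graph)
```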

### Composability

- **Drop in anywhere.** A custom pass can be mixed freely with built-in
passes or invoked between calls to the high-level functions; the only
shared object is [`GraphContext`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt.optimizer.onnx.GraphContext).
- **Order-independent.** Each pass reads and writes only the parts of the
graph it needs, so passes can be reordered, repeated, or interleaved
with other custom logic without special setup.
- **Typed configuration.** Each pass owns a dataclass `Config` with
explicit fields, so customer-built pipelines are configured the same way
built-in passes are — no ad-hoc keyword threading.

### Verifiability

- **Rewrite counts as a contract.** Every pass returns the number of
nodes it modified. Zero means the pass did nothing on this model; a nonzero
count `N` is the first thing to inspect when something downstream changes.
- **Per-pass isolation.** Passes share no global state, so any single
pass can be applied alone and the result exported via
[`export()`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt.optimizer.onnx.GraphContext.export) for inspection in
Netron / ORT / Python.
- **Built-in numerical checking.** The high-level functions accept
`validate=True` to compare ORT outputs before and after. The same
checker (`qairt.optimizer.onnx.validation.ort_accuracy_checker`) can
be called between any two passes in a custom pipeline to confirm a
customer-authored transformation preserves model semantics.
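The verification workflow described above (record each pass's rewrite count, then compare outputs before and after) can be sketched on a toy model. This example does not use ONNX Runtime or `ort_accuracy_checker`, whose signature is not shown on this page; the "model" is a hypothetical list of scalar steps, and `run_pipeline`, `fuse_adds`, and `drop_mul_by_one` are illustrative names only.

```python
import copy
from typing import Callable, List, Sequence, Tuple

Step = Tuple[str, float]  # ("add", c) or ("mul", c)


def run(model: List[Step], x: float) -> float:
    """Evaluate the toy model on one scalar input."""
    for op, c in model:
        x = x + c if op == "add" else x * c
    return x


def fuse_adds(model: List[Step]) -> int:
    """Toy pass: merge adjacent add steps in place; returns the rewrite count."""
    rewrites, i = 0, 0
    while i + 1 < len(model):
        if model[i][0] == "add" and model[i + 1][0] == "add":
            model[i:i + 2] = [("add", model[i][1] + model[i + 1][1])]
            rewrites += 1
        else:
            i += 1
    return rewrites


def drop_mul_by_one(model: List[Step]) -> int:
    """Toy pass: remove multiplications by exactly 1.0."""
    before = len(model)
    model[:] = [step for step in model if step != ("mul", 1.0)]
    return before - len(model)


def max_abs_diff(before: List[Step], after: List[Step], samples: Sequence[float]) -> float:
    return max(abs(run(before, x) - run(after, x)) for x in samples)


def run_pipeline(
    model: List[Step],
    passes: List[Callable[[List[Step]], int]],
    samples: Sequence[float],
    tol: float = 1e-9,
) -> List[Tuple[str, int]]:
    """Apply passes in order, checking outputs after each one."""
    report = []
    for pass_fn in passes:
        snapshot = copy.deepcopy(model)  # reference copy for the comparison
        count = pass_fn(model)
        diff = max_abs_diff(snapshot, model, samples)
        if diff > tol:
            raise ValueError(f"{pass_fn.__name__} changed outputs by {diff}")
        report.append((pass_fn.__name__, count))
    return report


model = [("add", 1.0), ("mul", 1.0), ("add", 2.0), ("mul", 3.0)]
report = run_pipeline(model, [drop_mul_by_one, fuse_adds], samples=[0.0, 1.0, -2.5])
```

Checking after every pass rather than once at the end localises a semantic change to the pass that introduced it, and the per-pass counts in `report` show which passes actually touched the model.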

See [Classes & Passes](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt-optimizer-classes) for the full reference of
[`GraphContext`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt.optimizer.onnx.GraphContext), the built-in passes, and the
base classes you inherit from when authoring custom passes.

Last Published: Jun 19, 2026
