# Functions

API reference for the high-level `qairt.optimizer.onnx` functions. These
wrap the underlying passes and handle configuration, encoding propagation,
and cleanup automatically.

For background — what the optimizer is, when to apply each function, and an
end-to-end example flow — see [ONNX Optimizer Overview](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-overview.html#qairt-optimizer-overview). For
pass-level control or custom pipelines, see [Classes & Passes](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt-optimizer-classes).

## Sequence and Context Length

- qairt.optimizer.onnx.change\_seq\_length(*ctx: GraphContext*, *new\_seq\_length: int*, *axis\_denotation\_config: Optional[AxisDenotationConfig] = None*) → GraphContext

    - Change the sequence length ([AR](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-overview.html#term-AR)) of an LLM ONNX model.

Rewrites the AR axis throughout the graph — including all input/output
shapes, internal tensor shapes, and shape-dependent constants — so the
model can be deployed at a different sequence length than the one it was
quantized for.  No re-quantization is required.

Common use cases:

- **Decode → prefill rewrite:** convert an `AR=1` (decode) model into an
`AR=N` (prefill) model, or vice versa, without re-running calibration.
- **Variable-length deployment:** produce an `AR=128` model from an
`AR=64` model when the runtime needs a longer prefill window.

The function modifies `ctx` in-place.  When changing both AR and CL,
prefer [`change_seq_and_context_length()`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-api.html#qairt.optimizer.onnx.change_seq_and_context_length) so the rewrite is performed in
a single pass.

The constraint `1 ≤ new_seq_length ≤ context_length - 1` must hold.

- Parameters

    - **ctx** – Graph context containing the LLM model.
    - **new\_seq\_length** – New sequence length (AR) to apply.  Must be at least
      `1` and strictly less than the model’s context length.
    - **axis\_denotation\_config** – Optional axis-denotation configuration.  Leave
      as `None` to use the built-in denotation rules (which cover the
      standard HuggingFace LLM input names: `input_ids`,
      `attention_mask`, `past_key_*`, `past_value_*`, etc.).
      Provide a custom config when your model uses non-standard input
      tensor names that the built-in rules cannot identify.  See
      [`AxisDenotationConfig`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt.optimizer.onnx.AxisDenotationConfig) for details.

- Returns

    - The same [`GraphContext`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt.optimizer.onnx.GraphContext), modified in-place.

Example:

    from qairt.optimizer.onnx import change_seq_length
    from qairt.optimizer.onnx import GraphContext
    
    ctx = GraphContext.from_files("model.onnx")
    change_seq_length(ctx, 128)
    ctx.export("./output", prefix="model_modified")
    
    # With custom seed rules
    from qairt.optimizer.onnx import (
        change_seq_length,
        AxisDenotationConfig,
        AxisDenotationSeedRule,
        AxisDenotation
    )
    
    axis_denotation_config = AxisDenotationConfig(
        custom_seed_rules=[
            AxisDenotationSeedRule(
                name_pattern=r"my_custom_input",
                denotations=[AxisDenotation.BATCH, AxisDenotation.SEQ_LENGTH]
            )
        ]
    )
    change_seq_length(ctx, 128, axis_denotation_config)

- qairt.optimizer.onnx.change\_context\_length(*ctx: GraphContext*, *new\_context\_length: int*, *axis\_denotation\_config: Optional[AxisDenotationConfig] = None*) → GraphContext

    - Change the context length ([CL](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-overview.html#term-CL)) of an LLM ONNX model.

Rewrites the CL axis throughout the graph — including KV-cache input/output
shapes, attention-mask shapes, and shape-dependent constants — so the
model can be deployed with a longer (or shorter) KV-cache window than the
one it was quantized for.  No re-quantization is required.

Common use cases:

- **Extending the context window:** deploy a model quantized at
`CL=2048` at `CL=4096` to accept longer prompts.
- **Trimming the context window:** reduce CL on a memory-constrained
device when the application does not need the full window.

The function modifies `ctx` in-place.  When changing both AR and CL,
prefer [`change_seq_and_context_length()`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-api.html#qairt.optimizer.onnx.change_seq_and_context_length) so the rewrite is performed in
a single pass.

The constraint `1 ≤ seq_length ≤ new_context_length - 1` must hold.

- Parameters

    - **ctx** – Graph context containing the LLM model.
    - **new\_context\_length** – New context length (CL) to apply.  Must be strictly
      greater than the model’s sequence length (AR).
    - **axis\_denotation\_config** – Optional axis-denotation configuration.  Leave
      as `None` to use the built-in denotation rules (which cover the
      standard HuggingFace LLM KV-cache and attention-mask names).
      Provide a custom config when your model uses non-standard tensor
      names that the built-in rules cannot identify.  See
      [`AxisDenotationConfig`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt.optimizer.onnx.AxisDenotationConfig) for details.

- Returns

    - The same [`GraphContext`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt.optimizer.onnx.GraphContext), modified in-place.

Example:

    from qairt.optimizer.onnx import change_context_length
    from qairt.optimizer.onnx import GraphContext
    
    ctx = GraphContext.from_files("model.onnx")
    change_context_length(ctx, 2048)
    ctx.export("./output", prefix="model_modified")

- qairt.optimizer.onnx.change\_seq\_and\_context\_length(*ctx: GraphContext*, *new\_seq\_length: int*, *new\_context\_length: int*, *axis\_denotation\_config: Optional[AxisDenotationConfig] = None*) → GraphContext

    - Change both sequence length ([AR](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-overview.html#term-AR)) and context length ([CL](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-overview.html#term-CL)) of
an LLM ONNX model in a single pass.

Rewrites the AR and CL axes throughout the graph — input/output shapes,
KV-cache shapes, attention-mask shapes, and shape-dependent constants — so
the model can be deployed at a different (AR, CL) pair than the one it was
quantized for.  No re-quantization is required.

Prefer this function over chained calls to [`change_seq_length()`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-api.html#qairt.optimizer.onnx.change_seq_length) and
[`change_context_length()`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-api.html#qairt.optimizer.onnx.change_context_length) when both values change: a single pass is
faster and avoids the transient invariant violations (such as `AR > CL`) that
chained individual rewrites could pass through.

The function modifies `ctx` in-place.

The constraint `1 ≤ new_seq_length ≤ new_context_length - 1` must hold.

- Parameters

    - **ctx** – Graph context containing the LLM model.
    - **new\_seq\_length** – New sequence length (AR) to apply.  Must be at least
      `1` and strictly less than `new_context_length`.
    - **new\_context\_length** – New context length (CL) to apply.  Must be strictly
      greater than `new_seq_length`.
    - **axis\_denotation\_config** – Optional axis-denotation configuration.  Leave
      as `None` to use the built-in denotation rules (which cover the
      standard HuggingFace LLM input and KV-cache names).  Provide a
      custom config when your model uses non-standard tensor names that
      the built-in rules cannot identify.  See
      [`AxisDenotationConfig`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt.optimizer.onnx.AxisDenotationConfig) for details.

- Returns

    - The same [`GraphContext`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt.optimizer.onnx.GraphContext), modified in-place.

Example:

    from qairt.optimizer.onnx import change_seq_and_context_length
    from qairt.optimizer.onnx import GraphContext
    
    ctx = GraphContext.from_files("model.onnx")
    change_seq_and_context_length(ctx, 128, 2048)
    ctx.export("./output", prefix="model_modified")
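
To see why the combined call is preferable, the sketch below checks the documented constraint `1 ≤ AR ≤ CL - 1` after each step of a chained rewrite. It is plain Python and never calls the optimizer; `is_valid_pair` and `valid_chained_orders` are hypothetical helper names introduced only for illustration, and the (AR, CL) values are made up.

```python
def is_valid_pair(seq_length: int, context_length: int) -> bool:
    """Return True when 1 <= AR <= CL - 1, the constraint stated above."""
    return 1 <= seq_length <= context_length - 1


def valid_chained_orders(current: tuple[int, int], target: tuple[int, int]) -> list[str]:
    """List the chained orders whose intermediate (AR, CL) pair stays valid.

    Hypothetical helper: it only reasons about the numbers and never
    touches a model.
    """
    cur_ar, cur_cl = current
    new_ar, new_cl = target
    if not is_valid_pair(new_ar, new_cl):
        raise ValueError(f"target (AR={new_ar}, CL={new_cl}) violates 1 <= AR <= CL - 1")

    orders = []
    # AR first: the intermediate state is (new AR, old CL).
    if is_valid_pair(new_ar, cur_cl):
        orders.append("seq_then_context")
    # CL first: the intermediate state is (old AR, new CL).
    if is_valid_pair(cur_ar, new_cl):
        orders.append("context_then_seq")
    return orders


# Growing both values: only "CL first" keeps the intermediate pair valid.
grow = valid_chained_orders(current=(64, 128), target=(256, 512))

# Shrinking both values: only "AR first" keeps the intermediate pair valid.
shrink = valid_chained_orders(current=(256, 512), target=(64, 128))

print(grow, shrink)
```

With `change_seq_and_context_length()` there is no intermediate pair, so the ordering does not have to be worked out by hand.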

## MHA to SHA

- qairt.optimizer.onnx.convert\_mha\_to\_sha(*ctx: GraphContext*, *\**, *m2s\_head\_split\_map: Optional[Dict[int, int]] = None*, *m2s\_additional\_start\_points: Optional[list[qairt.optimizer.onnx.passes.mha2sha.config.M2sStartPoint]] = None*, *m2s\_additional\_end\_points: Optional[list[qairt.optimizer.onnx.passes.mha2sha.config.M2sEndPoint]] = None*, *extract\_lorav2\_alpha: bool = False*, *permute\_kv\_cache\_io: bool = False*, *key\_cache\_name\_pattern: str = 'past\_key\_(\\d)+\_in|past\_key\_(\\d)+\_out'*, *value\_cache\_name\_pattern: str = 'past\_value\_(\\d)+\_in|past\_value\_(\\d)+\_out'*, *enable\_experimental\_layout\_optimization: bool = False*, *validate: bool = False*, *input\_raw\_list\_path: Optional[str] = None*, *input\_raw\_base\_dir: Optional[str] = None*) → GraphContext

    - Convert Multi-Head Attention (MHA) to Single-Head Attention (SHA).

Splits each multi-head attention block in the model into individual per-head
sub-graphs and applies layout optimization (Transpose/Reshape simplification)
to produce a clean, efficient graph suitable for execution on Qualcomm NPUs.
This is the recommended entry point for the MHA→SHA transformation.

The function modifies `ctx` in-place.  Layout optimization always runs,
even on models that contain no MHA pattern.

- Parameters

    - **ctx** – Graph context containing the ONNX model and any associated metadata
      (quantization encodings, LoRA adapters, etc.).
    - **m2s\_head\_split\_map** – Maps input MHA head count to output SHA head count,
      controlling how aggressively each attention layer is split.  Head
      counts not present in the map fall back to the wildcard entry
      `-1` (default: `-1: 1`, i.e. split to size-1 heads).

        - `None` or `{}`: default behaviour; every multi-head
          attention layer is split into size-1 heads.
        - `{-1: 1}`: explicit form of the default.
        - `{32: 8, 16: 4, -1: 1}`: progressive reduction with wildcard
          (32 heads → 8, 16 heads → 4, every other count → 1 head).

      Note

      In a future release, if `m2s_head_split_map` is not `None`, head counts
      that are not listed in a map without a `-1` wildcard will be **left unchanged**
      rather than split to size-1 heads.  To stay compatible across
      versions, set the wildcard explicitly (e.g. `-1: 1`) when
      you want unspecified counts to be split.
    - **m2s\_additional\_start\_points** – Extra start points for non-standard
      attention architectures where the default QKV-MatMul detection does
      not find the attention block.  Each entry is an
      [`M2sStartPoint`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt.optimizer.onnx.M2sStartPoint) describing a tensor
      name pattern (regex), the head axis, and an optional per-pattern
      split map.  See [`M2sStartPoint`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt.optimizer.onnx.M2sStartPoint) for
      details.
    - **m2s\_additional\_end\_points** – Stopping points for MHA2SHA GroupSlice propagation.
    - **extract\_lorav2\_alpha** – Set to `True` **only for LoRA-v2 models** that
      store alpha values as graph constants.  When enabled, the alpha
      values are extracted into separate metadata rather than embedded
      in the SHA sub-graphs.  **Keep this as** `False` for non-LoRA
      models and for LoRA-v3 models; enabling it on those will produce
      incorrect output.
    - **permute\_kv\_cache\_io** – Set to `True` **for any LLM model that uses a
      KV cache** (essentially all transformer-based LLMs).  This
      permutes the KV-cache inputs/outputs from `[batch, head, ...]`
      to `[head, batch, ...]`, a more HTP-friendly layout that
      gives better on-target performance.  Only leave it at `False`
      for non-LLM models or models without a KV cache.
    - **key\_cache\_name\_pattern** – Regex pattern matching key-cache tensor names
      (inputs and outputs).  The default
      `past_key_(\d)+_in|past_key_(\d)+_out` covers the standard
      HuggingFace LLM convention.  Override only if your model uses a
      different naming scheme.
    - **value\_cache\_name\_pattern** – Regex pattern matching value-cache tensor
      names.  Same role as `key_cache_name_pattern` for the value side.
    - **enable\_experimental\_layout\_optimization** – Use the experimental layout
      optimizer instead of the default.  Enable only if the default
      layout optimizer causes a performance regression on your specific
      model; otherwise leave at `False`.
    - **validate** – When `True`, run ONNX Runtime on the model before and after
      the transformation and verify numerical equivalence.  Slower, but
      recommended the first time you run MHA→SHA on a new model.  By
      default, random inputs are used; provide `input_raw_list_path`
      (and optionally `input_raw_base_dir`) to use real inputs.
    - **input\_raw\_list\_path** – Path to a text file listing raw input tensor files
      to use during validation, one entry per line.  Used only when
      `validate=True`.  If `None`, validation uses random inputs.
    - **input\_raw\_base\_dir** – Base directory for the raw input files referenced
      in `input_raw_list_path`.  Used only when `validate=True` and
      `input_raw_list_path` contains relative paths.

- Returns

    - The same [`GraphContext`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt.optimizer.onnx.GraphContext), modified in-place.
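
The wildcard lookup described for `m2s_head_split_map` can be sketched in plain Python. `resolve_sha_head_count` is a hypothetical helper, not part of `qairt.optimizer.onnx`. It mirrors the currently documented lookup order (exact head count, then the `-1` wildcard, then a default of `1`) and does not model the future-release change mentioned in the note.

```python
from typing import Dict, Optional

WILDCARD = -1  # key the map uses for "any other head count"


def resolve_sha_head_count(head_count: int, split_map: Optional[Dict[int, int]] = None) -> int:
    """Look up the mapped value for one attention layer's head count.

    Hypothetical helper: exact head count first, then the -1 wildcard,
    then the documented default of 1.
    """
    effective = split_map or {}  # None and {} both mean "use the default"
    if head_count in effective:
        return effective[head_count]
    return effective.get(WILDCARD, 1)


# The progressive map from the parameter description above.
progressive = {32: 8, 16: 4, -1: 1}
resolved = {heads: resolve_sha_head_count(heads, progressive) for heads in (32, 16, 12)}
print(resolved)
```

Listing the `-1` entry explicitly, as `progressive` does, keeps the result the same whether or not the missing-wildcard default changes in a later release.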

Example:

    from qairt.optimizer.onnx import convert_mha_to_sha, GraphContext
    
    ctx = GraphContext.from_files("model.onnx", "model.encodings")
    convert_mha_to_sha(ctx)
    ctx.export("./output", prefix="model_sha")
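
Before overriding `key_cache_name_pattern` or `value_cache_name_pattern`, it can help to check which of your tensor names the defaults already cover. The sketch below uses only the standard `re` module. It assumes the whole tensor name must match (`re.fullmatch`), which may differ from how the optimizer applies the pattern, and the tensor names are hypothetical.

```python
import re

# Defaults copied from the convert_mha_to_sha() signature.
KEY_CACHE_PATTERN = r"past_key_(\d)+_in|past_key_(\d)+_out"
VALUE_CACHE_PATTERN = r"past_value_(\d)+_in|past_value_(\d)+_out"


def classify_cache_tensor(name: str) -> str:
    """Return "key", "value", or "other" for a tensor name.

    Assumption: the whole name must match (re.fullmatch); the optimizer's
    own matching mode may differ.
    """
    if re.fullmatch(KEY_CACHE_PATTERN, name):
        return "key"
    if re.fullmatch(VALUE_CACHE_PATTERN, name):
        return "value"
    return "other"


# Hypothetical tensor names; "layer0.k_cache" stands in for a custom scheme.
names = ["past_key_0_in", "past_value_31_out", "layer0.k_cache", "input_ids"]
labels = {name: classify_cache_tensor(name) for name in names}
print(labels)
```

A KV-cache tensor that comes back as `"other"` suggests the model uses a different naming scheme and needs custom patterns.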

## Model Splitting

- qairt.optimizer.onnx.split\_llm(*ctx: GraphContext*, *num\_splits: int*, *\**, *split\_embedding: bool = False*, *split\_lm\_head: bool = False*, *input\_ids\_name: str = 'input\_ids'*, *input\_embeds\_name: str = 'inputs\_embeds'*, *validate: bool = False*) → list[qairt.optimizer.onnx.graph.GraphContext]

    - Split an LLM ONNX model into multiple sequential sub-models.

Cuts the model at residual-add boundaries (the add operations at the end of
each transformer layer) and produces `num_splits` sub-models whose outputs
feed directly into the next sub-model’s inputs.  Use this when the compiled
binary for the full model is too large to fit in a single device-execution
partition: each split is compiled into its own binary and the splits are
executed in pipeline on the same device.

The function modifies `ctx` in-place to produce the split graphs.  Each
returned [`GraphContext`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt.optimizer.onnx.GraphContext) carries its own copy of
the relevant encodings, LoRA adapters, and other metadata.

- Parameters

    - **ctx** – Graph context containing the LLM model to split.
    - **num\_splits** – Total number of sequential sub-models to produce, including
      any embedding or LM-head splits.  `split_embedding=True` and
      `split_lm_head=True` consume one slot each from this count rather
      than adding to it (e.g. `num_splits=4` with both flags enabled
      yields 1 embedding split + 2 transformer splits + 1 LM-head split).
      Must be small enough that the remaining transformer-split budget
      does not exceed the model’s layer count; otherwise `ValueError`
      is raised.
    - **split\_embedding** – When `True`, extract the embedding layer into its
      own split as the first output (useful when the embedding table is
      large and would dominate a transformer-layer split).  Consumes one
      slot from `num_splits`.
    - **split\_lm\_head** – When `True`, extract the language-model head into its
      own split as the last output.  This can speed up prefill on
      target: during prefill only the last token’s hidden state needs
      to flow through the LM head to produce the next token, so
      isolating the LM head into its own split lets the runtime skip
      it for the first `N - 1` tokens of an `AR=N` prefill.
      Consumes one slot from `num_splits`.
    - **input\_ids\_name** – Name of the integer token-ID graph input. Override when the model
      uses a non-standard name (e.g. `"tokens"`). If both `input_ids_name` and
      `input_embeds_name` are present in the graph, `input_ids_name` takes priority.
    - **input\_embeds\_name** – Name of the float embedding graph input. Use when the embedding
      layer has already been split out and the graph no longer contains `input_ids`;
      in that case, specify the name of the float embedding tensor here instead.
    - **validate** – When `True`, run each split on ONNX Runtime and verify that
      the chained outputs match the original model’s outputs on random
      inputs.  Slower, but recommended the first time you split a new
      model.

- Returns

    - A list of [`GraphContext`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-optimizer-passes-classes.html#qairt.optimizer.onnx.GraphContext) objects, one per
split, in execution order.  The list length is exactly `num_splits`.

- Raises

    - **ValueError** – If the model does not have enough transformer layers for
      the requested split count.
    - **EmbeddingNotFoundError** – If `split_embedding=True` but neither `input_ids_name`
      nor `input_embeds_name` is found as a graph input.
Example:

    from qairt.optimizer.onnx import GraphContext, split_llm
    
    ctx = GraphContext.from_files("path/to/model.onnx")
    splits = split_llm(ctx, num_splits=4, split_embedding=True, split_lm_head=True)
    for i, split in enumerate(splits):
        split.export("./output", prefix=f"split_{i}")
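
The slot accounting for `num_splits` can be checked ahead of time with a little arithmetic. The sketch below is plain Python: `transformer_split_budget` and its `num_layers` argument are hypothetical and not part of `qairt.optimizer.onnx`. The "at least one transformer split" check is an assumption, while the layer-count check follows the `ValueError` condition described above.

```python
def transformer_split_budget(
    num_splits: int,
    num_layers: int,
    split_embedding: bool = False,
    split_lm_head: bool = False,
) -> int:
    """Return how many of the num_splits slots are left for transformer layers.

    Hypothetical helper for illustration; it is not part of qairt.
    """
    # The embedding and LM-head splits each consume one slot from num_splits.
    reserved = int(split_embedding) + int(split_lm_head)
    budget = num_splits - reserved
    # Assumption: at least one slot must remain for the transformer layers.
    if budget < 1:
        raise ValueError(f"num_splits={num_splits} leaves no transformer split")
    # Documented condition: the budget must not exceed the layer count.
    if budget > num_layers:
        raise ValueError(
            f"{budget} transformer splits requested, but the model has {num_layers} layers"
        )
    return budget


# num_splits=4 with both flags: 1 embedding + 2 transformer + 1 LM-head split.
# num_layers=32 is a made-up layer count.
budget = transformer_split_budget(4, num_layers=32, split_embedding=True, split_lm_head=True)
print(budget)
```

Running a check like this before calling `split_llm()` turns an over-large `num_splits` into an early, readable error.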

## MoE Adaptation

- qairt.optimizer.onnx.adapt\_moe(*ctx: GraphContext*, *\**, *overridden\_subselection: Optional[int] = None*, *remove\_op\_predicate: bool = False*, *validate: bool = False*) → GraphContext

    - High-level API for Mixture-of-Experts (MoE) model adaptation.

Adapts the model in-place: extracts and adapts the AR=N and AR=1 MoE
components, inlines internal functions, removes dead code, and runs shape
inference.

When `validate` is `True`, the adapted model is saved to a temporary
directory and compared against the original (pre-transform) model using
ONNX Runtime with random inputs. The temporary directory is cleaned up
automatically after validation.

- Parameters

    - **ctx** – The model context to adapt.
    - **overridden\_subselection** – Override the number of experts selected per
      token. If `None`, the value is inferred from the model.
    - **remove\_op\_predicate** – Whether to remove the op-predicate `Where` ops.
      Defaults to `False`.
    - **validate** – Whether to verify the transformed model against the
      original using ONNX Runtime. Defaults to `False`.

- Returns

    - The same `GraphContext`, modified in-place.

Usage:

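    from qairt.optimizer.onnx import GraphContext, adapt_moe
    
    ctx = GraphContext.from_files("model.onnx")
    
    # Default adaptation.
    adapt_moe(ctx)
    
    # Alternatively, also remove the op-predicate Where ops.
    adapt_moe(ctx, remove_op_predicate=True)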

Last Published: Jun 16, 2026
