# Generation and Evaluation

Classes for text generation and model evaluation within the pipeline.

## LLMGenerator

Generator classes that restore the Hugging Face (HF) API on models with static shape constraints.

- *class* qairt.experimental.pipeline.torch.llm.generation.generator.HybridLLMGenerator(*model*, *tokenizer: PreTrainedTokenizer*, *sequence\_length: int*, *context\_length: int*, *config: Optional[PretrainedConfig] = None*, *attention\_mask\_min: int = -100*, *bypass\_adapted\_forward: bool = False*, *\*\*kwargs*)

    - Bases: [`LLMGenerator`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-pipeline-generation.html#qairt.experimental.pipeline.torch.llm.generation.generator.LLMGenerator)

Generator for hybrid architecture models (e.g., Qwen3.5 with GatedDeltaNet).
It extends LLMGenerator to manage heterogeneous past\_key\_values, consisting of
standard Key/Value caches for full\_attention layers, and conv\_states/recurrent\_states
for linear\_attention layers.

- *classmethod* create\_position\_embeddings(*model*, *config*, *position\_ids*, *dtype=torch.float32*)

    - Create position embeddings for the model.

- Parameters

    - - **model** – The language model containing a `RotaryEmbedding` submodule.
- **config** (*PretrainedConfig*) – Model configuration used to determine the
head dimension (`config.head_dim` or
`config.hidden_size // config.num_attention_heads`).
- **position\_ids** (*torch.Tensor*) – Position indices of shape
`(batch, seq_len)`.
- **dtype** (*torch.dtype*) – Data type for the output embeddings. Defaults
to `torch.float32`.

- Returns

    - A `(cos, sin)` pair, each of
shape `(batch, 1, seq_len, head_dim // 2)`.

- Return type

    - Tuple[torch.Tensor, torch.Tensor]
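
The shape contract above can be illustrated with a minimal, library-free sketch of rotary position embeddings. It is not the pipeline's implementation: the function name `rope_cos_sin`, the frequency base of `10000.0`, and the use of nested lists in place of tensors are all assumptions made for illustration.

```python
import math
from typing import List, Tuple

Nested = List[List[List[List[float]]]]


def rope_cos_sin(
    position_ids: List[List[int]], head_dim: int, base: float = 10000.0
) -> Tuple[Nested, Nested]:
    """Return (cos, sin), each indexed as [batch][1][seq_len][head_dim // 2].

    Hypothetical helper: `base` is an assumed default, not a documented value.
    """
    half = head_dim // 2
    # One inverse frequency per rotary pair: base ** (-2i / head_dim).
    inv_freq = [base ** (-2.0 * i / head_dim) for i in range(half)]
    cos: Nested = []
    sin: Nested = []
    for row in position_ids:
        cos_rows = [[math.cos(pos * f) for f in inv_freq] for pos in row]
        sin_rows = [[math.sin(pos * f) for f in inv_freq] for pos in row]
        # Wrap in a singleton list to mimic the broadcastable head axis.
        cos.append([cos_rows])
        sin.append([sin_rows])
    return cos, sin


cos, sin = rope_cos_sin([[0, 1, 2]], head_dim=8)
print(len(cos), len(cos[0]), len(cos[0][0]), len(cos[0][0][0]))  # 1 1 3 4
```

The singleton second axis lets the same `(cos, sin)` pair broadcast across all attention heads.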

- *classmethod* prepare\_inputs(*model*, *input\_ids: Optional[Tensor]*, *attention\_mask: Optional[Tensor]*, *past\_key\_values: List[Tuple[Tensor, Tensor]]*, *sequence\_length: int*, *context\_length: int*, *attention\_mask\_min: int = -100*, *inputs\_embeds: Optional[Tensor] = None*, *position\_ids: Optional[Tensor] = None*, *\**, *cache\_index: Optional[Tensor] = None*, *pad\_token: Optional[int] = None*, *config: Optional[PretrainedConfig] = None*, *dtype: Optional[dtype] = None*, *\*\*kwargs*) → Dict[str, Union[Tensor, Tuple[Tensor, Tensor], List[Tuple[Tensor, Tensor]], Tuple[Tuple[Tensor, Tensor], ...]]]

    - Prepare all inputs for a model forward pass under static graph constraints.

- Parameters

    - - **model** – The language model. Used to access `model.config`,
`model.dtype`, `model.device`, and the RoPE layer.
- **input\_ids** (*torch.Tensor*  *|* *None*) – Token IDs of shape
`(batch, input_length)`. Mutually exclusive with
`inputs_embeds`.
- **attention\_mask** (*torch.Tensor*  *|* *None*) – Attention mask of shape
`(batch, input_length)`. If `None`, a mask of ones is
created.
- **past\_key\_values** (*List* *[* *Tuple* *[* *torch.Tensor* *,* *torch.Tensor* *]* *]*) – Cached
key/value pairs from previous steps. Pass an empty list for the
first step.
- **sequence\_length** (*int*) – Static sequence length (ARN) the model
expects per forward pass.
- **context\_length** (*int*) – Total context window size (KV cache capacity +
sequence length).
- **attention\_mask\_min** (*int*) – Minimum value used to clamp the causal
attention mask (large negative number to mask out positions).
Defaults to `-100`.
- **inputs\_embeds** (*torch.Tensor*  *|* *None*) – Pre-computed embeddings of
shape `(batch, input_length, hidden_dim)`. Mutually exclusive
with `input_ids`.
- **position\_ids** (*torch.Tensor*  *|* *None*) – Explicit position IDs of shape
`(batch, input_length)`. If `None`, they are derived from
the cumulative sum of the attention mask.
- **cache\_index** (*torch.Tensor*  *|* *None*) – KV cache write position
for the current step. If `None`, it is inferred from the
length of `past_key_values`.
- **pad\_token** (*int*  *|* *None*) – Token ID used to pad `input_ids` to
`sequence_length`. If `None`, `0` is used.
- **config** (*PretrainedConfig*  *|* *None*) – Model configuration override. If
`None`, `model.config` is used. Defaults to `None`.
- **dtype** (*torch.dtype*  *|* *None*) – Data type for KV cache and attention mask
tensors. If `None`, `model.dtype` is used. Defaults to `None`.
- **\*\*kwargs** – Additional keyword arguments forwarded to
`_prepare_attention_mask()`.

- Returns

    - A dictionary with the following keys:

- `"input_ids"` or `"inputs_embeds"` (torch.Tensor): Input
tokens or embeddings padded to `sequence_length`.
- `"attention_mask"` (torch.Tensor): Causal attention mask of
shape `(batch, 1, sequence_length, context_length)`, clamped
to `attention_mask_min`.
- `"position_ids"` (Tuple[torch.Tensor, torch.Tensor]): RoPE
`(cos, sin)` embeddings derived from position IDs.
- `"past_key_values"` (List[Tuple[torch.Tensor, torch.Tensor]]):
KV cache padded to `context_length`.
- `"cache_index"` (torch.Tensor): Scalar tensor indicating the
current write position in the KV cache.

- Return type

    - Dict[str, Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor], List[Tuple[torch.Tensor, torch.Tensor]], Tuple[Tuple[torch.Tensor, torch.Tensor], …]]]

- Raises

    - **ValueError** – If both `input_ids` and `inputs_embeds` are
    provided, or if neither is provided.
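
Two of the rules described above, the mutually exclusive inputs and the derivation of position IDs from the cumulative sum of the attention mask, can be sketched in plain Python. The helper names are hypothetical, and the zero-based convention (cumulative sum minus one, clamped at zero for padded positions) is an assumption borrowed from common Hugging Face practice rather than something the reference states.

```python
from itertools import accumulate
from typing import List, Optional, Sequence


def check_exclusive(input_ids: Optional[Sequence], inputs_embeds: Optional[Sequence]) -> None:
    """Raise ValueError unless exactly one of the two inputs is given."""
    if (input_ids is None) == (inputs_embeds is None):
        raise ValueError("Provide exactly one of input_ids or inputs_embeds.")


def position_ids_from_mask(attention_mask: List[List[int]]) -> List[List[int]]:
    """Derive zero-based position IDs from a 0/1 attention mask.

    Assumption: padded positions (mask == 0) before the first real token get position 0.
    """
    result = []
    for row in attention_mask:
        running = accumulate(row)  # cumulative sum along the sequence axis
        result.append([max(count - 1, 0) for count in running])
    return result


check_exclusive(input_ids=[[7, 8, 9]], inputs_embeds=None)  # passes silently
print(position_ids_from_mask([[0, 0, 1, 1, 1]]))  # [[0, 0, 0, 1, 2]]
```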

- *class* qairt.experimental.pipeline.torch.llm.generation.generator.LLMGenerationMixin

    - Bases: `GenerationMixin`

Helper class to restore the HuggingFace LLM API to Torch and ONNX models with static shape requirements, including
the `forward` and `generate` APIs.

- *classmethod* create\_position\_embeddings(*model*, *config*, *position\_ids*, *dtype=torch.float32*)

    - Create position embeddings for the model.

- Parameters

    - - **model** – The language model containing a `RotaryEmbedding` submodule.
- **config** (*PretrainedConfig*) – Model configuration used to determine the
head dimension (`config.head_dim` or
`config.hidden_size // config.num_attention_heads`).
- **position\_ids** (*torch.Tensor*) – Position indices of shape
`(batch, seq_len)`.
- **dtype** (*torch.dtype*) – Data type for the output embeddings. Defaults
to `torch.float32`.

- Returns

    - A `(cos, sin)` pair, each of
shape `(batch, 1, seq_len, head_dim // 2)`.

- Return type

    - Tuple[torch.Tensor, torch.Tensor]

- *classmethod* get\_input\_names(*num\_layers: int*, *\**, *io\_type: Optional[IOType] = None*) → Tuple[str, ...]

    - Return the ordered input tensor names for a model with `num_layers` transformer layers.

Uses `cls.io_config` when it is set and its `io_type` matches the
requested *io\_type* (or when *io\_type* is `None`).  A new
`LlmIOConfig` is created when `cls.io_config` is `None` or
when *io\_type* is provided and differs from `cls.io_config.io_type`.

- Parameters

    - - **num\_layers** (*int*) – Number of transformer layers (i.e., the number of
key/value cache pairs).
- **io\_type** (*IOType*  *|* *None*) – Naming convention for input tensors.
Defaults to `IOType.GENIE` when no `io_config` is set.

- Returns

    - Ordered tuple of input tensor names.

- Return type

    - Tuple[str, …]
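
The reuse-or-recreate rule for `io_config` described above can be sketched with stand-in types. `IOTypeStub`, `IOConfigStub`, the `OTHER_CONVENTION` member, and `resolve_io_config` are all hypothetical names invented for this illustration; only the `GENIE` default and the matching rule come from the description.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class IOTypeStub(Enum):
    """Stand-in for IOType; OTHER_CONVENTION is a made-up second member."""
    GENIE = "genie"
    OTHER_CONVENTION = "other"


@dataclass(frozen=True)
class IOConfigStub:
    """Stand-in for LlmIOConfig, reduced to the one field the rule needs."""
    io_type: IOTypeStub


def resolve_io_config(
    current: Optional[IOConfigStub], io_type: Optional[IOTypeStub]
) -> IOConfigStub:
    """Reuse `current` when it matches the request, otherwise build a new config."""
    if current is not None and (io_type is None or io_type == current.io_type):
        return current
    # No stored config, or the requested type differs: create a fresh one.
    return IOConfigStub(io_type if io_type is not None else IOTypeStub.GENIE)


stored = IOConfigStub(IOTypeStub.GENIE)
print(resolve_io_config(stored, None) is stored)  # True
print(resolve_io_config(None, None).io_type)  # IOTypeStub.GENIE
```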

- *classmethod* get\_output\_names(*num\_layers: int*, *\**, *io\_type: Optional[IOType] = None*) → Tuple[str, ...]

    - Return the ordered output tensor names for a model with `num_layers` transformer layers.

Uses `cls.io_config` when it is set and its `io_type` matches the
requested *io\_type* (or when *io\_type* is `None`).  A new
`LlmIOConfig` is created when `cls.io_config` is `None` or
when *io\_type* is provided and differs from `cls.io_config.io_type`.

- Parameters

    - - **num\_layers** (*int*) – Number of transformer layers (i.e., the number of
key/value cache pairs).
- **io\_type** (*IOType*  *|* *None*) – Naming convention for output tensors.
Defaults to `IOType.GENIE` when no `io_config` is set.

- Returns

    - Ordered tuple of output tensor names.

- Return type

    - Tuple[str, …]

- io\_config*: Optional[LlmIOConfig]*  *= None*


- *classmethod* prepare\_inputs(*model*, *input\_ids: Optional[Tensor]*, *attention\_mask: Optional[Tensor]*, *past\_key\_values: List[Tuple[Tensor, Tensor]]*, *sequence\_length: int*, *context\_length: int*, *attention\_mask\_min: int = -100*, *inputs\_embeds: Optional[Tensor] = None*, *position\_ids: Optional[Tensor] = None*, *\**, *cache\_index: Optional[Tensor] = None*, *pad\_token: Optional[int] = None*, *config: Optional[PretrainedConfig] = None*, *dtype: Optional[dtype] = None*, *\*\*kwargs*) → Dict[str, Union[Tensor, Tuple[Tensor, Tensor], List[Tuple[Tensor, Tensor]], Tuple[Tuple[Tensor, Tensor], ...]]]

    - Prepare all inputs for a model forward pass under static graph constraints.

- Parameters

    - - **model** – The language model. Used to access `model.config`,
`model.dtype`, `model.device`, and the RoPE layer.
- **input\_ids** (*torch.Tensor*  *|* *None*) – Token IDs of shape
`(batch, input_length)`. Mutually exclusive with
`inputs_embeds`.
- **attention\_mask** (*torch.Tensor*  *|* *None*) – Attention mask of shape
`(batch, input_length)`. If `None`, a mask of ones is
created.
- **past\_key\_values** (*List* *[* *Tuple* *[* *torch.Tensor* *,* *torch.Tensor* *]* *]*) – Cached
key/value pairs from previous steps. Pass an empty list for the
first step.
- **sequence\_length** (*int*) – Static sequence length (ARN) the model
expects per forward pass.
- **context\_length** (*int*) – Total context window size (KV cache capacity +
sequence length).
- **attention\_mask\_min** (*int*) – Minimum value used to clamp the causal
attention mask (large negative number to mask out positions).
Defaults to `-100`.
- **inputs\_embeds** (*torch.Tensor*  *|* *None*) – Pre-computed embeddings of
shape `(batch, input_length, hidden_dim)`. Mutually exclusive
with `input_ids`.
- **position\_ids** (*torch.Tensor*  *|* *None*) – Explicit position IDs of shape
`(batch, input_length)`. If `None`, they are derived from
the cumulative sum of the attention mask.
- **cache\_index** (*torch.Tensor*  *|* *None*) – KV cache write position
for the current step. If `None`, it is inferred from the
length of `past_key_values`.
- **pad\_token** (*int*  *|* *None*) – Token ID used to pad `input_ids` to
`sequence_length`. If `None`, `0` is used.
- **config** (*PretrainedConfig*  *|* *None*) – Model configuration override. If
`None`, `model.config` is used. Defaults to `None`.
- **dtype** (*torch.dtype*  *|* *None*) – Data type for KV cache and attention mask
tensors. If `None`, `model.dtype` is used. Defaults to `None`.
- **\*\*kwargs** – Additional keyword arguments forwarded to
`_prepare_attention_mask()`.

- Returns

    - A dictionary with the following keys:

- `"input_ids"` or `"inputs_embeds"` (torch.Tensor): Input
tokens or embeddings padded to `sequence_length`.
- `"attention_mask"` (torch.Tensor): Causal attention mask of
shape `(batch, 1, sequence_length, context_length)`, clamped
to `attention_mask_min`.
- `"position_ids"` (Tuple[torch.Tensor, torch.Tensor]): RoPE
`(cos, sin)` embeddings derived from position IDs.
- `"past_key_values"` (List[Tuple[torch.Tensor, torch.Tensor]]):
KV cache padded to `context_length`.
- `"cache_index"` (torch.Tensor): Scalar tensor indicating the
current write position in the KV cache.

- Return type

    - Dict[str, Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor], List[Tuple[torch.Tensor, torch.Tensor]], Tuple[Tuple[torch.Tensor, torch.Tensor], …]]]

- Raises

    - **ValueError** – If both `input_ids` and `inputs_embeds` are
    provided, or if neither is provided.
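
To make the `(sequence_length, context_length)` attention mask more concrete, here is a small sketch of an additive causal mask with a finite floor. The column layout is an assumption chosen for illustration (cached tokens in the leading columns, the current chunk directly after them, unused capacity at the end); the actual layout used by `prepare_inputs` may differ, and `build_causal_mask` is a hypothetical name.

```python
from typing import List


def build_causal_mask(
    past_len: int,
    input_len: int,
    sequence_length: int,
    context_length: int,
    mask_min: int = -100,
) -> List[List[int]]:
    """Return an additive mask indexed as [sequence_length][context_length].

    0 means "may attend"; `mask_min` means "masked out".
    """
    if past_len + sequence_length > context_length:
        raise ValueError("Cached tokens plus one chunk must fit in the context.")
    mask = []
    for q in range(sequence_length):
        is_real_query = q < input_len  # rows beyond input_len are padding
        row = []
        for k in range(context_length):
            # A real query sees every cached token and the chunk tokens up to itself.
            visible = is_real_query and k <= past_len + q
            row.append(0 if visible else mask_min)
        mask.append(row)
    return mask


for row in build_causal_mask(past_len=2, input_len=2, sequence_length=3, context_length=6):
    print(row)
# [0, 0, 0, -100, -100, -100]
# [0, 0, 0, 0, -100, -100]
# [-100, -100, -100, -100, -100, -100]
```

A finite floor such as `-100` keeps the softmax of a fully masked padding row well defined, whereas a row of negative infinities would produce NaN values.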

- prepare\_inputs\_for\_generation(*input\_ids: LongTensor*, *past\_key\_values: Optional[Cache] = None*, *attention\_mask: Optional[LongTensor] = None*, *inputs\_embeds: Optional[FloatTensor] = None*, *cache\_position: Optional[LongTensor] = None*, *\*\*kwargs: Any*) → Dict[str, Optional[Union[Tensor, Cache]]]

    - Prepare model inputs for a single generation step.

Overrides the HuggingFace `GenerationMixin.prepare_inputs_for_generation`
to support models with static graph constraints. Slices away already-
processed tokens from `input_ids` or `inputs_embeds` based on the
number of tokens present in `past_key_values`.

- Parameters

    - - **input\_ids** (*torch.LongTensor*) – Token IDs of shape `(batch, seq_len)`.
Mutually exclusive with `inputs_embeds`.
- **past\_key\_values** (*Cache*  *|* *None*) – Cache object holding previously
computed key/value states. If `None`, no tokens have been
processed yet.
- **attention\_mask** (*torch.LongTensor*  *|* *None*) – Attention mask of shape
`(batch, seq_len)`. Passed through unchanged.
- **inputs\_embeds** (*torch.FloatTensor*  *|* *None*) – Pre-computed input
embeddings of shape `(batch, seq_len, hidden_dim)`. Mutually
exclusive with `input_ids`.
- **cache\_position** (*torch.LongTensor*  *|* *None*) – Unused; kept for API
compatibility with HuggingFace `GenerationMixin`.
- **\*\*kwargs** – Additional keyword arguments forwarded to the model.

- Returns

    - A dictionary containing
`"input_ids"` or `"inputs_embeds"` (sliced to unprocessed
tokens), `"attention_mask"`, and `"past_key_values"`.

- Return type

    - Dict[str, torch.Tensor | Cache | None]

- Raises

    - **ValueError** – If both `input_ids` and `inputs_embeds` are provided,
    or if neither is provided.
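
The slicing step described above can be sketched as follows, using nested lists in place of tensors. `prepare_step` and `past_length` are hypothetical names; the sketch assumes the cache length is already known as an integer rather than being read from a cache object.

```python
from typing import Dict, List


def prepare_step(
    input_ids: List[List[int]], attention_mask: List[List[int]], past_length: int
) -> Dict[str, List[List[int]]]:
    """Keep only the tokens that the cache has not processed yet."""
    return {
        # Drop the first `past_length` tokens of every sequence in the batch.
        "input_ids": [row[past_length:] for row in input_ids],
        # The attention mask is passed through unchanged.
        "attention_mask": attention_mask,
    }


step = prepare_step([[11, 12, 13, 14]], [[1, 1, 1, 1]], past_length=3)
print(step["input_ids"])  # [[14]]
```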

- *static* slice\_inputs\_for\_inference(*input\_ids: Optional[Tensor]*, *attention\_mask: Tensor*, *sequence\_length: int*, *inputs\_embeds: Optional[Tensor] = None*, *position\_ids: Optional[Tensor] = None*, *hidden\_states: Optional[Tensor] = None*, *\*\*kwargs*) → Iterable[Dict[str, Optional[Tensor]]]

    - Slice inputs into chunks suitable for inference.

Slices provided inputs based on the autoregressive window size and
yields per-chunk dictionaries containing aligned slices for input IDs or
embeddings, attention mask, position IDs, and optionally hidden states.

- Parameters

    - - **input\_ids** – Input token IDs of shape (batch, seq\_len). Provide either
`input_ids` or `inputs_embeds`, not both.
- **attention\_mask** – Attention mask of shape (batch, seq\_len). If not
provided, a mask of ones is created.
- **sequence\_length** – Maximum number of tokens the model consumes per step
(ARN length).
- **inputs\_embeds** – Input embeddings of shape (batch, seq\_len, hidden\_dim).
Provide either `input_ids` or `inputs_embeds`, not both.
- **position\_ids** – Position IDs of shape (batch, seq\_len).
- **hidden\_states** – Optional hidden states of shape (batch, seq\_len, …)
aligned with inputs.

- Yields

    - A dictionary with keys like `input_ids` or `inputs_embeds`,
`attention_mask`, `position_ids`, and optionally
`hidden_states` for each chunk.

- Raises

    - **ValueError** – If both `input_ids` and `inputs_embeds` are provided.
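
A minimal sketch of chunked slicing, written as a generator over nested lists. It assumes that chunks are cut from the start of the sequence so that only the final chunk can be shorter than `sequence_length`; the real method may align chunks differently. The name `slice_for_inference` is hypothetical.

```python
from typing import Dict, Iterator, List, Optional


def slice_for_inference(
    input_ids: List[List[int]],
    attention_mask: List[List[int]],
    sequence_length: int,
    position_ids: Optional[List[List[int]]] = None,
) -> Iterator[Dict[str, Optional[List[List[int]]]]]:
    """Yield aligned slices of at most `sequence_length` tokens each."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive.")
    total = len(input_ids[0])
    for start in range(0, total, sequence_length):
        end = min(start + sequence_length, total)
        yield {
            "input_ids": [row[start:end] for row in input_ids],
            "attention_mask": [row[start:end] for row in attention_mask],
            "position_ids": (
                None if position_ids is None else [row[start:end] for row in position_ids]
            ),
        }


for chunk in slice_for_inference([[1, 2, 3, 4, 5]], [[1, 1, 1, 1, 1]], sequence_length=2):
    print(chunk["input_ids"])
# [[1, 2]]
# [[3, 4]]
# [[5]]
```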

- *class* qairt.experimental.pipeline.torch.llm.generation.generator.LLMGenerator(*model*, *tokenizer: PreTrainedTokenizer*, *sequence\_length: int*, *context\_length: int*, *config: Optional[PretrainedConfig] = None*, *attention\_mask\_min: int = -100*, *bypass\_adapted\_forward: bool = False*, *\*\*kwargs*)

    - Bases: [`LLMGenerationMixin`](https://docs.qualcomm.com/doc/80-87189-2/topic/qairt-pipeline-generation.html#qairt.experimental.pipeline.torch.llm.generation.generator.LLMGenerationMixin), `Module`

- *static* can\_generate() → bool


- *property* config*: PretrainedConfig*


- *property* device*: device*


- *property* dtype*: dtype*


- forward(*input\_ids: Optional[Tensor] = None*, *attention\_mask: Optional[Tensor] = None*, *past\_key\_values: Optional[DynamicCache] = None*, *inputs\_embeds: Optional[FloatTensor] = None*, *position\_ids: Optional[Tensor] = None*, *hidden\_states: Optional[Tensor] = None*, *cache\_index: Optional[Tensor] = None*, *\*\*kwargs*) → CausalLMOutputWithPast

    - Run a full forward pass over the input sequence.

Slices the input into chunks of `sequence_length` tokens, prepares
static-shape inputs for each chunk via `prepare_inputs()`, runs the
wrapped model, and accumulates logits and KV cache across all chunks.

- Parameters

    - - **input\_ids** (*torch.Tensor*  *|* *None*) – Token IDs of shape
`(batch, seq_len)`. Mutually exclusive with
`inputs_embeds`.
- **attention\_mask** (*torch.Tensor*  *|* *None*) – Attention mask of shape
`(batch, seq_len)`. If `None`, a mask of ones is created
automatically.
- **past\_key\_values** (*DynamicCache*  *|* *None*) – HuggingFace cache object
holding previously computed key/value states. If `None` or
empty, the KV cache is initialised from scratch.
- **inputs\_embeds** (*torch.FloatTensor*  *|* *None*) – Pre-computed input
embeddings of shape `(batch, seq_len, hidden_dim)`. Mutually
exclusive with `input_ids`.
- **position\_ids** (*torch.Tensor*  *|* *None*) – Explicit position IDs of shape
`(batch, seq_len)`. If `None`, they are derived from the
attention mask.
- **hidden\_states** (*torch.Tensor*  *|* *None*) – Optional hidden states of
shape `(batch, seq_len, hidden_dim)` aligned with the input.
Passed through to `slice_inputs_for_inference()`.
- **cache\_index** (*torch.Tensor*  *|* *None*) – KV cache write position
override. If `None`, the position is inferred from the
current cache length.
- **\*\*kwargs** – Additional keyword arguments forwarded to
`prepare_inputs()`.

- Returns

    - A HuggingFace output object containing:

- `logits` (torch.FloatTensor): Concatenated logits of shape
`(batch, seq_len, vocab_size)` cast to `float32`.
- `past_key_values` (Tuple[Tuple[torch.Tensor, …], …] | None):
Updated KV cache as a tuple of `(key, value)` pairs, one per
layer, moved to `self.device`. `None` if no KV cache was
produced.

- Return type

    - CausalLMOutputWithPast

- Raises

    - **ValueError** – If both `input_ids` and `inputs_embeds` are
    provided, or if neither is provided.
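
The chunk, run, and accumulate loop described for `forward` can be sketched with a toy model. Everything here is illustrative: `toy_model` is a hypothetical stand-in that returns one fake score per token, the cache is a flat list of token IDs for a single sequence, and trimming the cache to the most recent `context_length` entries is an assumption.

```python
from typing import List, Tuple


def toy_model(chunk: List[int], cache: List[int]) -> Tuple[List[List[float]], List[int]]:
    """Hypothetical model: the score for each token is the count of tokens visible to it."""
    logits = [[float(len(cache) + i + 1)] for i in range(len(chunk))]
    return logits, list(chunk)  # the new cache entries are the chunk's tokens


def chunked_forward(
    tokens: List[int], sequence_length: int, context_length: int
) -> Tuple[List[List[float]], List[int]]:
    """Process `tokens` in fixed-size chunks, accumulating logits and the cache."""
    cache: List[int] = []
    all_logits: List[List[float]] = []
    for start in range(0, len(tokens), sequence_length):
        chunk = tokens[start:start + sequence_length]
        logits, new_entries = toy_model(chunk, cache)
        all_logits.extend(logits)  # concatenate along the sequence axis
        # Keep only the most recent entries so the cache never exceeds its capacity.
        cache = (cache + new_entries)[-context_length:]
    return all_logits, cache


logits, cache = chunked_forward([5, 6, 7, 8, 9], sequence_length=2, context_length=4)
print(logits)  # [[1.0], [2.0], [3.0], [4.0], [5.0]]
print(cache)   # [6, 7, 8, 9]
```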

- *property* main\_input\_name*: str*


- qairt.experimental.pipeline.torch.llm.generation.generator.get\_past\_keyval\_with\_shift(*past\_key\_vals: List[Tuple[Tensor, Tensor]]*, *new\_key\_vals: List[Tuple[Tensor, Tensor]]*, *length: int*, *device: device = device(type='cpu')*, *dtype: dtype = torch.float32*, *transposed\_key\_cache: bool = False*) → List[Tuple[Tensor, Tensor]]

    - Combine past\_key\_vals with new\_key\_vals and clip to at most `length` tokens of context.

Concatenates existing cached key/value tensors with newly computed ones,
then clips the result so the sequence dimension does not exceed `length`.

- When `transposed_key_cache` is `True`:
    - Key tensors have shape `(batch, heads, head_dim, seq_len)` and are concatenated on `dim=3`.
    - Value tensors have shape `(batch, heads, seq_len, head_dim)` and are concatenated on `dim=2`.

- When `transposed_key_cache` is `False`:
    - Both key and value tensors are concatenated on `dim=2`.

- Parameters

    - - **past\_key\_vals** – Previously cached key/value pairs, each element being a
`(key, value)` tuple. Pass an empty list when there is no prior
cache.
- **new\_key\_vals** – Newly computed key/value pairs for the current step, each
element being a `(key, value)` tuple. Must have the same number of
layers as `past_key_vals` when `past_key_vals` is non-empty.
- **length** – Maximum number of tokens to retain in the sequence dimension
after concatenation.
- **device** – Target device for the output tensors. Defaults to
`torch.device("cpu")`.
- **dtype** – Target data type for the output tensors. Defaults to
`torch.float32`.
- **transposed\_key\_cache** – If `True`, key tensors are treated as
transposed `(batch, heads, head_dim, seq_len)` and clipped on
`dim=3`. Defaults to `False`.

- Returns

    - A list of `(key, value)` tuples with the same number of layers as
`new_key_vals`, where each tensor’s sequence dimension is clipped to
at most `length` tokens and cast to `dtype` on `device`.
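
The concatenate-then-clip behaviour can be sketched on nested lists for a single batch element and a single head. The sketch assumes that clipping keeps the most recent `length` tokens and drops the oldest, which the word "shift" suggests but the description does not state outright; device and dtype handling are omitted. `concat_and_clip` is a hypothetical name.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Layer = Tuple[Matrix, Matrix]


def concat_and_clip(
    past: List[Layer], new: List[Layer], length: int, transposed_key_cache: bool = False
) -> List[Layer]:
    """Append `new` to `past` per layer and keep the last `length` tokens.

    Values are indexed [seq_len][head_dim]. Keys use the same layout unless
    `transposed_key_cache` is True, in which case they are [head_dim][seq_len].
    """
    if length <= 0:
        raise ValueError("length must be positive.")
    result = []
    for layer, (new_key, new_value) in enumerate(new):
        if past:
            past_key, past_value = past[layer]
        else:
            # An empty cache: zero tokens in whichever layout the key uses.
            past_key = [[] for _ in new_key] if transposed_key_cache else []
            past_value = []
        if transposed_key_cache:
            # Sequence is the last axis: extend each head_dim row, then clip it.
            key = [(old + cur)[-length:] for old, cur in zip(past_key, new_key)]
        else:
            key = (past_key + new_key)[-length:]
        value = (past_value + new_value)[-length:]
        result.append((key, value))
    return result


past = [([[1.0], [2.0]], [[10.0], [20.0]])]
new = [([[3.0]], [[30.0]])]
print(concat_and_clip(past, new, length=2))
# [([[2.0], [3.0]], [[20.0], [30.0]])]
```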

## Evaluator

Core evaluation orchestrator.

- qairt.experimental.pipeline.torch.llm.evaluation.evaluator.run\_evaluation(*metrics\_config: list[dict[str, Any]]*, *model: Optional[Any] = None*, *tokenizer: Optional[Any] = None*, *context\_length: int = 2048*, *model\_forward\_kwargs: Optional[dict[str, Any]] = None*, *output\_dir: Optional[str] = None*, *\*\*kwargs*) → dict[str, float]

    - Run all configured metrics and return `{display_name: score}`.

- Parameters

    - - **metrics\_config** – List of metric specification dicts. Each dict must
contain a `"name"` key (e.g. `"PPL"`) plus any metric-specific
options.
- **model** – The model to evaluate (`torch.nn.Module`).
- **tokenizer** – Tokenizer compatible with *model*.
- **context\_length** – Maximum sequence length for evaluation chunks.
- **model\_forward\_kwargs** – Extra keyword arguments forwarded to
`model.forward()` (e.g. `use_cache`).
- **output\_dir** – If provided, results are written as JSON and plain-text
tables to this directory.
- **\*\*kwargs** – Additional options forwarded to each metric’s
`evaluate()` call.

- Returns

    - Dictionary mapping `display_name` to the scalar metric score.
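
The orchestration pattern, a list of metric specifications dispatched by `"name"` and collected into `{display_name: score}`, can be sketched as below. This is not the evaluator's code: the `METRICS` registry, the `run_metrics` function, the `"display_name"` key in the spec, and the `token_log_probs` input are assumptions made for illustration. Perplexity is computed with its standard definition, the exponential of the negative mean token log-probability.

```python
import math
from typing import Any, Callable, Dict, List


def perplexity(token_log_probs: List[float], **_: Any) -> float:
    """Perplexity = exp(-mean(log p(token)))."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical registry that maps a metric name to its implementation.
METRICS: Dict[str, Callable[..., float]] = {"PPL": perplexity}


def run_metrics(metrics_config: List[Dict[str, Any]], **kwargs: Any) -> Dict[str, float]:
    """Run every configured metric and return {display_name: score}."""
    results: Dict[str, float] = {}
    for spec in metrics_config:
        metric_fn = METRICS[spec["name"]]
        # Everything except the naming keys is treated as a metric-specific option.
        options = {k: v for k, v in spec.items() if k not in ("name", "display_name")}
        results[spec.get("display_name", spec["name"])] = metric_fn(**options, **kwargs)
    return results


scores = run_metrics(
    [{"name": "PPL", "display_name": "ppl_demo"}],
    token_log_probs=[math.log(0.5)] * 4,
)
print(scores)  # one entry, "ppl_demo", with a value of approximately 2.0
```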

Last Published: Jun 19, 2026
