# Pipeline Overview

## What is the Pipeline?

The Pipeline API (`qairt.experimental.pipeline`) provides a declarative,
stage-based orchestration system for model optimization workflows. It replaces
the traditional manual approach, in which users run 50+ notebook cells for model
loading, adaptation, quantization, and compilation, with a single YAML recipe
and a few lines of Python.

    from qairt.api.configs.device import Device
    from qairt.experimental.pipeline.torch.llm.pipeline import LLMPipeline

    # Load the model and bind it to a YAML recipe
    pipe = LLMPipeline.from_pretrained(
        "meta-llama/Llama-3.2-3B-Instruct",
        recipe="llama32_recipe.yaml"
    )
    # Run the stages configured in the recipe
    pipe.construct()

    # Replace <serial> and <hostname> with the values for the target device
    device = Device(type="android", identifier="<serial>@<hostname>")
    result = pipe.generate("Hello, how are you?", device=device)
    result.print()

## Architecture

The pipeline is organized in three layers. At the bottom are the **building
blocks**: standalone functional components that do the actual work. In the
middle are **stages**: thin wrappers that encapsulate one or more building
blocks behind a uniform interface. At the top is the **pipeline orchestrator**,
which reads a YAML recipe and runs stages in sequence.

    +----------------------------------------------------------------------+
    |                      Pipeline Orchestrator                            |
    |                                                                       |
    |  LLMPipeline: from_pretrained() -> construct() -> generate/evaluate  |
    |  Configured by: YAML Recipe                                           |
    +----------------------------------+-----------------------------------+
                                       |
                                       | runs
                                       v
    +----------------------------------------------------------------------+
    |                           Stages                                      |
    |                                                                       |
    |  model_loader --> quantization_opt --> quantization --> genai_builder |
    |  (thin wrappers around building blocks)                               |
    +----+-----------------+--------------------+------------------+--------+
         |                 |                    |                  |
         | wraps           | wraps              | wraps            | wraps
         v                 v                    v                  v
    +-----------+   +---------------+   +---------------+   +-----------+
    | Model     |   | Quantization  |   | Quantization  |   | GenAI     |
    | Loading   |   | Optimization  |   | Recipes       |   | Builder   |
    |           |   | Techniques    |   |               |   |           |
    | - Loader  |   |               |   | - LPBQ        |   | - Build   |
    | - Re-     |   | - SpinQuant   |   | - GPTAQ       |   | - Export  |
    |   author  |   | - SeqMSE      |   | - HF Quant    |   | - Run     |
    | - Adapt   |   | - AdaScale    |   |               |   |           |
    +-----------+   +---------------+   +---------------+   +-----------+
         Building Blocks (standalone, usable without the pipeline)

Users at different levels interact with different layers:

- **Level 1–3** (most users): Interact with the **pipeline orchestrator** via
recipes and Python API. The pipeline runs stages automatically.
- **Level 4** (advanced users): Interact with the **building blocks** directly,
orchestrating model loading, quantization, and compilation step-by-step
without the pipeline abstraction.

## Building Blocks

Building blocks are the foundational functional components. They can be used
independently of the pipeline — advanced users work with them directly for
maximum control over each step.

### Model Loading

Loads a HuggingFace model and applies Qualcomm-specific transformations.

- **Loader** — `QcAutoModelForCausalLM` / `QcAutoConfig` auto-dispatch to
the correct model-specific class based on HuggingFace config
- **Reauthoring** — Model-specific architecture transformations for on-device
execution (e.g., KV cache layout, attention re-implementation)
- **Adaptations** — Backend-specific module adaptations applied via `Adapter`
(e.g., HTP-optimized attention, custom normalization)

### Quantization

Quantizes model weights and activations for efficient on-device inference.
The quantization system is organized in layers:

1. **Base classes** — `AimetQuantizer`, `AimetOptimizer`, and
`AimetMixin` provide the foundation for all quantization techniques.
2. **Techniques** — Individual quantization and optimization algorithms
built on the base classes (SpinQuant, SeqMSE, LPBQ, AdaScale, GPTAQ).
3. **Recipes** — Pre-configured combinations of techniques that represent
complete quantization workflows (e.g., `LPBQ_SeqMSE_Recipe`,
`SpinQuant_AdaScale_Recipe`).
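The layering above can be illustrated with a toy sketch. Everything in it is hypothetical: `Technique`, `ClipOutliers`, `SymmetricInt8`, and `Recipe` are stand-in names, not the `AimetQuantizer` / `AimetOptimizer` classes, and the two toy techniques are generic textbook operations rather than SpinQuant, SeqMSE, LPBQ, AdaScale, or GPTAQ. It only shows how a shared base class, individual techniques, and a pre-configured combination might relate.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence


class Technique(ABC):
    """Hypothetical base class: every technique transforms a list of weights."""

    @abstractmethod
    def apply(self, weights: List[float]) -> List[float]:
        ...


class ClipOutliers(Technique):
    """Toy pre-quantization optimization: clamp weights to [-limit, limit]."""

    def __init__(self, limit: float) -> None:
        self.limit = limit

    def apply(self, weights: List[float]) -> List[float]:
        return [max(-self.limit, min(self.limit, w)) for w in weights]


class SymmetricInt8(Technique):
    """Toy quantizer: symmetric per-tensor int8 quantize-then-dequantize."""

    def apply(self, weights: List[float]) -> List[float]:
        max_abs = max(abs(w) for w in weights)
        if max_abs == 0.0:
            return list(weights)
        scale = max_abs / 127.0  # one step of the int8 grid
        return [round(w / scale) * scale for w in weights]


class Recipe:
    """Pre-configured, ordered combination of techniques."""

    def __init__(self, techniques: Sequence[Technique]) -> None:
        self.techniques = list(techniques)

    def apply(self, weights: List[float]) -> List[float]:
        for technique in self.techniques:
            weights = technique.apply(weights)
        return weights


# A "recipe" here is just the two toy techniques applied in a fixed order
clip_then_quantize = Recipe([ClipOutliers(limit=1.0), SymmetricInt8()])
quantized = clip_then_quantize.apply([0.5, -2.0, 0.25, 1.27])
print(quantized)
```

After clipping, every value lies in [-1, 1], and each quantized value differs from its clipped input by at most half a grid step.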

### GenAI Builder

Compiles a quantized model into an optimized container for on-device inference
on Qualcomm hardware. Handles graph transformations, weight packing, and
multi-graph construction.

## Stages

Stages are thin wrappers around building blocks. Each stage encapsulates one
step of the workflow behind a uniform interface, enabling the pipeline to
run them in sequence.

Built-in stages (in execution order):

1. `model_loader` — wraps Model Loading + Adaptations
2. `quantization_opt` — wraps pre-quantization optimization techniques
(SpinQuant, SeqMSE, AdaScale, GPTAQ)
3. `quantization` — wraps quantization recipes and the base quantizer
4. `genai_builder` — wraps GenAI Builder compilation
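A minimal sketch of the "uniform interface" idea, assuming each stage exposes a single `run()` method over a shared context dictionary. The `FunctionStage` class, the `run_stages` helper, and the four stand-in functions are hypothetical and are not part of the `qairt` API; only the stage names are taken from the list above.

```python
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]


class FunctionStage:
    """Hypothetical thin wrapper exposing one building-block function via run()."""

    def __init__(self, name: str, block: Callable[[Context], Context]) -> None:
        self.name = name
        self._block = block

    def run(self, context: Context) -> Context:
        # Merge the block's outputs into a copy so earlier results stay available
        updated = dict(context)
        updated.update(self._block(context))
        return updated


# Stand-in building blocks; real ones would load, optimize, quantize, and compile
def load_model(ctx: Context) -> Context:
    return {"model": f"loaded({ctx['model_id']})"}


def optimize(ctx: Context) -> Context:
    return {"model": f"optimized({ctx['model']})"}


def quantize(ctx: Context) -> Context:
    return {"model": f"quantized({ctx['model']})"}


def build(ctx: Context) -> Context:
    return {"container": f"compiled({ctx['model']})"}


def run_stages(stages: List[FunctionStage], context: Context) -> Context:
    """Run stages in list order, recording the name of each completed stage."""
    completed: List[str] = []
    for stage in stages:
        context = stage.run(context)
        completed.append(stage.name)
    return {**context, "completed": completed}


stages = [
    FunctionStage("model_loader", load_model),
    FunctionStage("quantization_opt", optimize),
    FunctionStage("quantization", quantize),
    FunctionStage("genai_builder", build),
]
result = run_stages(stages, {"model_id": "demo-model"})
print(result["container"])
print(result["completed"])
```

Because every stage shares the same `run()` signature, the runner needs no stage-specific logic; it only needs the order.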

## Pipeline Core Components

These components form the orchestration infrastructure that ties stages
together:

- **Pipeline**
    - The top-level orchestrator (`LLMPipeline`). Reads the recipe, instantiates
stages, resolves dependencies, and runs stages in sequence.

- **Recipe**
    - A YAML file that fully specifies a pipeline run: model identity, target
hardware, and per-stage configuration. See [Pipeline Configuration](https://docs.qualcomm.com/doc/80-87189-2/topic/pipeline_configuration.html)
for the full schema.

- **Registry**
    - The `StageRegistry` holds all available stages. Built-in stages are
registered at import time; custom stages use the `@register_stage`
decorator.

- **Cache**
    - When `enable_cache: true`, stage outputs are cached to disk. Subsequent
runs skip stages whose config (and upstream outputs) haven’t changed.
Cache keys form a chain — any config change invalidates the affected stage
and all downstream stages.

- **Manifest**
    - The `PipelineManifest` tracks completed stages and their artifact paths,
enabling `LLMPipeline.load(cache_dir)` to resume from any point.

- **Observers**
    - The observer pattern (`StageObserver`) allows monitoring stage execution.
The built-in `StageProfilerObserver` records timing and memory metrics.
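The chained cache keys described under **Cache** can be sketched as a hash chain. This is an illustration of the idea only, not the pipeline's actual key derivation: the helper names, the use of SHA-256 over JSON, and the per-stage config keys (`model_id`, `technique`, `bits`, `target`) are all assumptions.

```python
import hashlib
import json
from typing import Any, Dict, List, Tuple

StageConfigs = List[Tuple[str, Dict[str, Any]]]


def chained_cache_keys(stage_configs: StageConfigs) -> Dict[str, str]:
    """Return one key per stage; each key hashes the stage config plus the upstream key."""
    keys: Dict[str, str] = {}
    upstream_key = ""
    for name, config in stage_configs:
        payload = json.dumps(
            {"stage": name, "config": config, "upstream": upstream_key},
            sort_keys=True,  # stable serialization so equal configs hash equally
        )
        upstream_key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        keys[name] = upstream_key
    return keys


def stages_to_rerun(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Stages whose key changed cannot reuse their cached output."""
    return [name for name in new if old.get(name) != new[name]]


# Hypothetical per-stage configs; the keys inside each dict are made up
baseline: StageConfigs = [
    ("model_loader", {"model_id": "demo-model"}),
    ("quantization_opt", {"technique": "spinquant"}),
    ("quantization", {"bits": 4}),
    ("genai_builder", {"target": "demo-soc"}),
]

# Change only the quantization stage's config
changed: StageConfigs = [(name, dict(config)) for name, config in baseline]
changed[2][1]["bits"] = 8

rerun = stages_to_rerun(chained_cache_keys(baseline), chained_cache_keys(changed))
print(rerun)  # ['quantization', 'genai_builder']
```

Because each key includes the upstream key, editing one stage's config changes that stage's key and every key after it, while the keys of earlier stages stay the same.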

## Evaluation

The pipeline supports evaluation after any stage via `pipe.evaluate()`.
Configure metrics in the recipe’s `evaluator_config` section, then call
`evaluate()` after `construct()` completes any number of stages. This
enables comparing model quality at different points in the workflow (e.g.,
perplexity before and after quantization).
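For reference, perplexity is the exponential of the negative mean per-token log-likelihood. The sketch below computes it in plain Python from made-up log-probabilities; it does not call `pipe.evaluate()`, whose arguments and return type are not shown in this overview, and the numbers are illustrative only.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(-mean(log p(token))), using natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Made-up per-token log-probabilities for the same text at two checkpoints
float_log_probs = [-1.2, -0.8, -1.0, -1.0]
quantized_log_probs = [-1.3, -0.9, -1.1, -1.1]

ppl_float = perplexity(float_log_probs)
ppl_quant = perplexity(quantized_log_probs)
print(f"float: {ppl_float:.3f}  quantized: {ppl_quant:.3f}  delta: {ppl_quant - ppl_float:+.3f}")
```

Lower perplexity is better, so the size of the increase after quantization is a rough measure of how much model quality the quantization step cost.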

## Supported Models

The LLM pipeline currently supports:

- **Llama** — Meta Llama 3.x family (`meta-llama/Llama-3.2-3B-Instruct`, `meta-llama/Llama-3.2-1B-Instruct`, `meta-llama/Llama-3.1-8B-Instruct`)
- **Phi4** — Microsoft Phi-4 family

Each model family has dedicated:

- Module mappings for HTP-specific transformations
- Reauthoring logic for model architecture adaptation
- Default recipe YAML with tuned parameters

## Relationship to Gen AI Builder

The Pipeline API wraps the Gen AI Builder as one stage (`genai_builder`).
The Gen AI Builder remains the compilation engine; the pipeline adds:

- Automated model loading and preparation (upstream stages)
- Quantization recipe management
- Caching, checkpointing, and resume
- Evaluation and generation utilities
- A single YAML configuration surface

Users who only need compilation can continue using the Gen AI Builder directly.
The pipeline is for the full workflow: load → quantize → compile → execute.
Users can evaluate the pipeline after any stage to assess model quality at
different points in the optimization process.

Last Published: Jun 19, 2026
