# Pipeline Configuration

This guide documents the full YAML recipe schema and the per-stage
configuration reference, and links to the available quantization recipes.

## Full Annotated Recipe

    # ── Pipeline-level configuration ──────────────────────────────────
    model_id_or_path: meta-llama/Llama-3.2-3B-Instruct  # HF model ID or local path
    backend: HTP                    # Compilation backend (e.g., HTP)
    soc_details: "chipset:SM8850"   # Target SoC for compilation
    
    enable_observers: true          # Enable stage execution profiling
    enable_cache: true              # Enable output caching for resume
    cache_dir: ./pipeline_cache_dir # Directory for cached artifacts
    checkpoint: quantization        # Save checkpoint after this stage (optional)
    log_level: info                 # Logging level (debug, info, warning, error)
    
    # Cross-cutting generation/evaluation config (available to all stages)
    generator_config:
      sequence_length: 4073         # Decode sequence length
      context_length: 8273          # Prefill context length
    
    evaluator_config:
      context_length: 4096
      metrics:
        - name: perplexity
          dataset: wikitext
          display_name: PPL_wikitext
    
    # ── Per-stage configuration ────────────────────────────────────────
    stages:
      model_loader:
        model_reauthoring: true
        apply_default_adaptations: true
        execution_environment: gpu
    
        hf_tokenizer_kwargs:
          trust_remote_code: true
    
        hf_pretrained_kwargs:
          dtype: float32
          attn_implementation: eager
    
        model_config_overrides:
          num_logits_to_keep: 0
          transposed_key_cache: true
          return_new_key_value_only: true
          input_tokens_per_inference: 4073
    
      quantization:
        recipe_name: lpbq_seqmse
    
      genai_builder:
        native_kv: false
        weight_sharing: true
    
        transform_options:
          arn: [1, 128]
          context_length: [4096]
          split.split_embedding: true
          split.split_lm_head: true
    
        calibration_options:
          act_precision: 16
          bias_precision: 32
    
        compile_options:
          graphs.hvx_threads: 8
          graphs.vtcm_size_in_mb: 8
          graphs.optimization_type: 3
          devices.cores.perf_profile: burst

## Pipeline-Level Configuration Reference

| Key | Type | Description |
| --- | --- | --- |
| `model_id_or_path` | `str` | HuggingFace model ID or local filesystem path |
| `backend` | `str` | Compilation backend (e.g., `"HTP"`) |
| `soc_details` | `str` | Target SoC specification (e.g., `"chipset:SM8850"`) |
| `enable_cache` | `bool` | Enable stage output caching to disk (default: `false`) |
| `cache_dir` | `str` | Directory for cache artifacts (default: `"./workspace"`) |
| `checkpoint` | `str` | Stage name after which to save a checkpoint (optional) |
| `enable_observers` | `bool` | Enable stage observers for profiling (default: `false`) |
| `log_level` | `str` | Logging level: `debug`, `info`, `warning`, `error` |
| `generator_config` | `dict` | Cross-cutting generation settings (see below) |
| `evaluator_config` | `dict` | Evaluation configuration with metrics list |
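
As a rough illustration of how the defaults in the table above interact, here is a minimal sketch of the pipeline-level part of a recipe. It assumes that keys with a documented default may be omitted; this guide does not state which keys are strictly required, and a complete recipe would still need its `stages` section.

```yaml
# Minimal sketch: pipeline-level keys only (values reused from the annotated recipe).
model_id_or_path: meta-llama/Llama-3.2-3B-Instruct  # HF model ID or local path
backend: HTP
soc_details: "chipset:SM8850"
log_level: debug

# Assumption: omitted keys fall back to the defaults listed in the table:
#   enable_cache: false
#   cache_dir: "./workspace"
#   enable_observers: false
```

Setting `enable_cache: true` without `cache_dir` would then presumably write artifacts to `./workspace`, which is why the annotated recipe sets both keys explicitly.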

### GeneratorConfig

| Key | Type | Description |
| --- | --- | --- |
| `sequence_length` | `int` | Token decode sequence length |
| `context_length` | `int` | Prefill context length (max input tokens) |

### EvaluatorConfig

| Key | Type | Description |
| --- | --- | --- |
| `context_length` | `int` | Context length for evaluation pass |
| `output_dir` | `str` | Directory to write evaluation results (optional) |
| `metrics` | `list` | List of metric configurations, each with `name`, `dataset`, and `display_name` |
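
The optional `output_dir` key does not appear in the annotated recipe. The fragment below sketches where it would sit; the path `./eval_results` is a hypothetical value chosen for illustration, while the other values come from the recipe above.

```yaml
evaluator_config:
  context_length: 4096
  output_dir: ./eval_results      # hypothetical path, not taken from this guide
  metrics:
    - name: perplexity
      dataset: wikitext
      display_name: PPL_wikitext
```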

## Per-Stage Configuration

### model\_loader

| Key | Type | Description |
| --- | --- | --- |
| `model_reauthoring` | `bool` | Apply model-specific reauthoring for HTP compatibility |
| `apply_default_adaptations` | `bool` | Apply default model adaptations (linear→conv, etc.) |
| `execution_environment` | `str` | Execution environment: `"cpu"`, `"gpu"`, or `"device"` |
| `hf_tokenizer_kwargs` | `dict` | Keyword arguments passed to `AutoTokenizer.from_pretrained()` |
| `hf_pretrained_kwargs` | `dict` | Keyword arguments passed to `AutoModelForCausalLM.from_pretrained()` |
| `model_config_overrides` | `dict` | Fields to override on the HuggingFace model config |

### quantization

| Key | Type | Description |
| --- | --- | --- |
| `recipe_name` | `str` | Name of the quantization recipe to use (see recipes below) |
| `technique_kwargs` | `dict` | Override parameters for specific techniques within the recipe |
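
The annotated recipe sets only `recipe_name`. The fragment below sketches how `technique_kwargs` might be added alongside it. The nested shape (technique name mapped to its parameters) is an assumption, and `example_technique` and `example_parameter` are hypothetical placeholders; the actual technique and parameter names depend on the chosen recipe.

```yaml
stages:
  quantization:
    recipe_name: lpbq_seqmse
    technique_kwargs:
      example_technique:          # hypothetical technique name (placeholder)
        example_parameter: 1      # hypothetical parameter (placeholder)
```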

### genai\_builder

| Key | Type | Description |
| --- | --- | --- |
| `native_kv` | `bool` | Use native KV-cache implementation |
| `weight_sharing` | `bool` | Enable weight sharing across graphs |
| `transform_options` | `dict` | ONNX transform options (ARN, context length, split settings) |
| `calibration_options` | `dict` | Backend calibration options (precision settings) |
| `compile_options` | `dict` | Backend compile options (HVX threads, VTCM, optimization) |

## Quantization Recipes

For full details on all available quantization recipes, see
[Quantization Recipes](https://docs.qualcomm.com/doc/80-87189-2/topic/pipeline_quantization_recipes.html).

Last Published: Jun 19, 2026
