# Getting Started with Pipeline

This guide covers the two most common ways to use the pipeline: through the
HuggingFace interface (Level 1) and through the pipeline’s Python API with a
YAML recipe (Level 2).

## Level 1: HuggingFace User

This is the most familiar interface for HuggingFace users. Pass a
`quantization_config` to `QcAutoModelForCausalLM.from_pretrained()`, and
the HF quantizer lifecycle handles reauthoring, quantization, and export
automatically.

```python
from transformers import AutoTokenizer

from qairt.experimental.pipeline.torch.llm.loader.auto_classes import (
    QcAutoConfig,
    QcAutoModelForCausalLM,
)
from qairt.experimental.pipeline.torch.llm.quantization.techniques.hf.lpbq_seqmse_hf_quantizer import (
    LPBQSeqMSEHfConfig,
)

# 1. Build QC config
qc_config = QcAutoConfig.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    model_config_overrides={
        "return_new_key_value_only": True,
        "transposed_key_cache": True,
        "input_tokens_per_inference": 4073,
    },
)

# 2. Tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
tokenizer.model_max_length = 8273

# 3. Build quantization config
lpbq_seqmse_config = LPBQSeqMSEHfConfig(
    context_length=8273,
    sequence_length=4073,
    tokenizer=tokenizer,
    lpbq={
        "decompressed_bw": 8,
        "block_size": 64,
        "block_grouping": 1,
        "symmetric": True,
        "num_calibration_batches": 20,
    },
    seqmse={"num_batches": 20, "num_candidates": 30},
)

# 4. Load model — quantization runs automatically via HF lifecycle
model = QcAutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    qc_config=qc_config,
    quantization_config=lpbq_seqmse_config,
    device_map="cuda",
)

# 5. Save quantized model
model.save_pretrained("./output")
```
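The `lpbq` settings above quantize weights in blocks. As a rough illustration of what `block_size` and `symmetric` control, the sketch below implements plain symmetric block-wise quantization in pure Python: each run of `block_size` consecutive weights gets its own scale, and the integer codes are centered on zero. This is not the LPBQ implementation. The function names are hypothetical, the 4-bit weight bitwidth is an assumption, and the extra scale compression governed by `decompressed_bw` and `block_grouping` is not modeled.

```python
from typing import List, Tuple


def quantize_blockwise_symmetric(
    weights: List[float], block_size: int = 64, bitwidth: int = 4
) -> Tuple[List[int], List[float]]:
    """Quantize a flat weight row with one scale per block of `block_size`.

    Hypothetical illustration: `bitwidth=4` is an assumed weight bitwidth, and
    LPBQ's additional scale compression (`decompressed_bw`, `block_grouping`)
    is deliberately not modeled.
    """
    qmax = 2 ** (bitwidth - 1) - 1  # 7 for 4-bit signed codes
    qmin = -(2 ** (bitwidth - 1))  # -8 for 4-bit signed codes
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), block_size):
        block = weights[start : start + block_size]
        max_abs = max(abs(w) for w in block)
        # Symmetric: the range is centered on zero, so no zero-point is stored.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(qmin, min(qmax, round(w / scale))) for w in block)
    return codes, scales


def dequantize_blockwise(
    codes: List[int], scales: List[float], block_size: int = 64
) -> List[float]:
    """Reconstruct approximate weights: each code times its block's scale."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


row = [0.01 * i for i in range(-64, 64)]  # 128 weights -> 2 blocks of 64
codes, scales = quantize_blockwise_symmetric(row, block_size=64)
approx = dequantize_blockwise(codes, scales, block_size=64)
print(len(scales))  # 2
```

Smaller blocks track local weight ranges more closely at the cost of storing more scales, which is the trade-off `block_size` tunes.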

How it works: passing `quantization_config` to `from_pretrained()` triggers
the HF quantizer lifecycle:

1. `_process_model_before_weight_loading` — applies Qualcomm reauthoring
2. HF loads the weights into the model
3. `_process_model_after_weight_loading` — runs the quantization recipe

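The ordering matters because the quantization recipe calibrates on real weights (note `num_calibration_batches` in the config), so it can only run once the weights are loaded. The toy sketch below mimics the three steps with a hypothetical loader and quantizer. Only the two hook names come from the list above; the real `transformers` quantizer classes have more methods and different signatures.

```python
from typing import Dict, List


class ToyQuantizer:
    """Hypothetical stand-in for an HF quantizer; only the hook names mirror the docs."""

    def __init__(self, log: List[str]) -> None:
        self.log = log

    def _process_model_before_weight_loading(self, model: Dict[str, bool]) -> None:
        # Step 1: structural changes happen before any weights exist
        # (in the real pipeline, this is where Qualcomm reauthoring runs).
        assert not model["weights_loaded"]
        model["reauthored"] = True
        self.log.append("reauthor")

    def _process_model_after_weight_loading(self, model: Dict[str, bool]) -> None:
        # Step 3: weights are present, so calibration-based quantization can run.
        assert model["weights_loaded"], "quantization needs real weights"
        model["quantized"] = True
        self.log.append("quantize")


def toy_from_pretrained(quantizer: ToyQuantizer) -> Dict[str, bool]:
    """Hypothetical loader that calls the hooks in the documented order."""
    model = {"reauthored": False, "weights_loaded": False, "quantized": False}
    quantizer._process_model_before_weight_loading(model)
    model["weights_loaded"] = True  # Step 2: the loader fills in the weights
    quantizer.log.append("load_weights")
    quantizer._process_model_after_weight_loading(model)
    return model


log: List[str] = []
model = toy_from_pretrained(ToyQuantizer(log))
print(log)  # ['reauthor', 'load_weights', 'quantize']
```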
## Level 2: Novice Pipeline User

This level uses `LLMPipeline` with a YAML recipe. Three method calls run the
entire workflow: `from_pretrained()` loads the model, `construct()` quantizes
and compiles it, and `generate()` runs generation.

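```python
from qairt.experimental.pipeline.torch.llm.pipeline import LLMPipeline

# Load the model and attach the YAML recipe that configures each stage
pipe = LLMPipeline.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    recipe="llama32_recipe.yaml",
)

# Quantize and compile the model as configured by the recipe
pipe.construct()

# `device` is not defined in this snippet; set it to the target device
# you want to run generation on before calling generate().
result = pipe.generate("Hello, how are you?", device=device)
result.print()
```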

See [Pipeline Configuration](https://docs.qualcomm.com/doc/80-87189-2/topic/pipeline_configuration.html) for the full recipe YAML schema and
per-stage configuration reference.

## Next Steps

- [Pipeline Configuration](https://docs.qualcomm.com/doc/80-87189-2/topic/pipeline_configuration.html) — Learn the YAML recipe schema
- [Customizing the Pipeline](https://docs.qualcomm.com/doc/80-87189-2/topic/pipeline_customization.html) — Modify stage configs programmatically, add
custom stages, or inject custom dataloaders
- [Advanced Usage](https://docs.qualcomm.com/doc/80-87189-2/topic/pipeline_expert_usage.html) — Bypass the pipeline and work with building
blocks directly

Last Published: Jun 19, 2026
