# Advanced Usage

This guide is for advanced users who bypass the pipeline entirely and
orchestrate the building blocks manually. Doing so gives maximum control over
each step of the workflow.

    import torch
    from transformers import AutoTokenizer

    from qairt.experimental.pipeline.torch.common.adaptations.adapter import Adapter
    from qairt.experimental.pipeline.torch.llm.generation.generator import LLMGenerator
    from qairt.experimental.pipeline.torch.llm.loader.auto_classes import (
        QcAutoConfig,
        QcAutoModelForCausalLM,
    )
    from qairt.experimental.pipeline.torch.llm.quantization.recipes.defaults import LPBQ_SeqMSE_Recipe
    from qairt.experimental.pipeline.torch.llm.evaluation.evaluator import run_evaluation
    from qairt.gen_ai_api.gen_ai_builder_factory import GenAIBuilderFactory

## Step 1: Create Qualcomm Config

`QcAutoConfig.from_pretrained()` loads the HuggingFace config and wraps it
in a Qualcomm-specific config. The `model_config_overrides` dict sets
Qualcomm-specific attributes such as `transposed_key_cache`.

    qc_config = QcAutoConfig.from_pretrained(
        "meta-llama/Llama-3.2-3B-Instruct",
        model_config_overrides={
            "return_new_key_value_only": True,
            "transposed_key_cache": True,
            "input_tokens_per_inference": 4073,
        },
    )
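One way to picture `model_config_overrides` is as a dictionary merge on top of the values loaded from the HuggingFace config. The sketch below only illustrates that idea. `apply_overrides` and the base values are hypothetical, and nothing here reflects how `QcAutoConfig` is actually implemented.

```python
from typing import Any, Dict


def apply_overrides(base: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new config dict in which override values win over base values."""
    merged = dict(base)       # copy so the loaded values are not mutated
    merged.update(overrides)  # overrides add new keys or replace existing ones
    return merged


# Hypothetical stand-in for values read from the HuggingFace config.
hf_values = {"model_type": "llama", "use_cache": True}

qc_values = apply_overrides(
    hf_values,
    {
        "return_new_key_value_only": True,
        "transposed_key_cache": True,
        "input_tokens_per_inference": 4073,
    },
)
```

After the merge, `qc_values` carries both the original keys and the Qualcomm-specific ones, while `hf_values` is left unchanged.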

## Step 2: Load Tokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        "meta-llama/Llama-3.2-3B-Instruct", use_fast=True, trust_remote_code=True
    )
    tokenizer.model_max_length = 8273

## Step 3: Load Model with Qualcomm Re-authoring

Two approaches are available:

- `QcAutoModelForCausalLM`: model-agnostic, auto-dispatches to the correct
Qc model class based on the HuggingFace config
- `QcLlamaForCausalLM`: model-specific, direct Llama class (Llama only)

Both delegate to the same re-authoring pipeline. The example below uses
`QcAutoModelForCausalLM`.

    model = QcAutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-3B-Instruct",
        model_reauthoring=True,
        qc_config=qc_config,
        torch_dtype=torch.float32,
        attn_implementation="eager",
    )
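Auto-dispatch of this kind is commonly built as a registry keyed by the config's `model_type`. The following is a minimal sketch of that general pattern, not the actual `QcAutoModelForCausalLM` code. The registry, the `register` decorator, `resolve_model_class`, and the stand-in class are all hypothetical names.

```python
from typing import Callable, Dict, Type

# Hypothetical registry: model_type string -> model class.
_MODEL_REGISTRY: Dict[str, Type] = {}


def register(model_type: str) -> Callable[[Type], Type]:
    """Class decorator that records a class under a model_type key."""

    def decorator(cls: Type) -> Type:
        _MODEL_REGISTRY[model_type] = cls
        return cls

    return decorator


@register("llama")
class LlamaStandIn:
    """Stand-in for a model-specific class such as QcLlamaForCausalLM."""


def resolve_model_class(model_type: str) -> Type:
    """Look up the class for a model_type, failing loudly if it is unsupported."""
    try:
        return _MODEL_REGISTRY[model_type]
    except KeyError:
        supported = ", ".join(sorted(_MODEL_REGISTRY))
        raise ValueError(
            f"Unsupported model_type {model_type!r}; supported: {supported}"
        ) from None


selected = resolve_model_class("llama")
```

Using the model-specific class directly skips the lookup, which is why it only works for the architecture it was written for.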

## Step 4: Apply Backend Adaptations

The `Adapter` applies backend-specific module transformations (e.g.,
HTP-optimized attention, custom normalization).

    model = Adapter.apply_adaptations(model, backend="HTP")

## Step 5: Create Generator for Quantization

The `LLMGenerator` wraps the model for use during quantization calibration.

    generator = LLMGenerator(
        model=model,
        tokenizer=tokenizer,
        sequence_length=4073,
        context_length=8273,
        config=model.config,
    )

## Step 6: Run Quantization Recipe

Instantiate the recipe class directly for full control over parameters.
Alternatively, use `load_recipe("lpbq_seqmse")` for a config-driven approach.

    recipe = LPBQ_SeqMSE_Recipe()
    quant_result = recipe.apply(
        model=model,
        tokenizer=tokenizer,
        generator=generator,
        context_length=8273,
        sequence_length=4073,
    )
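The generator and the recipe receive the same `sequence_length` and `context_length` values, so it can help to define them once and reuse them. The sketch below is one possible way to do that. `LengthSettings` is a hypothetical helper, and the rule that `sequence_length` must not exceed `context_length` is an assumption rather than a documented constraint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LengthSettings:
    """Holds the two length values that several steps must agree on."""

    sequence_length: int
    context_length: int

    def __post_init__(self) -> None:
        if self.sequence_length <= 0 or self.context_length <= 0:
            raise ValueError("lengths must be positive")
        # Assumption: a single inference window cannot exceed the full context.
        if self.sequence_length > self.context_length:
            raise ValueError("sequence_length must not exceed context_length")


lengths = LengthSettings(sequence_length=4073, context_length=8273)

# Keyword arguments that could be unpacked into both calls, for example
# LLMGenerator(..., **shared_kwargs) and recipe.apply(..., **shared_kwargs).
shared_kwargs = {
    "sequence_length": lengths.sequence_length,
    "context_length": lengths.context_length,
}
```

Keeping the values in one frozen object makes a mismatch between steps harder to introduce by accident.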

## Step 7: Evaluate Quantized Model Quality

`run_evaluation()` computes metrics on the quantized model. The example
requests perplexity (`PPL`) on `wikitext`; lower perplexity indicates better
model quality after quantization.

    eval_results = run_evaluation(
        metrics_config=[{"name": "PPL", "dataset_name": "wikitext"}],
        model=quant_result.model,
        tokenizer=tokenizer,
        context_length=8273,
    )
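As background on the metric itself, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below computes it from a list of per-token log-probabilities. The `perplexity` helper and the numbers are made up for illustration; how `run_evaluation()` windows the dataset or aggregates scores internally is not specified here.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Made-up natural-log probabilities assigned to each reference token.
baseline = perplexity([math.log(0.5), math.log(0.25), math.log(0.5)])
degraded = perplexity([math.log(0.25), math.log(0.25), math.log(0.25)])
```

A model that assigns lower probability to the reference tokens, as `degraded` does here, ends up with the higher perplexity.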

## Step 8: Export Quantized Artifacts

    quant_result.export("./quantized", filename_prefix="model")
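Before handing the export directory to the builder, it can be worth confirming that the export actually wrote files. The sketch below is a small check for that. `list_exported_files` is a hypothetical helper, and it assumes the exported file names start with the `filename_prefix` value, which this guide does not confirm.

```python
from pathlib import Path
from typing import List


def list_exported_files(export_dir: str, filename_prefix: str) -> List[Path]:
    """Return the files in export_dir whose names start with filename_prefix."""
    root = Path(export_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"export directory not found: {root}")
    return sorted(
        path
        for path in root.iterdir()
        if path.is_file() and path.name.startswith(filename_prefix)
    )
```

Calling `list_exported_files("./quantized", "model")` after the export returns an empty list if nothing matching was written, which is an easy condition to check before starting a build.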

## Step 9: Build LLM Container

The `GenAIBuilderFactory` compiles the quantized model into an optimized
container for on-device inference.

    builder = GenAIBuilderFactory.create(
        pretrained_model_path="./quantized",
        backend_type="HTP",
        cache_root="./builder_cache",
    )
    container = builder.build()

## Step 10: Generate Text on Device

    # `device` must already refer to the target device;
    # creating it is not shown in this guide.
    executor = container.get_executor(device=device)
    result = executor.generate("Hello, how are you?")
    result.print()

## When to Use This Approach

This approach is for users who need to:

- Customize individual steps beyond what stage configs expose
- Insert custom logic between steps (e.g., custom model surgery)
- Use model-specific classes (e.g., `QcLlamaForCausalLM`) directly
- Integrate with external tooling at each checkpoint
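Inserting custom logic between steps, or calling external tooling at each checkpoint, can be organized with a small step runner that invokes a hook after every step. The sketch below shows that pattern with placeholder steps. `run_steps`, the step functions, and the string-valued state are hypothetical and stand in for the real calls from Steps 1 to 10.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

State = Dict[str, Any]
Step = Tuple[str, Callable[[State], State]]
Hook = Callable[[str, State], None]


def run_steps(steps: List[Step], state: State, after_step: Optional[Hook] = None) -> State:
    """Run each step in order and call the optional hook after every step."""
    for name, step in steps:
        state = step(state)
        if after_step is not None:
            after_step(name, state)
    return state


# Placeholder steps; real ones would wrap the calls shown in the steps above.
def load(state: State) -> State:
    return {**state, "model": "loaded"}


def surgery(state: State) -> State:
    # Custom logic inserted between loading and quantization.
    return {**state, "model": state["model"] + "+patched"}


def quantize(state: State) -> State:
    return {**state, "model": state["model"] + "+quantized"}


visited: List[str] = []
final_state = run_steps(
    [("load", load), ("surgery", surgery), ("quantize", quantize)],
    state={},
    after_step=lambda name, _state: visited.append(name),
)
```

The hook is the place to save a checkpoint, log metrics, or notify an external tool without touching the step functions themselves.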

For most users, the pipeline API ([Getting Started with Pipeline](https://docs.qualcomm.com/doc/80-87189-2/topic/pipeline_getting_started.html)) or
programmatic customization ([Customizing the Pipeline](https://docs.qualcomm.com/doc/80-87189-2/topic/pipeline_customization.html)) provides sufficient
control with less boilerplate.

Last Published: Jun 19, 2026
