# Speculative Decoding Tutorial

This tutorial demonstrates how to enable one of the three supported speculative decoding
methods (LADE, SSD, or Eaglet) when building a Large Language Model (LLM) for a
Snapdragon device.  It follows the same structure as the [LLM Inference on HTP](https://docs.qualcomm.com/doc/80-87189-2/topic/genai_builder.html#genai-builder) tutorial
but adds the speculative configuration step.

**Note:** A version of this tutorial will be available in the QAIRT SDK at
`examples/QAIRT/python/speculative_decoding_tutorial.py`.

## Configurations

- **Host OS**: Linux (x86\_64) with ADB (Android Debug Bridge) installed.
- **Target Device**: Snapdragon Android device (e.g., SM8750).
- **Processor**: Qualcomm NPU (HTP backend).
- **Model**: Meta Llama 3-8B-Instruct (quantized ONNX export).

## Step 1: Setup

We recommend a machine with at least 64 GB of RAM. If you have less, increase swap
memory to avoid out-of-memory failures. The workflow may take **about 40 minutes** on a
machine with less than 64 GB of RAM.
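
To check the host against this recommendation before starting a long build, a minimal sketch using only the standard library is shown here. It assumes a Linux host (as listed under Configurations); the helper name `total_ram_gib` is illustrative and not part of QAIRT.

```python
import os

RECOMMENDED_GIB = 64  # threshold taken from the recommendation above


def total_ram_gib() -> float:
    """Return the host's physical memory in GiB (POSIX systems such as Linux)."""
    page_size = os.sysconf("SC_PAGE_SIZE")    # bytes per memory page
    page_count = os.sysconf("SC_PHYS_PAGES")  # number of physical pages
    return page_size * page_count / 2**30


ram = total_ram_gib()
if ram < RECOMMENDED_GIB:
    print(f"Host reports {ram:.1f} GiB of RAM; consider adding swap before building.")
else:
    print(f"Host reports {ram:.1f} GiB of RAM.")
```

A machine sold as "64 GB" usually reports slightly less than 64 GiB here, so treat the comparison as a rough guide rather than a hard gate.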

    import os
    import json
    from pathlib import Path
    from typing import Optional, cast
    
    from qairt.api.configs.common import BackendType
    from qairt.api.configs.device import Device
    from qairt.gen_ai_api.builders.gen_ai_builder_htp import GenAIBuilderHTP
    from qairt.gen_ai_api.chat.chat_templates import NullChatTemplate
    from qairt.gen_ai_api.chat.prompt_objects import PromptObject
    from qairt.gen_ai_api.configs.eaglet_config import EagletBuilderConfig
    from qairt.gen_ai_api.configs.lade_config import LadeBuilderConfig
    from qairt.gen_ai_api.configs.ssd_config import SsdBuilderConfig
    from qairt.gen_ai_api.containers.gen_ai_container import GenAIContainer
    from qairt.gen_ai_api.executors.t2t_executor import T2TExecutor
    from qairt.gen_ai_api.gen_ai_builder_factory import GenAIBuilderFactory
    from qairt.utils.loggers import get_logger
    from qti.aisw.tools.core.utilities.devices.api.device_definitions import DevicePlatformType
    
    logger = get_logger(__name__)

**Tip:** Set the environment variable `QAIRT_TMP_DIR` to define an alternative temporary
directory path.  This prevents the default system temp directory from filling up.

    os.environ["QAIRT_TMP_DIR"] = "./llm_scratch/"
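
The tutorial does not say whether QAIRT creates a missing scratch directory or how it resolves a relative path. A defensive sketch that creates the directory and stores an absolute path could look like the following; the helper name `use_scratch_dir` is hypothetical.

```python
import os
from pathlib import Path


def use_scratch_dir(path: str) -> Path:
    """Create `path` if needed and point QAIRT_TMP_DIR at its absolute location."""
    scratch = Path(path).expanduser().resolve()
    scratch.mkdir(parents=True, exist_ok=True)
    # Assumption: an absolute path is safe regardless of the working directory
    # QAIRT tools are later launched from.
    os.environ["QAIRT_TMP_DIR"] = str(scratch)
    return scratch


scratch = use_scratch_dir("./llm_scratch/")
print(f"Temporary files will go to {scratch}")
```

Setting the variable before the QAIRT builder is created is the conservative choice, since a tool may read it only once.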

**Tip:** Define a cache root to reuse intermediate artifacts between builds.

    CACHE_ROOT = "./llama3_cache"

## Step 2: Define Model Export Path and Speculative Settings

    # Path to the exported model directory (replace <your_path> with the actual path)
    MODEL_EXPORTS = "./llama_3_8b/<your_path>"
    
    # Choose one of: "lade", "ssd", "eaglet"
    SPECULATIVE_TYPE = "lade"
    
    # Optional: path to a JSON file containing a full speculative config
    SPECULATIVE_CONFIG_PATH: Optional[str] = None  # e.g. "./my_speculative_config.json"
    
    # Prompt to generate
    PROMPT = "Briefly explain speculative decoding and its benefits."
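
The later steps set the speculative configuration directly and do not show how `SPECULATIVE_TYPE` or `SPECULATIVE_CONFIG_PATH` are consumed. One way to validate them up front is sketched below. The JSON layout (a flat object of keyword arguments) and the helper `load_speculative_settings` are assumptions, not a documented QAIRT schema.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional

SUPPORTED_TYPES = ("lade", "ssd", "eaglet")


def load_speculative_settings(
    spec_type: str, config_path: Optional[str] = None
) -> Dict[str, Any]:
    """Validate the method name and load optional parameters from a JSON file."""
    kind = spec_type.strip().lower()
    if kind not in SUPPORTED_TYPES:
        raise ValueError(f"Unknown speculative type {spec_type!r}; expected one of {SUPPORTED_TYPES}")

    params: Dict[str, Any] = {}
    if config_path is not None:
        with Path(config_path).open(encoding="utf-8") as fh:
            params = json.load(fh)
        # Assumed layout: a flat JSON object such as {"window": 8, "ngram": 5}.
        if not isinstance(params, dict):
            raise ValueError("Speculative config file must contain a JSON object")
    return {"type": kind, "params": params}


settings = load_speculative_settings("lade")
print(settings)  # {'type': 'lade', 'params': {}}
```

The returned `params` dictionary could then be unpacked into whichever builder config class matches `type`, provided the keys match that class's arguments.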

## Step 3: Build the GenAIBuilder with Speculative Decoding Enabled

    cache_root = Path(CACHE_ROOT)
    cache_root.mkdir(parents=True, exist_ok=True)
    
    builder = GenAIBuilderFactory.create(
        Path(MODEL_EXPORTS),
        BackendType.HTP,
        cache_root=cache_root,
    )
    
    # Target device - example uses Snapdragon SM8750
    android_serial = os.getenv("ANDROID_SERIAL")
    android_hostname = os.getenv("ANDROID_HOSTNAME")
    device_id = f"{android_serial}@{android_hostname}" if android_hostname else android_serial
    android_device = Device(identifier=device_id, type=DevicePlatformType.ANDROID)
    
    soc_details = f"chipset:{android_device.get_chipset()}"
    builder.set_targets([soc_details])
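
In the snippet above, an unset `ANDROID_SERIAL` yields `None` as the identifier, or the string `"None@<host>"` when only `ANDROID_HOSTNAME` is set. The tutorial does not say how `Device` treats a missing identifier, so the following sketch simply fails early instead. The helper `build_device_id` is illustrative and not part of QAIRT.

```python
import os
from typing import Mapping


def build_device_id(env: Mapping[str, str] = os.environ) -> str:
    """Compose "<serial>" or "<serial>@<host>" from ADB-style environment variables."""
    serial = env.get("ANDROID_SERIAL")
    host = env.get("ANDROID_HOSTNAME")
    if not serial:
        # `adb devices` lists the serial numbers of connected devices.
        raise RuntimeError("ANDROID_SERIAL is not set; run `adb devices` to find it.")
    return f"{serial}@{host}" if host else serial


print(build_device_id({"ANDROID_SERIAL": "abc123"}))  # abc123
print(build_device_id({"ANDROID_SERIAL": "abc123", "ANDROID_HOSTNAME": "lab-host"}))  # abc123@lab-host
```

Passing the environment as a mapping keeps the helper easy to test without touching the real process environment.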

### Option 1: LADE

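*Look-ahead decoding*: LADE drafts several future tokens per step as n-gram candidates and verifies them against the model's own predictions, so it needs no separate draft model. Accepting more than one token per step reduces latency while preserving generation quality.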

    # LADE configuration (used in this tutorial)
    builder.speculative_config = LadeBuilderConfig(
        window=8,
        ngram=5,
        gcap=8,
    )
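
To give some intuition for the `ngram` and `gcap` values, the toy sketch below indexes n-grams by their first token and returns a capped list of candidate continuations. It assumes these parameters correspond to the n-gram size and the cap on candidates described for lookahead decoding; the tutorial does not define them. As a simplification, the pool is filled from a fixed token trace rather than from a real lookahead branch, and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Continuation = Tuple[int, ...]


def build_ngram_pool(tokens: Sequence[int], ngram: int) -> Dict[int, List[Continuation]]:
    """Index every n-gram in `tokens` by its first token, keeping insertion order."""
    pool: Dict[int, List[Continuation]] = defaultdict(list)
    for i in range(len(tokens) - ngram + 1):
        first, rest = tokens[i], tuple(tokens[i + 1 : i + ngram])
        if rest not in pool[first]:
            pool[first].append(rest)
    return dict(pool)


def candidates(pool: Dict[int, List[Continuation]], last_token: int, gcap: int) -> List[Continuation]:
    """Return at most `gcap` continuations for `last_token`, newest first."""
    if gcap <= 0:
        return []
    return pool.get(last_token, [])[-gcap:][::-1]


trace = [1, 2, 3, 1, 2, 4, 1, 2, 3]
pool = build_ngram_pool(trace, ngram=3)
print(candidates(pool, last_token=1, gcap=8))  # [(2, 4), (2, 3)]
print(candidates(pool, last_token=1, gcap=1))  # [(2, 4)]
print(candidates(pool, last_token=9, gcap=8))  # []
```

A larger cap gives the verifier more candidates to check per step, at the cost of more compute per step.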

### Option 2: SSD

*Self-speculative decoding*: SSD generates speculative tokens using the model itself as a draft, improving efficiency without external models.

    # builder.speculative_config = SsdBuilderConfig(
    #     forecast_token_count=4,
    #     forecast_prefix=16,
    #     branches=[4, 4],
    #     forecast_prefix_name="", # deprecated; will be removed in qairt 2.34
    #     ssd_tensor_file="./ssd_tensor.pt",
    #     n_streams=1,
    # )
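
Whatever produces the draft tokens, speculative methods generally rely on a verification step in which the target model keeps the longest draft prefix it agrees with. The sketch below shows that step for greedy decoding with a toy stand-in model. It is a conceptual illustration, not QAIRT's implementation, and it checks tokens one by one where a real runtime would score the whole draft in a single forward pass.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]


def verify_draft(context: Sequence[int], draft: Sequence[int], target_next: NextToken) -> List[int]:
    """Return the tokens to append after checking `draft` against the target model."""
    accepted: List[int] = []
    running = list(context)
    for token in draft:
        expected = target_next(running)
        if token != expected:
            # First disagreement: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        running.append(token)
    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(running))
    return accepted


def toy_target(tokens: Sequence[int]) -> int:
    """Hypothetical 'model' that always predicts the previous token plus one, modulo 10."""
    return (tokens[-1] + 1) % 10


print(verify_draft([1, 2, 3], [4, 5, 9, 0], toy_target))  # [4, 5, 6]
print(verify_draft([1, 2, 3], [4, 5, 6], toy_target))     # [4, 5, 6, 7]
```

Because a rejected draft still yields the target's own token, the output matches plain greedy decoding; drafting only changes how many tokens each step produces.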

### Option 3: Eaglet

*Modification of the EAGLE algorithm*: Eaglet adapts the Extrapolation Algorithm for Greater Language-model Efficiency (EAGLE) to generate speculative tokens, improving decoding speed while maintaining quality.

    # builder.speculative_config = EagletBuilderConfig(
    #     draft_len=6,
    #     n_branches=6,
    #     max_tokens_target_can_evaluate=32,
    #     draft_kv_cache=True,
    #     draft_model_path="./draft_model.onnx",
    #     draft_token_map="./draft_token_map.json",
    # )
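
As a rough guide when choosing a draft length, the standard analysis of chain-style speculative decoding gives the expected number of tokens produced per target step as (1 - a^(k+1)) / (1 - a), where a is the per-token acceptance rate and k the draft length. The sketch below evaluates it for k = 6. It assumes independent acceptances and a single draft chain, so it ignores branching (`n_branches`) and is not a measurement of Eaglet; mapping k to `draft_len` is also an assumption.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target step for a single draft chain."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        # Limit of the formula: all drafted tokens plus one from the target.
        return float(draft_len + 1)
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


for rate in (0.6, 0.8, 0.9):
    print(f"acceptance={rate:.1f} -> {expected_tokens_per_step(rate, 6):.2f} tokens/step")
```

The curve flattens quickly at low acceptance rates, which is why a longer draft only pays off when the draft model agrees with the target most of the time.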

## Step 4: Build the Container and Run Inference

    # Build the container (returns a GenAIContainer)
    container: GenAIContainer = builder.build()
    
    # Create an executor for the Android device
    executor = cast(T2TExecutor, container.get_executor(android_device))
    
    # Generate text
    result = executor.generate(PROMPT)
    
    # Output
    print("\n--- Generated Text ---")
    print(result.generated_text)
    
    if result.metrics:
        print("\n--- Metrics ---")
        print(result.metrics)
    
    # Clean up device artifacts
    executor.clean_environment()
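
If generation raises an exception, the cleanup call at the end of the snippet above is never reached. A small context manager can guarantee cleanup either way. The sketch below demonstrates the pattern with a stub class; `StubExecutor` and `cleaned_up` are hypothetical, and only the `clean_environment()` method name is taken from the tutorial.

```python
from contextlib import contextmanager
from typing import Any, Iterator


class StubExecutor:
    """Stand-in for the tutorial's executor so the pattern runs without a device."""

    def __init__(self) -> None:
        self.cleaned = False

    def generate(self, prompt: str) -> str:
        if not prompt:
            raise ValueError("empty prompt")
        return f"echo: {prompt}"

    def clean_environment(self) -> None:
        self.cleaned = True


@contextmanager
def cleaned_up(executor: Any) -> Iterator[Any]:
    """Yield the executor and always call clean_environment() afterwards."""
    try:
        yield executor
    finally:
        executor.clean_environment()


stub = StubExecutor()
try:
    with cleaned_up(stub) as ex:
        ex.generate("")  # fails on purpose to show cleanup still runs
except ValueError as err:
    print(f"generation failed: {err}")
print(stub.cleaned)  # True
```

With a real executor, the body of the `with` block would hold the `generate` call and the result printing.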

## Step 5: Save the Container (Optional)

    container.save("./speculative_container", exist_ok=True)
    logger.info("Speculative decoding tutorial completed successfully.")

Last Published: May 26, 2026
