# Native inference on OE-Linux

This tutorial demonstrates how to execute models natively on OE-Linux devices using the QAIRT Python API.
Native inference runs directly on the OE-Linux device without requiring a remote connection, making it ideal
for edge deployment scenarios.

Note

This tutorial assumes you are running the Python script directly on an OE-Linux device.
For remote inference from a host machine, see the [Mobilenet V2 Remote Inference on OE-Linux](https://docs.qualcomm.com/doc/80-87189-2/topic/remote_inference.html#remote-inference-oelinux) tutorial.

The parameters for this tutorial are as follows:

- Model: InceptionV3 (quantized DLC format)
- Configurations:

    - Host OS: OE-Linux
    - Target device: Local OE-Linux device (Linux-aarch64 platforms, e.g., [QCS6490](https://www.qualcomm.com/internet-of-things/products/q6-series))
    - Processor: Qualcomm NPU
    - Backend: HTP

## Prerequisites

Before starting this tutorial, ensure you have:

1. A quantized model in DLC format. You can either:

    - Use the quantized model created in the [Mobilenet V2 Remote Inference on OE-Linux](https://docs.qualcomm.com/doc/80-87189-2/topic/remote_inference.html#remote-inference-oelinux) tutorial (`mobilenet_v2_quantized.dlc`), or
    - Create your own quantized model by following the example at `examples/QAIRT/python/quantization_tutorial.py` in the QAIRT SDK
2. Input data for inference (either individual inputs or an input list file)
3. An OE-Linux device that meets one of the following conditions:

    - It is flashed with an image that includes the QAIRT Python APIs and the required libraries from the QAIRT SDK, or
    - It has access to `qairt-dev` (see the [setup instructions](https://docs.qualcomm.com/bundle/publicresource/topics/80-87189-2/setup.html))

## Step 1. Setup

Import the necessary libraries. This tutorial uses the QAIRT Python API to load and execute models natively.

    import os
    import platform
    import time
    from pathlib import Path

    import numpy as np
    import qairt
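
Before importing `qairt`, it can help to confirm that the script really is running on the kind of device this tutorial targets. The following is a minimal sketch using only the Python standard library; the helper name `check_environment` is hypothetical and not part of the QAIRT API, and the check only reports whether a `qairt` module can be found, not whether the HTP backend is usable.

```python
import importlib.util
import platform


def check_environment(required_module: str = "qairt") -> dict:
    """Collect basic facts about the current machine before running inference."""
    return {
        "system": platform.system(),    # e.g. "Linux"
        "machine": platform.machine(),  # e.g. "aarch64"
        # find_spec returns None when a top-level module is not installed
        "module_available": importlib.util.find_spec(required_module) is not None,
    }


env = check_environment()
print(env)
if env["system"] != "Linux" or env["machine"] != "aarch64":
    print("Warning: this does not look like a Linux-aarch64 device.")
if not env["module_available"]:
    print("Warning: the 'qairt' module was not found; see the Prerequisites section.")
```

Running this first gives a clearer message than an `ImportError` when the device image or Python environment is missing the QAIRT packages.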

## Step 2. Prepare model and inputs

Set up paths to your quantized DLC model and input data. You can use either individual input arrays or an input list file.

    # Set paths to your model and inputs
    model_path_dlc = Path("path/to/your/model_quantized.dlc")
    input_list_path = "path/to/your/input_list.txt"

    # Set output directory
    output_dir = Path("output")
    output_dir.mkdir(exist_ok=True)
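
If you do not yet have an input list file, the sketch below shows one way to create a placeholder. It rests on an assumption this tutorial does not state: that the input list is a plain text file with one path to a headerless float32 `.raw` tensor per line, as used by other Qualcomm inference tools. The helper `write_input_list` and the file names are hypothetical; check the QAIRT documentation for the exact format your SDK version expects.

```python
import tempfile
from pathlib import Path

import numpy as np


def write_input_list(data_dir: Path, num_inputs: int, shape: tuple) -> Path:
    """Write random float32 tensors as .raw files and list them, one per line."""
    data_dir.mkdir(parents=True, exist_ok=True)
    rng = np.random.default_rng(seed=0)
    lines = []
    for i in range(num_inputs):
        raw_path = data_dir / f"input_{i}.raw"
        # tofile() writes the array's bytes with no header, in C order
        rng.random(shape, dtype=np.float32).tofile(raw_path)
        lines.append(str(raw_path.resolve()))
    list_path = data_dir / "input_list.txt"
    # Assumed format: one absolute .raw path per line
    list_path.write_text("\n".join(lines) + "\n")
    return list_path


work_dir = Path(tempfile.mkdtemp())
input_list_file = write_input_list(work_dir, num_inputs=3, shape=(1, 299, 299, 3))
print(input_list_file.read_text())
```

Random tensors are only useful for smoke-testing the pipeline; for meaningful outputs, replace them with preprocessed data for your model.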

## Step 3. Load the model

Load the quantized DLC model using the `qairt.load()` function. This creates a model object that can be executed directly.

    loaded_model_dlc = qairt.load(model_path_dlc)

    # See model information
    print("Model DLC information:")
    print(loaded_model_dlc.module.info)

The model information will display details about the model’s inputs, outputs, and configuration.

## Step 4. Single input execution

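Execute the model with a single input. You can generate random input data or use your own preprocessed data. The example shape `(1, 299, 299, 3)` is the InceptionV3 input size in NHWC layout; check the model information printed in Step 3 and adjust the shape if your model expects something different.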

    def generate_input_data():
        # Generate random data matching your model's input shape
        # Adjust dimensions based on your model requirements
        return np.random.rand(1, 299, 299, 3).astype(np.float32)

    # Execute the model with a single input
    print("Executing model with single input:")
    exec_result = loaded_model_dlc(inputs=generate_input_data(), backend="HTP")
    print(exec_result)
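
When you use your own data instead of random values, the array must match the model's expected shape and dtype. The sketch below converts an already-resized 8-bit image into a batched float32 tensor and validates its shape. The helper `prepare_input` is hypothetical, and the `[0, 1]` scaling is an assumption: use whatever normalization your model was trained and quantized with.

```python
import numpy as np


def prepare_input(image: np.ndarray, expected_shape=(1, 299, 299, 3)) -> np.ndarray:
    """Turn an HWC uint8 image into a batched float32 tensor scaled to [0, 1]."""
    if image.dtype != np.uint8:
        raise TypeError(f"expected uint8 pixels, got {image.dtype}")
    tensor = image.astype(np.float32) / 255.0  # assumed normalization
    tensor = tensor[np.newaxis, ...]           # add the batch dimension
    if tensor.shape != tuple(expected_shape):
        raise ValueError(f"got shape {tensor.shape}, expected {tuple(expected_shape)}")
    return tensor


# Stand-in for a decoded, already-resized image
fake_image = np.random.randint(0, 256, size=(299, 299, 3), dtype=np.uint8)
model_input = prepare_input(fake_image)
print(model_input.shape, model_input.dtype)
```

Failing early on a shape mismatch in Python is usually easier to diagnose than an error raised from inside the backend.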

## Step 5. Batch execution with input list

To process multiple inputs, pass an input list file instead of a single array. This is generally more efficient than invoking the model separately for each input.

    # Execute using input list
    print("Execution result for running DLC model using input list:")
    exec_result_batch = loaded_model_dlc(inputs=input_list_path, backend="HTP")
    print(exec_result_batch)
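
A missing file referenced by the input list is an easy mistake to make, so it can be worth checking the list before starting a batch run. The sketch below is a hypothetical helper, not part of the QAIRT API. It assumes each line holds whitespace-separated entries of the form `path` or `name:=path`, that lines starting with `#` are not input paths, and that paths contain no spaces; adapt it if your input list format differs.

```python
import tempfile
from pathlib import Path


def missing_inputs(list_path: Path) -> list:
    """Return the paths named in an input list file that do not exist on disk."""
    missing = []
    for line in list_path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and lines assumed not to hold inputs
        for token in line.split():
            # Accept both "path" and "name:=path" entries (assumed format)
            path = Path(token.split(":=", 1)[-1])
            if not path.exists():
                missing.append(str(path))
    return missing


# Demo with one existing file and one missing file
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "a.raw").write_bytes(b"\x00" * 16)
demo_list = demo_dir / "input_list.txt"
demo_list.write_text(f"{demo_dir / 'a.raw'}\n{demo_dir / 'b.raw'}\n")
print(missing_inputs(demo_list))
```

An empty result means every referenced file was found; relative paths are resolved against the current working directory.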

## Step 6. Stream execution (initialize once, run multiple times)

For optimal performance when running multiple inferences, use stream execution. This initializes the backend once
and reuses it for multiple runs, significantly reducing overhead.

Note

Stream execution has three phases:

- **Initialize**: Set up the backend once (`initialize()`)
- **Run**: Execute multiple inferences without re-initialization
- **Destroy**: Clean up resources when done (`destroy()`)

The following example initializes the backend once, runs three inferences, and then releases the backend:

    print("\nStream execution example:")
    print("Initializing backend once and running model multiple times")

    # Initialize backend once
    loaded_model_dlc.initialize(backend="HTP")

    # Run multiple inferences using the initialized backend
    for i in range(3):
        exec_result_stream = loaded_model_dlc(inputs=generate_input_data())
        print(exec_result_stream)

    # Clean up backend resources
    loaded_model_dlc.destroy()
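
If an inference raises an exception between `initialize()` and `destroy()`, the cleanup call is skipped. One way to guard against that is to wrap the lifecycle in a context manager, sketched below. `stream_session` and `FakeModel` are hypothetical names: `FakeModel` only imitates the `initialize()`, call, and `destroy()` methods used in this tutorial so that the sketch runs without a device, and it says nothing about how the real backend behaves.

```python
import time
from contextlib import contextmanager


@contextmanager
def stream_session(model, backend="HTP"):
    """Initialize the backend on entry and always destroy it on exit."""
    model.initialize(backend=backend)
    try:
        yield model
    finally:
        model.destroy()


class FakeModel:
    """Hypothetical stand-in so the sketch runs without a device or qairt."""

    def __init__(self):
        self.initialized = False
        self.destroy_calls = 0

    def initialize(self, backend):
        self.initialized = True

    def destroy(self):
        self.initialized = False
        self.destroy_calls += 1

    def __call__(self, inputs):
        if not self.initialized:
            raise RuntimeError("backend not initialized")
        return {"echo": inputs}


fake = FakeModel()
latencies_ms = []
with stream_session(fake) as session:
    for i in range(3):
        start = time.perf_counter()
        session(inputs=i)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
print(f"runs: {len(latencies_ms)}, destroyed once: {fake.destroy_calls == 1}")
```

On a device, you would pass the loaded model in place of `fake`; the per-run timings then give a rough view of steady-state latency without the initialization cost.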

## Next Steps

Note

This tutorial demonstrated **native execution** where the Python script runs directly on the OE-Linux device.
If you want to control inference remotely from a host machine, see the [Mobilenet V2 Remote Inference on OE-Linux](https://docs.qualcomm.com/doc/80-87189-2/topic/remote_inference.html#remote-inference-oelinux) tutorial.

Last Published: May 26, 2026
