# Prepare ONNX models

There are two options for model preparation:

- Download a pre-compiled model from Qualcomm AI Hub.
- Prepare the model using the Qualcomm AI Runtime (QAIRT) SDK (advanced workflow).

For a quick out-of-the-box experience, download the model from AI Hub.
This document uses the Inception V3 model from AI Hub for deployment.

## Prerequisites

1. Set up SSH and connect the target device to the network. For detailed instructions, see:

    Sign in using [SSH](https://docs.qualcomm.com/doc/80-80022-254/topic/how_to.html#use-ssh) for Qualcomm Linux.

    Note

    If SSH is already set up and Wi-Fi is connected, skip this step.
2. Sign in to the target device using SSH:

        ssh root@<IP_ADDRESS_OF_THE_TARGET_DEVICE>

## Download a pre-compiled model from AI Hub

Qualcomm AI Hub hosts pre-compiled, quantized ONNX models that you can download
and run with ONNX Runtime through the
[QNN execution provider (QNN EP)](https://docs.qualcomm.com/doc/80-62010-1/topic/ort-qnn-ep.html#ort-qnn-ep).

For more details, see [Qualcomm AI Hub](https://aihub.qualcomm.com/iot/models/inception_v3).

## Prepare a quantized ONNX model using Qualcomm AI Runtime SDK

Run the following steps on your Ubuntu host computer to prepare the quantized ONNX model.

1. Install the QAIRT SDK.
2. Download an FP32 model from AI Hub.
3. Compile and quantize the model.
4. Generate the quantized ONNX model from the QNN context binary.

### Install the QAIRT SDK

See [Install Qualcomm AI Runtime SDK](https://docs.qualcomm.com/doc/80-80022-15B/topic/qairt-install.html).

### Download an FP32 model from AI Hub

1. Create an `ort_workspace` working directory.

        mkdir ~/ort_workspace
2. Go to the `ort_workspace` directory.

        cd ~/ort_workspace/
3. Download the Inception V3 model from AI Hub.

        wget https://huggingface.co/qualcomm/Inception-v3/resolve/v0.45.0/Inception-v3_float.onnx.zip && unzip Inception-v3_float.onnx.zip
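
The conversion step later needs the path of the extracted `.onnx` file. Because the layout of the archive is not spelled out here, the following minimal sketch searches the workspace for ONNX files. The helper name `find_onnx_models` is hypothetical, and the workspace location is assumed to be the `~/ort_workspace` directory created above.

```python
from pathlib import Path


def find_onnx_models(workspace: Path) -> list[Path]:
    """Return every .onnx file under workspace, sorted for stable output."""
    return sorted(p for p in workspace.rglob("*.onnx") if p.is_file())


if __name__ == "__main__":
    # Assumption: the archive was extracted inside ~/ort_workspace.
    workspace = Path.home() / "ort_workspace"
    if workspace.is_dir():
        for model_path in find_onnx_models(workspace):
            print(model_path)
    else:
        print(f"workspace not found: {workspace}")
```

Use one of the printed paths as the model path in the conversion command.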

### Compile and quantize the model

1. Prepare a representative data set.

    Model quantization requires a representative data set.
    The script below uses random data as a stand-in; for meaningful accuracy, calibrate with real input samples.

    Copy and run the following Python script from `~/ort_workspace` to generate random `.raw` input files for quantization.

        import os

        import numpy as np

        # Directory that receives the random calibration inputs.
        BASE_PATH = '/tmp/RandomInputsForInceptionV3'
        NUM_IMAGES = 10

        os.makedirs(BASE_PATH, exist_ok=True)

        input_path_list = []
        for img in range(NUM_IMAGES):
            filename = f'input_{img}.raw'
            # One float32 tensor of shape (1, 224, 224, 3) with values in [0, 1).
            random_tensor = np.random.random((1, 224, 224, 3)).astype(np.float32)
            filepath = os.path.join(BASE_PATH, filename)
            random_tensor.tofile(filepath)
            input_path_list.append(filepath)

        # List every generated file, one path per line, for the converter.
        with open('input_list.txt', 'w') as f:
            for path in input_path_list:
                f.write(path + '\n')

    The script generates 10 sample `.raw` files in `/tmp/RandomInputsForInceptionV3/`.

    It also writes `input_list.txt`, which contains the paths to all generated inputs, to the current directory.
2. Convert and quantize the model using `qnn-onnx-converter`.

    Pass a pretrained FP32 model (whether exported from PyTorch, ONNX, TensorFlow, or LiteRT) to the
    appropriate QNN converter tool (`qnn-<FRAMEWORK>-converter`). The converter translates the model into
    a QNN graph expressed as a high-level, human-readable C++ representation.

    Note

    For execution on the HTP, the model must be quantized.

    Perform quantization at the same time as conversion by providing a calibration data set.
    The converter uses the calibration data for static quantization, which allows the model to be
    optimized for HTP acceleration.

    The `--input_network` path in the following command is an example; point it at the `.onnx` file extracted in the previous section.

        ${QAIRT_SDK_ROOT}/bin/x86_64-linux-clang/qnn-onnx-converter \
            --input_network ~/ort_workspace/job_j5m3yoe9g_optimized_onnx/model.onnx \
            --output_path ~/ort_workspace/inception_v3_quantized.cpp \
            --input_list ~/ort_workspace/input_list.txt \
            --input_dim "image_tensor" 1,3,224,224
3. Generate the shared object (`.so`).

    Run the following command to compile the C++ graph generated during conversion and quantization into a shared object (`.so`).

        ${QAIRT_SDK_ROOT}/bin/x86_64-linux-clang/qnn-model-lib-generator \
            -c ~/ort_workspace/inception_v3_quantized.cpp \
            -b ~/ort_workspace/inception_v3_quantized.bin \
            -o ~/ort_workspace/libs/ \
            -t x86_64-linux-clang

    See [Qualcomm AI Runtime (QAIRT) SDK - Qualcomm AI Engine Direct](https://docs.qualcomm.com/nav/home/index_QNN.html?product=1601111740009302) for more information.
4. Generate the QNN context binary.

    To run the model on the HTP of a target device, you must generate a serialized context binary.

    To generate a context specific to an SoC, pass a backend configuration file and a backend extensions
    configuration file, which specify details such as the graph name and VTCM size, to `qnn-context-binary-generator`.

    1. Create a `backend_config.json` backend configuration file.

        The following examples show how to create a backend configuration file (`backend_config.json`) with the mandatory options.

        QCS6490/QCM6490:

            {
                "graphs": [
                    {
                        "graph_names": [
                            "inception_v3_quantized"
                        ],
                        "vtcm_mb": 2
                    }
                ],
                "devices": [
                    {
                        "htp_arch": "v68"
                    }
                ]
            }

        QCS9100:

            {
                "graphs": [
                    {
                        "graph_names": [
                            "inception_v3_quantized"
                        ],
                        "vtcm_mb": 8
                    }
                ],
                "devices": [
                    {
                        "htp_arch": "v73"
                    }
                ]
            }
    2. Create the `backend_extension.json` file with the following contents.

        Replace `$QAIRT_SDK_ROOT` with the absolute path to the SDK, and set `config_file_path` to the absolute path of `backend_config.json`.

            {
                "backend_extensions": {
                    "shared_library_path": "$QAIRT_SDK_ROOT/lib/x86_64-linux-clang/libQnnHtpNetRunExtensions.so",
                    "config_file_path": "path_to_config_file - backend_config.json"
                }
            }
    3. Create a context binary using `qnn-context-binary-generator`.

        To generate the context, make sure that `--config_file` points to the `backend_extension.json` file
        that you created above, and then run the following command.

            "${QNN_SDK_ROOT}/bin/x86_64-linux-clang/qnn-context-binary-generator" \
                --backend "${QNN_SDK_ROOT}/lib/x86_64-linux-clang/libQnnHtp.so" \
                --model ~/ort_workspace/libs/x86_64-linux-clang/libinception_v3_quantized.so \
                --binary_file libinception_v3_quantized.serialized \
                --config_file backend_extension.json
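
JSON files do not expand shell variables such as `$QAIRT_SDK_ROOT`, which is why both paths must be absolute. As a sketch of the two configuration steps above, the following script writes `backend_config.json` and `backend_extension.json` with resolved paths. The function names are hypothetical; the keys and values are the ones shown above, with the QCS6490/QCM6490 values (`vtcm_mb` 2, `htp_arch` v68) as defaults. It assumes that `QAIRT_SDK_ROOT` is set in the environment.

```python
import json
import os
from pathlib import Path


def build_backend_config(graph_name: str, vtcm_mb: int, htp_arch: str) -> dict:
    """Mirror the backend_config.json structure shown above."""
    return {
        "graphs": [{"graph_names": [graph_name], "vtcm_mb": vtcm_mb}],
        "devices": [{"htp_arch": htp_arch}],
    }


def build_backend_extension(sdk_root: Path, config_path: Path) -> dict:
    """Mirror backend_extension.json, using absolute paths."""
    lib = sdk_root / "lib" / "x86_64-linux-clang" / "libQnnHtpNetRunExtensions.so"
    return {
        "backend_extensions": {
            "shared_library_path": str(lib.resolve()),
            "config_file_path": str(config_path.resolve()),
        }
    }


def write_configs(workdir: Path, sdk_root: Path,
                  vtcm_mb: int = 2, htp_arch: str = "v68") -> None:
    """Write both JSON files into workdir."""
    config_path = workdir / "backend_config.json"
    config = build_backend_config("inception_v3_quantized", vtcm_mb, htp_arch)
    config_path.write_text(json.dumps(config, indent=4) + "\n")
    extension = build_backend_extension(sdk_root, config_path)
    extension_path = workdir / "backend_extension.json"
    extension_path.write_text(json.dumps(extension, indent=4) + "\n")


if __name__ == "__main__":
    # Assumption: QAIRT_SDK_ROOT is exported by the SDK environment setup.
    sdk_root = os.environ.get("QAIRT_SDK_ROOT")
    if sdk_root:
        write_configs(Path.cwd(), Path(sdk_root))
    else:
        print("QAIRT_SDK_ROOT is not set; no files were written.")
```

Run the script from the directory where you invoke `qnn-context-binary-generator`, because the command above refers to `backend_extension.json` by a relative path. For QCS9100, pass `vtcm_mb=8` and `htp_arch="v73"`.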

### Generate quantized ONNX from the QNN context binary

The `gen_qnn_ctx_onnx_model.py` utility script embeds an offline QNN context binary into an ONNX model.
It generates a new ONNX file where the QNN context is stored as an initializer and connected through a
custom QNN operator, allowing the ONNX Runtime QNN EP to load and execute the pregenerated context directly.

1. Clone the `onnxruntime` repository.

        git clone https://github.com/microsoft/onnxruntime.git
2. Copy the `gen_qnn_ctx_onnx_model.py` utility script from the cloned repository to your working directory.

        cp onnxruntime/onnxruntime/python/tools/qnn/gen_qnn_ctx_onnx_model.py .
3. Remove the local copy of the `onnxruntime` repository.

        rm -rf onnxruntime
4. Generate the model by running the `gen_qnn_ctx_onnx_model.py` utility script.

        python gen_qnn_ctx_onnx_model.py -b output/libinception_v3_quantized.serialized.bin \
            -q inception_v3_quantized_net.json \
            --quantized_IO

    The model is generated at `~/ort_workspace/inception_v3_quantized_net_qnn_ctx.onnx` and can be used for
    inference on HTP with the [ORT QNN EP](https://docs.qualcomm.com/doc/80-62010-1/topic/ort-qnn-ep.html#ort-qnn-ep).
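
Before moving the model to the target device, it can help to confirm that each stage produced its output. The following sketch checks for the file names used in the commands of this section. The helper name `missing_artifacts` is hypothetical, and the relative paths assume that every command was run from `~/ort_workspace`.

```python
from pathlib import Path

# Relative paths taken from the commands in this section.
# Assumption: every command was run from the workspace directory.
EXPECTED_ARTIFACTS = (
    "inception_v3_quantized.cpp",
    "inception_v3_quantized.bin",
    "libs/x86_64-linux-clang/libinception_v3_quantized.so",
    "output/libinception_v3_quantized.serialized.bin",
    "inception_v3_quantized_net_qnn_ctx.onnx",
)


def missing_artifacts(workspace: Path) -> list[str]:
    """Return the expected artifacts that are absent or empty."""
    missing = []
    for rel in EXPECTED_ARTIFACTS:
        path = workspace / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    workspace = Path.home() / "ort_workspace"
    for rel in missing_artifacts(workspace):
        print(f"missing or empty: {rel}")
```

An empty result means that all five files exist and are non-empty; it does not validate their contents.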

Copy the generated model to the target device using the following command:

    scp -r ~/ort_workspace/inception_v3_quantized_net_qnn_ctx.onnx root@<IP-ADDRESS>:/opt/

Last Published: May 14, 2026
