# Deploy a model using ONNX runtime

You can deploy the model in either of the following ways, listed from the
quickest and simplest option to the most time-consuming and complex one.

- Run an ONNX model inside an Ubuntu Docker using Python

- Run a C++ app with ORT

## Run an ONNX model inside an Ubuntu Docker using Python

The following figure shows the high-level process to run an ONNX model inside
an Ubuntu Docker container.

[Figure: Compile the ORT Python package → Validate the ORT Python package → Install the ORT Python package → Deploy the model to the target device]

Run an ONNX model inside an Ubuntu Docker container

### Compile and validate the ORT Python package

1. Build, load, and run the Qualcomm IM SDK Docker image. For details, see [Build a Qualcomm IM SDK Docker image](https://docs.qualcomm.com/doc/80-80022-50/topic/build-and-run-qualcomm-im-sdk-docker-image.html).

Note

Run Step 1 outside the Docker environment.

Note

Build, load, and run the Docker image as the `root` user.

2. Install the required system packages.

Note

Run Step 2 and all following steps inside the Docker environment.

apt update

apt install git wget unzip cmake build-essential

3. Set up the Conda environment.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh

bash Miniconda3-latest-Linux-aarch64.sh

source ~/.bashrc

4. Install Python 3.10.

conda create -n py310 python=3.10.9

conda activate py310

python --version

pip install numpy pybind11 cmake packaging

5. Download the Qualcomm AI Runtime SDK (QAIRT).

Note

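To build ONNX runtime with the QNN execution provider from source,
you must download Qualcomm AI Engine Direct (QNN), which is part of
QAIRT. The build command in Step 6 expects the SDK under
`/home/qimsdk/qairt/`, so run the following commands from `/home/qimsdk`.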

wget https://softwarecenter.qualcomm.com/api/download/software/sdks/Qualcomm_AI_Runtime_Community/All/2.43.0.260128/v2.43.0.260128.zip && unzip v2.43.0.260128.zip

cd qairt/2.43.0.260128/bin

bash check-linux-dependency.sh

./check-python-dependency

export PATH=$PATH:/root/miniconda3/envs/py310/lib/python3.10/site-packages/cmake/data/bin/

6. Compile ORT with the QNN execution provider.

cd /home/qimsdk

git clone --recursive https://github.com/microsoft/onnxruntime

cd onnxruntime/tools/ci_build/

git checkout 2145c8c0d7fedc0a8be65590ddce0fdc3e44d9b7

python build.py --use_qnn \
    --qnn_home=/home/qimsdk/qairt/2.43.0.260128/ \
    --build_wheel \
    --skip_submodule_sync \
    --config Release \
    --build_dir /home/qimsdk/onnxruntime/build/ \
    --allow_running_as_root \
    --parallel 6 \
    --skip_tests

    The build generates the `onnxruntime_perf_test` tool and the ORT libraries in `/home/qimsdk/onnxruntime/build/Release/`.
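If the build fails early, a quick preflight check can confirm that the tools from Step 2 are on `PATH` and that the Python 3.10 environment from Step 4 is active. The following is a minimal sketch using only the Python standard library; the helper names `missing_tools` and `is_python_310` are hypothetical and not part of any SDK.

```python
import shutil
import sys

# Tools installed in Step 2; adjust the list if your setup differs.
REQUIRED_TOOLS = ["git", "wget", "unzip", "cmake"]


def missing_tools(tools):
    """Return the tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def is_python_310(version_info=sys.version_info):
    """Return True if the interpreter is Python 3.10.x, as created in Step 4."""
    return tuple(version_info[:2]) == (3, 10)


if __name__ == "__main__":
    print("Missing tools:", missing_tools(REQUIRED_TOOLS) or "none")
    print("Python 3.10 active:", is_python_310())
```

Run the script inside the Docker environment after `conda activate py310`, so that it inspects the same interpreter and `PATH` that the build uses.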

### Validate the ORT build using `onnxruntime_perf_test`

`onnxruntime_perf_test` is a command-line benchmarking tool provided by ONNX runtime to measure
model inference performance. It evaluates latency, throughput, and execution characteristics across
different execution providers (such as CPU or QNN).

Download the pre-compiled model using the following commands.

cd /home/qimsdk/

wget https://huggingface.co/qualcomm/Inception-v3/resolve/v0.45.0/Inception-v3_w8a8.onnx.zip && unzip Inception-v3_w8a8.onnx.zip

cd /home/qimsdk/onnxruntime/build/Release/

Use the `onnxruntime_perf_test` CLI tool to run inference on the model.

./onnxruntime_perf_test -e qnn -m times -r 1 -p burst -I \
    -i 'htp_graph_finalization_optimization_mode|3 htp_performance_mode|burst enable_htp_fp16_precision|1 backend_path|libQnnHtp.so' \
    /home/qimsdk/job_jpy60qzl5_optimized_onnx/model.onnx
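The `-i` argument in the command above packs the QNN provider options into one string of `key|value` tokens separated by spaces. If you script several runs with different options, a small helper can build that string from a dictionary. This is a sketch whose format is inferred from the command above; `format_provider_options` is a hypothetical helper, not an ONNX runtime API.

```python
def format_provider_options(options):
    """Join key/value pairs as 'key|value' tokens separated by spaces."""
    return " ".join(f"{key}|{value}" for key, value in options.items())


# Option names and values copied from the command above.
qnn_options = {
    "htp_graph_finalization_optimization_mode": 3,
    "htp_performance_mode": "burst",
    "enable_htp_fp16_precision": 1,
    "backend_path": "libQnnHtp.so",
}

print(format_provider_options(qnn_options))
```

Pass the resulting string as a single quoted argument to `-i` so that the shell doesn't interpret the `|` characters as pipes.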

### Install the ORT package

1. [Complete the prerequisites](https://docs.qualcomm.com/doc/80-80022-15B/topic/onnx-prepare-model.html#ort-prerequisites) and [compile the ORT Python package](https://docs.qualcomm.com/doc/80-80022-15B/topic/onnx-deploy-model.html#compile-ort-python-package)
to set up your environment to run models with Python.
2. After successful compilation, install the wheel file generated in
`/home/qimsdk/onnxruntime/build/Release/dist` using the following command.

pip install /home/qimsdk/onnxruntime/build/Release/dist/onnxruntime_qnn-1.25.0-cp310-cp310-linux_aarch64.whl

Note

The version in the wheel filename may differ from the one shown. Check the `dist` folder of your build for the exact filename.
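Because the wheel filename depends on the version that was built, you can look it up rather than typing it by hand. The following sketch assumes that the wheel name starts with `onnxruntime_qnn-`, as in the command above; `find_wheel` is a hypothetical helper.

```python
from pathlib import Path


def find_wheel(dist_dir, pattern="onnxruntime_qnn-*.whl"):
    """Return the most recently modified wheel matching pattern, or None."""
    dist_path = Path(dist_dir)
    if not dist_path.is_dir():
        return None
    wheels = sorted(dist_path.glob(pattern), key=lambda p: p.stat().st_mtime)
    return wheels[-1] if wheels else None


if __name__ == "__main__":
    wheel = find_wheel("/home/qimsdk/onnxruntime/build/Release/dist")
    if wheel is None:
        print("No wheel found; check that the build completed.")
    else:
        print(f"pip install {wheel}")
```

The script only prints the `pip install` command, so you can review the selected file before installing it.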

### Deploy the model to the target device

#### Run a pre-compiled model from AI Hub

1. Download a pre-compiled model using the following commands.

cd /home/qimsdk/

wget https://huggingface.co/qualcomm/Inception-v3/resolve/v0.45.0/Inception-v3_w8a8.onnx.zip && unzip Inception-v3_w8a8.onnx.zip

2. To run the pre-compiled model downloaded from AI Hub, save the following Python script as
`run_ai_hub_model.py` and run it with Python.

        import onnxruntime
        import numpy as np
        import time

        options = onnxruntime.SessionOptions()

        # (Optional) Enable configuration that raises an exception if the model can't be
        # run entirely on the QNN HTP backend.
        options.add_session_config_entry("session.disable_cpu_ep_fallback", "1")

        # Create an ONNX Runtime session that uses the QNN HTP backend.
        # backend_path points to the HTP backend library from the QNN SDK.
        session_htp = onnxruntime.InferenceSession("/home/qimsdk/job_jpy60qzl5_optimized_onnx/model.onnx",
                                              sess_options=options, providers=["QNNExecutionProvider"],
                                              provider_options=[{"backend_path": "/usr/lib/libQnnHtp.so",
                                              "enable_htp_fp16_precision": "0"}])

        # Run the model on the HTP backend with your input and time the call.
        input0 = np.ones((1,3,224,224), dtype=np.uint8)
        htp_start = time.time()
        result = session_htp.run(None, {"image_tensor": input0})
        htp_end = time.time()
        # Print output.
        #print(result)

        # Create a second session that uses the QNN CPU backend, with CPU EP fallback allowed.
        # backend_path points to the CPU backend library from the QNN SDK.
        options.add_session_config_entry("session.disable_cpu_ep_fallback", "0")
        session_cpu = onnxruntime.InferenceSession("/home/qimsdk/job_jpy60qzl5_optimized_onnx/model.onnx",
                                              sess_options=options, providers=["QNNExecutionProvider"],
                                              provider_options=[{"backend_path": "/usr/lib/libQnnCpu.so",
                                              "enable_htp_fp16_precision": "0"}])

        # Run the model on the CPU backend with the same input and time the call.
        cpu_start = time.time()
        result = session_cpu.run(None, {"image_tensor": input0})
        cpu_end = time.time()

        print("CPU time =", cpu_end - cpu_start)
        print("HTP time =", htp_end - htp_start)
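The script above times a single `run` call per backend, so the result can include one-time setup cost on the first inference. For a steadier comparison, you might discard a few warm-up runs and average the rest. The following is a minimal sketch of that idea; `benchmark` is a hypothetical helper, and the warm-up and run counts are arbitrary choices.

```python
import statistics
import time


def benchmark(run_once, warmup=2, runs=10):
    """Return (mean, population std dev) of run_once() latency in seconds."""
    for _ in range(warmup):
        run_once()  # Warm-up calls are not timed.
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.pstdev(timings)


# Usage with the sessions from the script above (not executed here):
# mean_s, std_s = benchmark(lambda: session_htp.run(None, {"image_tensor": input0}))
# print(f"HTP mean latency = {mean_s * 1000:.2f} ms (+/- {std_s * 1000:.2f} ms)")
```

`time.perf_counter()` is used here instead of `time.time()` because it is a monotonic, high-resolution clock intended for measuring short intervals.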

#### Run a quantized ONNX model generated with QNN

1. Follow the instructions to [prepare an ONNX model using Qualcomm AI Runtime SDK](https://docs.qualcomm.com/doc/80-80022-15B/topic/onnx-prepare-model.html#prepare-an-onnx-model-using-qualcomm-ai-runtime-sdk).
2. Open a terminal on the Ubuntu host computer and SSH into the target device.
3. Copy the model file (prepared in Step 1) from the host computer to the Docker
container using the following `docker cp` command:

docker cp /opt/inception_v3_quantized_net_qnn_ctx.onnx qimsdk:/home/qimsdk/

4. To run the model prepared with QNN, save the following Python script as
`run_qnn_ctx.py` and run it with Python.

        import onnxruntime
        import numpy as np
        import time

        options = onnxruntime.SessionOptions()

        # (Optional) Enable configuration that raises an exception if the model can't be
        # run entirely on the QNN HTP backend.
        options.add_session_config_entry("session.disable_cpu_ep_fallback", "1")

        # Create an ONNX Runtime session that uses the QNN HTP backend.
        # backend_path points to the HTP backend library from the QNN SDK.
        session_htp = onnxruntime.InferenceSession("/home/qimsdk/inception_v3_quantized_net_qnn_ctx.onnx",
                                              sess_options=options, providers=["QNNExecutionProvider"],
                                              provider_options=[{"backend_path": "/usr/lib/libQnnHtp.so",
                                              "enable_htp_fp16_precision": "0"}])

        # Run the model with your input and time the call.
        input0 = np.ones((1,224,224,3), dtype=np.uint8)
        htp_start = time.time()
        result = session_htp.run(None, {"image_tensor": input0})
        htp_end = time.time()
        print("HTP time =", htp_end - htp_start)
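The two scripts in this section pass differently shaped inputs, `(1,3,224,224)` and `(1,224,224,3)`, under the same input name. If `run` rejects your input, it can help to list what the loaded model actually expects. The sketch below relies on `InferenceSession.get_inputs()`, which returns one entry per model input with `name`, `shape`, and `type` attributes; `describe_inputs` is a hypothetical helper.

```python
def describe_inputs(session):
    """Return (name, shape, type) for each model input of an ONNX Runtime session."""
    return [(arg.name, arg.shape, arg.type) for arg in session.get_inputs()]


# Usage with the session from the script above (not executed here):
# for name, shape, dtype in describe_inputs(session_htp):
#     print(name, shape, dtype)
```

Compare the printed name, shape, and type with the array that you pass to `session_htp.run`.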

## Run a C++ app with ORT

The following figure shows the high-level process to build and run C++ ONNX runtime sample
applications for Yocto Linux.

[Figure: Install and set up the eSDK → Download and compile the C++ sample application → Deploy the application to the target device → Download the model and run inference]

Run a C++ app with ORT

### Install and set up the eSDK

1. [Download and install the Platform eSDK](https://docs.qualcomm.com/doc/80-80022-51/topic/install-sdk.html).

    The eSDK provides the required cross-compiler toolchain.

Note

For Yocto Scarthgap devices, the libraries are built using GCC 11.2.

2. Go to the directory where you extracted the eSDK.

cd <PATH/TO/EXTRACTED/TOOLCHAIN>

3. Set the workspace path.

export WORKSPACE="$PWD"

4. Source the eSDK environment setup script to configure the cross-compilation environment:

source environment-setup-armv8-2a-qcom-linux

### Download and compile the C++ sample application

1. Clone the sample repository.

git clone https://github.com/quic/sample-apps-for-qualcomm-linux.git && cd sample-apps-for-qualcomm-linux/qualcomm-linux/applications/ort_example/

2. Set the required build environment variables.

export SDKTARGETSYSROOT=$WORKSPACE/tmp/sysroots/

export MACHINE=qcs6490-rb3gen2-vision-kit

Note

Update the machine name to match your platform (RB3 Gen 2, IQ8, or IQ9).

After you install the eSDK, you can find the machine name in the `SDKTARGETSYSROOT`
directory.

3. Build the application.

make

    This generates the `onnxruntime-example` executable.
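If `make` fails with missing paths, one thing to check is whether the variables exported in the steps above are set in the current shell. The following is a small sketch using the Python standard library; `missing_vars` is a hypothetical helper, and the variable list reflects only what this page exports.

```python
import os

# Variables exported in the steps above.
REQUIRED_VARS = ["WORKSPACE", "SDKTARGETSYSROOT", "MACHINE"]


def missing_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars(REQUIRED_VARS)
    print("Missing variables:", ", ".join(missing) if missing else "none")
```

Run it from the same shell in which you sourced the eSDK environment setup script, because exported variables don't carry over to a new terminal.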

### Deploy the application to the target device

1. Copy the binary to the target device:

scp onnxruntime-example root@<IP_ADDRESS>:/opt/

2. Sign in to the target device using SSH.

ssh root@<IP_ADDRESS>

### Download the model and run inference

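1. Download the quantized ONNX model. Run these commands in `/opt/`, where you copied the `onnxruntime-example` binary, so that the model and the binary are in the same directory.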

wget https://huggingface.co/qualcomm/Inception-v3/resolve/v0.45.0/Inception-v3_w8a8.onnx.zip

unzip Inception-v3_w8a8.onnx.zip

mv job_jpy60qzl5_optimized_onnx/* .

2. Prepare the `input.raw` file required for inference.

    In a Python shell, run the following commands to generate `input.raw`.

import numpy as np

np.random.random((1, 3, 224, 224)).astype(np.float32).tofile("input.raw")

3. Run inference using the HTP backend:

./onnxruntime-example --htp model.onnx input.raw
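A raw input file has no header, so its size should equal the number of tensor elements multiplied by the bytes per element. For the `float32` tensor of shape `(1, 3, 224, 224)` generated in Step 2, that is 150528 elements × 4 bytes. The following sketch checks a file against that size; both helper names are hypothetical.

```python
import math
import os


def expected_raw_size(shape, bytes_per_element):
    """Return the byte size of a raw tensor file with the given shape."""
    return math.prod(shape) * bytes_per_element


def raw_file_matches(path, shape, bytes_per_element):
    """Return True if the file at path has exactly the expected size."""
    return os.path.getsize(path) == expected_raw_size(shape, bytes_per_element)


# float32 uses 4 bytes per element, so (1, 3, 224, 224) needs 602112 bytes.
print(expected_raw_size((1, 3, 224, 224), 4))
```

For example, `raw_file_matches("input.raw", (1, 3, 224, 224), 4)` should return `True` for the file created in Step 2.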

Last Published: May 14, 2026
