
# Deploy a model using ONNX Runtime

The following options are available to deploy the model, ordered from the quickest and most straightforward to the most time-consuming and complex:

- Run an ONNX model inside an Ubuntu Docker using Python
- Run a C++ app with ORT

## Run an ONNX model inside an Ubuntu Docker using Python

The following image shows the high-level process to run an ONNX model inside an Ubuntu Docker.

Figure: Run ONNX model inside an Ubuntu Docker

### Compile and validate the ORT Python package

1. Build, load, and run the Qualcomm IM SDK Docker image. See [Build a Qualcomm IM SDK Docker image](https://docs.qualcomm.com/doc/80-80022-50/topic/build-and-run-qualcomm-im-sdk-docker-image.html) for more details.

   Note: Run Step 1 outside the Docker environment.

   Note: Build, load, and run the Docker image as the `root` user.

2. Install the required system packages.

   Note: Run Step 2 and all following steps inside the Docker environment.

   ```bash
   apt update
   apt install git wget unzip cmake build-essential
   ```

3. Set up the Conda environment.

   ```bash
   wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh
   bash Miniconda3-latest-Linux-aarch64.sh
   source ~/.bashrc
   ```

4. Install Python 3.10.

   ```bash
   conda create -n py310 python=3.10.9
   conda activate py310
   python --version
   pip install numpy pybind11 cmake packaging
   ```

5. Download the Qualcomm AI Runtime SDK (QAIRT).

   Note: To build ONNX Runtime with the QNN execution provider from source, you must download Qualcomm AI Engine Direct (QNN), which is part of QAIRT.
   ```bash
   wget https://softwarecenter.qualcomm.com/api/download/software/sdks/Qualcomm_AI_Runtime_Community/All/2.43.0.260128/v2.43.0.260128.zip && unzip v2.43.0.260128.zip
   cd qairt/2.43.0.260128/bin
   bash check-linux-dependency.sh
   ./check-python-dependency
   export PATH=$PATH:/root/miniconda3/envs/py310/lib/python3.10/site-packages/cmake/data/bin/
   ```

   Note: The build command in Step 6 passes `--qnn_home=/home/qimsdk/qairt/2.43.0.260128/`, so extract the archive in `/home/qimsdk` or adjust that path to match your download location.

6. Compile ORT with the QNN execution provider.

   ```bash
   cd /home/qimsdk
   git clone --recursive https://github.com/microsoft/onnxruntime
   cd onnxruntime/tools/ci_build/
   git checkout 2145c8c0d7fedc0a8be65590ddce0fdc3e44d9b7
   python build.py --use_qnn \
     --qnn_home=/home/qimsdk/qairt/2.43.0.260128/ \
     --build_wheel \
     --skip_submodule_sync \
     --config Release \
     --build_dir /home/qimsdk/onnxruntime/build/ \
     --allow_running_as_root \
     --parallel 6 \
     --skip_tests
   ```

   The build generates `onnxruntime_perf_test` and the other libraries in `/home/qimsdk/onnxruntime/build/Release/`.

### Validate the ORT build using `onnxruntime_perf_test`

`onnxruntime_perf_test` is a command-line benchmarking tool provided by ONNX Runtime to measure model inference performance. It evaluates latency, throughput, and execution characteristics across different execution providers (such as CPU or QNN).

Download the pre-compiled model using the following commands.

```bash
cd /home/qimsdk/
wget https://huggingface.co/qualcomm/Inception-v3/resolve/v0.45.0/Inception-v3_w8a8.onnx.zip && unzip Inception-v3_w8a8.onnx.zip
cd /home/qimsdk/onnxruntime/build/Release/
```

Use the `onnxruntime_perf_test` CLI tool to run inference on the model.
```bash
./onnxruntime_perf_test -e qnn -m times -r 1 -p burst -I \
  -i 'htp_graph_finalization_optimization_mode|3 htp_performance_mode|burst enable_htp_fp16_precision|1 backend_path|libQnnHtp.so' \
  /home/qimsdk/job_jpy60qzl5_optimized_onnx/model.onnx
```

### Install the ORT package

1. [Complete the prerequisites](https://docs.qualcomm.com/doc/80-80022-15B/topic/onnx-prepare-model.html#ort-prerequisites) and [compile the ORT Python package](https://docs.qualcomm.com/doc/80-80022-15B/topic/onnx-deploy-model.html#compile-ort-python-package) to set up your environment to run models with Python.

2. After the build completes successfully, install the wheel file generated in `/home/qimsdk/onnxruntime/build/Release/dist` using the following command.

   ```bash
   pip install /home/qimsdk/onnxruntime/build/Release/dist/onnxruntime_qnn-1.25.0-cp310-cp310-linux_aarch64.whl
   ```

   Note: The version may change. Check your build folder to find the correct wheel file.

### Deploy the model to the target device

#### Run a pre-compiled model from AI Hub

1. Download a pre-compiled model using the following commands.

   ```bash
   cd /home/qimsdk/
   wget https://huggingface.co/qualcomm/Inception-v3/resolve/v0.45.0/Inception-v3_w8a8.onnx.zip && unzip Inception-v3_w8a8.onnx.zip
   ```

2. To run the pre-compiled model downloaded from AI Hub, save the following Python script as `run_ai_hub_model.py` and run it using Python.

   ```python
   import time

   import numpy as np
   import onnxruntime

   MODEL_PATH = "/home/qimsdk/job_jpy60qzl5_optimized_onnx/model.onnx"

   options = onnxruntime.SessionOptions()

   # (Optional) Raise an exception if the model can't be run entirely
   # on the QNN HTP backend.
   options.add_session_config_entry("session.disable_cpu_ep_fallback", "1")

   # Create an ONNX Runtime session on the QNN HTP backend.
   session_htp = onnxruntime.InferenceSession(
       MODEL_PATH,
       sess_options=options,
       providers=["QNNExecutionProvider"],
       # Path to the HTP backend library from the QNN SDK.
       provider_options=[{"backend_path": "/usr/lib/libQnnHtp.so",
                          "enable_htp_fp16_precision": "0"}],
   )

   # Run the model with your input.
   input0 = np.ones((1, 3, 224, 224), dtype=np.uint8)
   htp_start = time.time()
   result = session_htp.run(None, {"image_tensor": input0})
   htp_end = time.time()

   # Print output.
   # print(result)

   # Create a second session on the QNN CPU backend for comparison.
   options.add_session_config_entry("session.disable_cpu_ep_fallback", "0")
   session_cpu = onnxruntime.InferenceSession(
       MODEL_PATH,
       sess_options=options,
       providers=["QNNExecutionProvider"],
       # Path to the CPU backend library from the QNN SDK.
       provider_options=[{"backend_path": "/usr/lib/libQnnCpu.so",
                          "enable_htp_fp16_precision": "0"}],
   )

   # Run the model with the same input.
   cpu_start = time.time()
   result = session_cpu.run(None, {"image_tensor": input0})
   cpu_end = time.time()

   print(f"CPU time = {cpu_end - cpu_start:.4f} s")
   print(f"HTP time = {htp_end - htp_start:.4f} s")
   ```

#### Run a quantized ONNX model generated with QNN

1. Follow the instructions to [prepare an ONNX model using Qualcomm AI Runtime SDK](https://docs.qualcomm.com/doc/80-80022-15B/topic/onnx-prepare-model.html#prepare-an-onnx-model-using-qualcomm-ai-runtime-sdk).

2. Open a terminal on the Ubuntu host computer and SSH into the target device.

3. Copy the model file (prepared in Step 1) from the host computer to the Docker container using the following `docker cp` command:

   ```bash
   docker cp /opt/inception_v3_quantized_net_qnn_ctx.onnx qimsdk:/home/qimsdk/
   ```

4. To run the model prepared with QNN, save the following Python script as `run_qnn_ctx.py` and run it using Python.
   ```python
   import time

   import numpy as np
   import onnxruntime

   options = onnxruntime.SessionOptions()

   # (Optional) Raise an exception if the model can't be run entirely
   # on the QNN HTP backend.
   options.add_session_config_entry("session.disable_cpu_ep_fallback", "1")

   # Create an ONNX Runtime session on the QNN HTP backend.
   session_htp = onnxruntime.InferenceSession(
       "/home/qimsdk/inception_v3_quantized_net_qnn_ctx.onnx",
       sess_options=options,
       providers=["QNNExecutionProvider"],
       # Path to the HTP backend library from the QNN SDK.
       provider_options=[{"backend_path": "/usr/lib/libQnnHtp.so",
                          "enable_htp_fp16_precision": "0"}],
   )

   # Run the model with your input.
   input0 = np.ones((1, 224, 224, 3), dtype=np.uint8)
   htp_start = time.time()
   result = session_htp.run(None, {"image_tensor": input0})
   htp_end = time.time()

   print(f"HTP time = {htp_end - htp_start:.4f} s")
   ```

## Run a C++ app with ORT

The following figure shows the high-level process to build and run C++ ONNX Runtime sample applications for Yocto Linux.

Figure: Build and run C++ ONNX Runtime sample applications for Yocto Linux

### Install and set up the eSDK

1. [Download and install the Platform eSDK](https://docs.qualcomm.com/doc/80-80022-51/topic/install-sdk.html). The eSDK provides the required cross-compiler toolchain.

   Note: For Yocto Scarthgap devices, the libraries are built using GCC 11.2.

2. Go to the directory where you extracted the eSDK.

   ```bash
   cd
   ```

3. Set the workspace path.

   ```bash
   export WORKSPACE="$PWD"
   ```

4. Source the eSDK environment setup script to configure the cross-compilation environment:

   ```bash
   source environment-setup-armv8-2a-qcom-linux
   ```

### Download and compile the C++ sample application

1. Clone the sample repository.

   ```bash
   git clone https://github.com/quic/sample-apps-for-qualcomm-linux.git && cd sample-apps-for-qualcomm-linux/qualcomm-linux/applications/ort_example/
   ```

2. Set the required build environment variables.
   ```bash
   export SDKTARGETSYSROOT=$WORKSPACE/tmp/sysroots/
   export MACHINE=qcs6490-rb3gen2-vision-kit
   ```

   Note: Update the machine name to match your platform (RB3 Gen 2, IQ8, or IQ9). After you install the eSDK, you can find the machine name in the `SDKTARGETSYSROOT` directory.

3. Build the application.

   ```bash
   make
   ```

   This generates the `onnxruntime-example` executable.

### Deploy the application to the target device

1. Copy the binary to the target device:

   ```bash
   scp onnxruntime-example root@:/opt/
   ```

2. Sign in to the target device using SSH.

   ```bash
   ssh root@
   ```

### Download the model and run inference

1. Download the quantized ONNX model.

   ```bash
   wget https://huggingface.co/qualcomm/Inception-v3/resolve/v0.45.0/Inception-v3_w8a8.onnx.zip
   unzip Inception-v3_w8a8.onnx.zip
   mv job_jpy60qzl5_optimized_onnx/* .
   ```

2. Prepare the `input.raw` file required for inference. In a Python shell, run the following commands to generate `input.raw`.

   ```python
   import numpy as np
   np.random.random((1, 3, 224, 224)).astype(np.float32).tofile("input.raw")
   ```

3. Run inference using the HTP backend:

   ```bash
   ./onnxruntime-example --htp model.onnx input.raw
   ```

Last Published: Jun 23, 2026
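The `input.raw` step in "Download the model and run inference" writes a float32 tensor of shape (1, 3, 224, 224) as raw bytes with no header, so the file size is the only quick check that the file was generated as intended. The following is a minimal sketch of that step with a size check added. It assumes the same shape and dtype as the commands in this topic; the helper name `write_input_raw` and the fixed seed are illustrative and are not part of the sample application.

```python
import os

import numpy as np


def write_input_raw(path, shape=(1, 3, 224, 224), seed=0):
    """Write a random float32 tensor to `path` as headerless raw bytes.

    Returns the number of bytes written. The name and the seed are
    illustrative assumptions, not part of the sample application.
    """
    rng = np.random.default_rng(seed)
    data = rng.random(shape, dtype=np.float32)  # values in [0, 1)
    data.tofile(path)
    return data.nbytes


if __name__ == "__main__":
    written = write_input_raw("input.raw")
    # 1 * 3 * 224 * 224 elements, 4 bytes each.
    expected = 1 * 3 * 224 * 224 * 4
    assert written == expected
    assert os.path.getsize("input.raw") == expected
    print(f"input.raw: {written} bytes")
```

A file of any other size suggests that the shape or dtype differs from the one used in this topic.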