# GPU

This section provides information about the QNN GPU backend.

- [API Specializations](https://docs.qualcomm.com/doc/80-63442-10/topic/gpu_backend.html#api-specializations)
- [Operation Limitations](https://docs.qualcomm.com/doc/80-63442-10/topic/gpu_backend.html#operation-limitations)
- [Kernel Persistence](https://docs.qualcomm.com/doc/80-63442-10/topic/gpu_backend.html#kernel-persistence)
- [Precision Mode](https://docs.qualcomm.com/doc/80-63442-10/topic/gpu_backend.html#precision-mode)
- [Performance Hints](https://docs.qualcomm.com/doc/80-63442-10/topic/gpu_backend.html#performance-hints)
- [Context Configs](https://docs.qualcomm.com/doc/80-63442-10/topic/gpu_backend.html#context-configs)
- [Backend Configs](https://docs.qualcomm.com/doc/80-63442-10/topic/gpu_backend.html#backend-configs)
- [QnnDevice API Usage](https://docs.qualcomm.com/doc/80-63442-10/topic/gpu_backend.html#qnndevice-api-usage)
- [Disabling Optimizations](https://docs.qualcomm.com/doc/80-63442-10/topic/gpu_backend.html#disabling-optimizations)
- [QNN GPU Backend Extensions](https://docs.qualcomm.com/doc/80-63442-10/topic/gpu_backend.html#qnn-gpu-backend-extensions)
- [Custom Profile Reader](https://docs.qualcomm.com/doc/80-63442-10/topic/gpu_backend.html#custom-profile-reader)
- [Op Package Writing Guidelines](https://docs.qualcomm.com/doc/80-63442-10/topic/gpu_backend.html#op-package-writing-guidelines)
- [QNN Mem API Tutorial for GPU](https://docs.qualcomm.com/doc/80-63442-10/topic/gpu_backend.html#qnn-mem-api-tutorial-for-gpu)
- [Tuning Mode (Beta)](https://docs.qualcomm.com/doc/80-63442-10/topic/gpu_backend.html#tuning-mode-beta)
- [Offline Prepare](https://docs.qualcomm.com/doc/80-63442-10/topic/gpu_backend.html#offline-prepare)
- [Other Notes](https://docs.qualcomm.com/doc/80-63442-10/topic/gpu_backend.html#other-notes)

## API Specializations

This section contains information related to API specialization for the GPU backend. All QNN GPU
backend specialization is available under the `${QNN_SDK_ROOT}/include/QNN/GPU/` directory.

The current version of the QNN GPU backend API is given by the `QNN_GPU_API_VERSION_MAJOR`,
`QNN_GPU_API_VERSION_MINOR`, and `QNN_GPU_API_VERSION_PATCH` defines.

## Operation Limitations

QNN GPU operation limitations are documented in the GPU Backend Op Definition Supplement.

## Kernel Persistence

The QNN GPU backend supports two kernel persistence strategies held within a QNN Context: in-memory and on-disk.
The in-memory persistence is referred to as the kernel registry, and the on-disk persistence as the kernel
repository. Both mechanisms reuse kernels to reduce model initialization time. The following simple use case
outlines how to use these features.

A user creates a new QNN GPU Context by calling
QnnContext\_create with a custom config
setting that provides a valid kernelRepoDir. Let’s assume this
path is `${QNN_GPU_KERNEL_REPO}` and that no on-disk repository exists at this path yet. Therefore,
no kernels are deserialized and the in-memory registry starts out empty. (When an on-disk repository does exist,
kernels originating from the built-in qti.aisw op package are deserialized during
QnnContext\_create, and kernels originating from
another op package are deserialized when that op package is registered via
QnnBackend\_registerOpPackage.)

A user creates model A and finalizes it. Suppose that
model A comprises kernels 1, 2, and 3. These kernels are created from scratch and added to the in-memory kernel
registry. A user creates model B and finalizes it. Suppose that model B comprises kernels 3 and 4. Kernel 3 will be
recovered from the in-memory kernel registry and kernel 4 will be created from scratch and added to the registry.

The user now calls QnnContext\_free. Since
a valid kernel repo path was provided, the QNN GPU Context will serialize in-memory kernels and, for each op package,
write them to `${QNN_GPU_KERNEL_REPO}/gpukernelcache.${OP_PKG_NAME}` where OP\_PKG\_NAME is the op package
packageName.

If the user creates another QNN GPU Context specifying the same kernel repo path, these kernels will be deserialized
as outlined above and added to the in-memory kernel registry. If the user now creates model A or B, all kernels will be
ready for creation via the in-memory registry, greatly reducing initialization time.
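The walkthrough above can be modeled with a small, self-contained C++ sketch. It does not call the QNN API:
`KernelRegistry`, `finalizeModel`, `release`, and the `FakeDisk` map standing in for the on-disk repository are
hypothetical names, used only to illustrate how reuse reduces the number of kernels created from scratch.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the on-disk kernel repository:
// repository path -> ids of the kernels serialized there.
using FakeDisk = std::map<std::string, std::set<int>>;

// Hypothetical model of the in-memory kernel registry held by a context.
class KernelRegistry {
 public:
  // Models context creation: load kernels if a repository already exists.
  KernelRegistry(FakeDisk& disk, std::string repoPath)
      : disk_(disk), repoPath_(std::move(repoPath)) {
    assert(!repoPath_.empty());
    auto it = disk_.find(repoPath_);
    if (it != disk_.end()) kernels_ = it->second;
  }

  // Models graph finalization: returns how many kernels had to be created
  // from scratch because they were not yet in the registry.
  int finalizeModel(const std::vector<int>& kernelIds) {
    int created = 0;
    for (int id : kernelIds) {
      if (kernels_.insert(id).second) ++created;
    }
    return created;
  }

  // Models context teardown: persist the registry to the repository path.
  void release() { disk_[repoPath_] = kernels_; }

 private:
  FakeDisk& disk_;
  std::string repoPath_;
  std::set<int> kernels_;
};
```

In this model, finalizing model A (`{1, 2, 3}`) in a fresh registry creates three kernels and model B (`{3, 4}`)
creates one. After `release()`, a second registry constructed on the same path creates none for either model.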

Note that an op package provides a
kernelRepoHash to the Context. If the QNN
GPU Context detects that an on-disk kernel repository was generated by an op package of the same name, but with a
different kernelRepoHash, the on-disk repository will be automatically invalidated. This ensures that kernel version
mismatches do not occur.
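The invalidation rule can be expressed as a simple check. The sketch below is illustrative only: `RepoHeader` and
`canReuseRepo` are hypothetical names, and representing the hash as a 64-bit integer is an assumption made for the
example rather than a statement about the SDK's types.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical metadata stored alongside a serialized kernel repository.
struct RepoHeader {
  std::string packageName;
  std::uint64_t kernelRepoHash;  // assumed integer type, for illustration
};

// Returns true when an on-disk repository may be reused by an op package.
// A repository written by a package of the same name but with a different
// hash is treated as stale.
bool canReuseRepo(const RepoHeader& onDisk, const std::string& packageName,
                  std::uint64_t kernelRepoHash) {
  assert(!packageName.empty());
  if (onDisk.packageName != packageName) return false;  // other package's repo
  return onDisk.kernelRepoHash == kernelRepoHash;
}
```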

Also note that these QNN GPU kernel persistence features are separate from the QNN context cache feature (see
QnnContext\_getBinary). A QNN GPU context cache
will store everything needed to re-create a context, including kernels.

## Precision Mode

The QNN GPU backend offers four precision modes via the QNN graph custom config feature
(see QnnGpuGraph\_CustomConfig\_t and
QnnGpu\_Precision\_t). These modes are:

- QNN\_GPU\_PRECISION\_FP32 (FP32 mode)
    - FP32 mode will convert NATIVE tensor data types to FP32 and will select kernels that use an FP32 accumulator.
    - FP32 mode offers the best accuracy at the expense of performance.
- QNN\_GPU\_PRECISION\_FP16 (FP16 mode)
    - FP16 mode will convert NATIVE tensor data types to FP16 and will select kernels that use an FP16 accumulator
      where possible.
    - FP16 mode offers the best performance at the expense of accuracy.
- QNN\_GPU\_PRECISION\_HYBRID (Hybrid mode)
    - Hybrid mode will convert NATIVE tensor data types to FP16 and will select kernels that use an FP32 accumulator.
    - Hybrid mode offers a good trade-off between performance and accuracy.
- QNN\_GPU\_PRECISION\_USER\_PROVIDED
    - This is the default precision mode when a custom config has not been provided.
    - The QNN GPU backend will not optimize NATIVE tensor data types.
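To make the accumulator trade-off concrete, the following sketch emulates an FP16 accumulator in software by rounding
every partial sum to an 11-bit significand, and compares it with a plain FP32 accumulator. It is only an illustration
of the rounding effect. It does not reflect how the backend's kernels are implemented, it ignores FP16 exponent range
and subnormals, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Round a value to an 11-bit significand, the significand width of IEEE
// half precision. Exponent range and subnormals are deliberately ignored.
float roundToHalfSignificand(float x) {
  if (x == 0.0f) return 0.0f;
  int exponent = 0;
  float mantissa = std::frexp(x, &exponent);  // |mantissa| in [0.5, 1)
  mantissa = std::nearbyint(mantissa * 2048.0f) / 2048.0f;
  return std::ldexp(mantissa, exponent);
}

// Sum `count` copies of `value`, rounding each partial sum as FP16 would.
float sumWithFp16Accumulator(float value, int count) {
  assert(count >= 0);
  float acc = 0.0f;
  for (int i = 0; i < count; ++i) acc = roundToHalfSignificand(acc + value);
  return acc;
}

// Sum `count` copies of `value` with an ordinary FP32 accumulator.
float sumWithFp32Accumulator(float value, int count) {
  assert(count >= 0);
  float acc = 0.0f;
  for (int i = 0; i < count; ++i) acc += value;
  return acc;
}
```

Summing 4096 ones stalls at 2048 with the emulated FP16 accumulator, because 2049 is not representable with an 11-bit
significand, while the FP32 accumulator reaches 4096. This is the kind of accumulation error that an FP32 accumulator
avoids, even when the tensor data itself is stored as FP16.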

## Performance Hints

The QNN GPU backend offers three performance hints via the QNN context custom config feature
(see QnnGpuContext\_CustomConfig\_t and
QnnGpuContext\_PerfHint\_t). These hints are:

- QNN\_GPU\_CONTEXT\_PERF\_HINT\_HIGH
    - The HIGH perf hint will maximize GPU clock frequencies.
    - HIGH perf hint offers the best inference latency at the expense of power consumption.
    - This is the default.
- QNN\_GPU\_CONTEXT\_PERF\_HINT\_NORMAL
    - The NORMAL perf hint offers balanced performance dependent upon power management.
- QNN\_GPU\_CONTEXT\_PERF\_HINT\_LOW
    - The LOW perf hint will minimize GPU clock frequencies.
    - LOW perf hint offers the lowest power consumption at the expense of inference latency.

Note that performance hints are included in the context cache. However, calls to
QnnContext\_setConfig can override the
cached performance hint setting.
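The precedence described above (default, cached value, explicit override) can be summarized in a few lines. This is
a sketch under stated assumptions: `PerfHint` and `effectivePerfHint` are hypothetical names that mirror the
documented behavior, not SDK types or functions. It requires C++17 for `std::optional`.

```cpp
#include <cassert>
#include <optional>

// Hypothetical mirror of the three documented hints; not the SDK's enum.
enum class PerfHint { High, Normal, Low };

// Resolve the hint in effect for a context:
//  1. an explicit runtime override wins,
//  2. otherwise the hint stored in the context cache is used,
//  3. otherwise the documented default (HIGH) applies.
PerfHint effectivePerfHint(std::optional<PerfHint> cachedHint,
                           std::optional<PerfHint> runtimeOverride) {
  if (runtimeOverride) return *runtimeOverride;
  if (cachedHint) return *cachedHint;
  return PerfHint::High;
}
```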

## Context Configs

QnnContext custom configs (QnnGpuContext\_CustomConfig\_t)
and Context Priority (see Qnn\_Priority\_t
and QnnContext\_ConfigOption\_t) are supported.

## Backend Configs

QnnBackend custom configs (see QnnGpuBackend\_CustomConfig\_t
and QnnGpuBackend\_ConfigOption\_t)
are supported.

## QnnDevice API Usage

`include/QNN/GPU/QnnGpuDevice.h` is the backend specialization header that goes along with
`include/QNN/QnnDevice.h`. This header file allows clients to configure the QnnDevice to
cater to specific use cases. For a multi-GPU use case, this API can be used to specify which of the
GPUs to target for inferencing.

**QNN GPU Device Type Enums (QnnGpuDevice\_DeviceType\_t)**

| Option Name | Option Description |
| --- | --- |
| QNN\_GPU\_DEVICE\_PRIMARY\_GPU | Integer value used to specify a primary GPU for a multi-GPU use case or used to<br>specify the sole GPU for a singular GPU system. |
| QNN\_GPU\_DEVICE\_SECONDARY\_GPU | Integer value used to specify a secondary GPU for a multi-GPU use case. |

To leverage the multi-GPU use case, it is recommended to use the QnnDevice APIs to query for all possible GPUs to target.

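GPU QnnDevice Example

```cpp
// Query the platform for its hardware devices. m_logHandle is a
// Qnn_LogHandle_t created elsewhere by the client.
const QnnDevice_PlatformInfo_t* platformInfo = nullptr;
auto status = QnnDevice_getPlatformInfo(m_logHandle, &platformInfo);
```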

The type of GPU will be specified under the QnnDevice\_HardwareDeviceInfoV1\_t struct’s deviceType field which will be
populated with the QNN\_GPU\_DEVICE\_PRIMARY\_GPU or QNN\_GPU\_DEVICE\_SECONDARY\_GPU values. Afterwards, a Qnn\_DeviceHandle\_t can
be generated via the following:

GPU QnnDevice Example Continued

```cpp
// Pass the queried platform info back as the device configuration.
QnnDevice_Config_t qnnDeviceConfig;
qnnDeviceConfig.option       = QNN_DEVICE_CONFIG_OPTION_PLATFORM_INFO;
qnnDeviceConfig.hardwareInfo = const_cast<QnnDevice_PlatformInfo_t*>(platformInfo);

// The config list is null-terminated.
const QnnDevice_Config_t* configs[] = {&qnnDeviceConfig, nullptr};
Qnn_DeviceHandle_t deviceHandle = nullptr;
EXPECT_EQ(QnnDevice_create(m_logHandle, configs, &deviceHandle), QNN_SUCCESS);
```

On a multi-GPU system, ensure that the hardwareInfo field is associated with exactly one device.
Alternatively, a specific GPU on a multi-GPU system can be targeted through the qnn-net-run and
qnn-context-binary-generator tools as follows:

```shell
$ export MODEL_LIB="<qnn_model.so>"
$ export BACKEND_LIB="<path_to_backend_library>/libQnnGpu.so"
$ export INPUT_LIST="<path_to_input_list_file>"
$ export CONFIG_FILE="<path_to_JSON_of_backend_extensions>"
$ qnn-net-run --model $MODEL_LIB \
              --backend $BACKEND_LIB \
              --input_list $INPUT_LIST \
              --config_file $CONFIG_FILE
```

A sample JSON config used for context binary generation in a multi-GPU use case is provided below:

```json
{
    "graph_names": ["model_name"],
    "precision_mode": "fp16",
    "device_id": 0
}
```

Note that the “device\_id” field is zero-indexed, with values from 0 to N-1, where N is the number of GPUs on the targeted platform.
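As a minimal illustration of the zero-based range, the check below accepts only ids from 0 to N-1. The function name
`isValidDeviceId` is hypothetical, and how the GPU count is obtained is left to the caller.

```cpp
#include <cassert>
#include <cstddef>

// Returns true when deviceId is a valid zero-based index on a platform
// that exposes gpuCount GPUs, i.e. 0 <= deviceId <= gpuCount - 1.
bool isValidDeviceId(int deviceId, std::size_t gpuCount) {
  return deviceId >= 0 && static_cast<std::size_t>(deviceId) < gpuCount;
}
```

For a platform with two GPUs, ids 0 and 1 are accepted while 2 and -1 are rejected.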

## Disabling Optimizations

The QNN GPU backend offers three options, each of which disables a corresponding optimization. These options are set
via the custom graph config (see QnnGpuGraph\_CustomConfig\_t).

The QNN GPU backend will share NATIVE tensor memory based upon analysis of the network topology. When
disableMemoryOptimizations is non-zero, each tensor in the
model will be allocated unique memory and sharing is disabled.
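The effect of memory sharing can be illustrated with a generic lifetime-based buffer assignment. The sketch below is
not the backend's algorithm, which is not described here: `TensorLifetime` and `countBuffers` are hypothetical names,
tensor sizes are ignored, and a simple first-fit reuse policy is assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of a tensor: indices of the first and last
// graph node that touch it (inclusive).
struct TensorLifetime {
  int firstUse;
  int lastUse;
};

// Count the buffers needed for the given tensors. With sharing enabled, a
// tensor reuses the first buffer whose previous occupant is no longer live.
std::size_t countBuffers(std::vector<TensorLifetime> tensors,
                         bool disableSharing) {
  if (disableSharing) return tensors.size();  // one unique buffer per tensor
  std::sort(tensors.begin(), tensors.end(),
            [](const TensorLifetime& a, const TensorLifetime& b) {
              return a.firstUse < b.firstUse;
            });
  std::vector<int> occupantLastUse;  // per buffer: last use of its occupant
  for (const TensorLifetime& t : tensors) {
    assert(t.firstUse <= t.lastUse);
    auto reusable =
        std::find_if(occupantLastUse.begin(), occupantLastUse.end(),
                     [&](int lastUse) { return lastUse < t.firstUse; });
    if (reusable != occupantLastUse.end()) {
      *reusable = t.lastUse;  // buffer is free again: reuse it
    } else {
      occupantLastUse.push_back(t.lastUse);  // all buffers busy: add one
    }
  }
  return occupantLastUse.size();
}
```

For a four-tensor chain with lifetimes `{0,1}`, `{1,2}`, `{2,3}`, `{3,4}`, this model needs two buffers with sharing
and four without.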

The QNN GPU backend will fuse compatible operations into one operation to improve
QnnGraph\_execute performance. When
disableNodeOptimizations is non-zero, operations will not be
fused and will be kept separate. The `--debug` option of qnn-net-run also disables operation
fusion.

The QNN GPU backend will use queue recording to improve
QnnGraph\_execute performance. When
disableQueueRecording is non-zero, queue recording is disabled.

## QNN GPU Backend Extensions

The QNN backend extension feature facilitates usage of the backend specific APIs, namely custom configurations.
More documentation on backend extensions can be found under qnn-net-run.
Note that the scope of QNN backend extensions is limited to qnn-net-run.

GPU Backend Extensions is an interface for providing custom options to the GPU backend. In the GPU backend, a list of
graph names is required if graph custom config options are specified, as indicated by the dependencies in the schema
below. The graph custom config options will be applied to each graph. These options can be exercised by providing the
extension shared library libQnnGpuNetRunExtensions.so and a config file, if necessary. The schema for GPU backend
extensions, with the various options available in the config, is shown below:

```json
{
  "type": "object",
  "properties": {
    // Corresponds to the graph name provided to QnnGraph_create
    "graph_names": {"type": "array", "items": {"type": "string"}},

    // Precision Mode [optional]
    // Corresponds to QnnGpuGraph_CustomConfig_t::precisionMode.
    "precision_mode": {"type": "string", "enum": ["fp16", "fp32", "hybrid"]},

    // Disable Memory Optimizations (e.g. sharing tensor memory) [optional]
    // Corresponds to QnnGpuGraph_CustomConfig_t::disableMemoryOptimizations.
    "disable_memory_optimizations": {"type": "boolean"},

    // Disable Node Optimizations (e.g. node fusion) [optional]
    // Corresponds to QnnGpuGraph_CustomConfig_t::disableNodeOptimizations.
    "disable_node_optimizations": {"type": "boolean"},

    // Kernel Disk Repository Path [optional]
    // Corresponds to QnnGpuContext_CustomConfig_t::kernelRepoDir.
    // Valid values are any valid path having read/write permissions.
    "kernel_repo_path": {"type": "string"},

    // Disable Recordable Command Queue [optional]
    // Corresponds to QnnGpuGraph_CustomConfig_t::disableQueueRecording.
    "disable_queue_recording": {"type": "boolean"},

    // Context custom config performance hint [optional]
    // Corresponds to QnnGpuContext_CustomConfig_t::perfHint.
    "perf_hint": {"type": "string", "enum": ["high", "normal", "low"]},

    // Weight Sharing [optional]
    // Corresponds to QnnGpuGraph_CustomConfig_t::weightSharingEnabled.
    "weight_sharing": {"type": "boolean"},

    // Device Id [optional]
    // In a multi-GPU system, corresponds to the targeted GPU
    "device_id": {"type": "integer"}
  },
  "dependencies": {
    "precision_mode": ["graph_names"],
    "disable_memory_optimizations": ["graph_names"],
    "disable_node_optimizations": ["graph_names"],
    "disable_queue_recording": ["graph_names"]
  }
}
```
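The `dependencies` section of the schema can be read as a simple rule: any of the four listed graph-level options
requires `graph_names` to be present. The sketch below checks that rule against the set of keys found in a config.
The function name `satisfiesGraphNameDependency` is hypothetical, and JSON parsing is out of scope, so the keys are
passed in directly.

```cpp
#include <cassert>
#include <set>
#include <string>

// Returns true when the keys present in a GPU backend extensions config
// satisfy the schema's "dependencies" section: each listed graph-level
// option requires "graph_names" to be present as well.
bool satisfiesGraphNameDependency(const std::set<std::string>& keys) {
  static const char* const kGraphOptions[] = {
      "precision_mode", "disable_memory_optimizations",
      "disable_node_optimizations", "disable_queue_recording"};
  const bool hasGraphNames = keys.count("graph_names") > 0;
  for (const char* option : kGraphOptions) {
    if (keys.count(option) > 0 && !hasGraphNames) return false;
  }
  return true;
}
```

A config containing only `perf_hint` and `kernel_repo_path` passes, since neither appears in the dependency list,
while a config containing `precision_mode` without `graph_names` does not.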

To use backend extension parameters with qnn-net-run, pass the path to the JSON file via the `--config_file` argument.

```shell
$ qnn-net-run --model <qnn_model_name.so> \
              --backend <path_to_backend_library>/libQnnGpu.so \
              --output_dir <output_dir_for_result> \
              --input_list <path_to_input_list.txt> \
              --perf_profile <performance_mode_to_be_used> \
              --config_file <path_to_JSON_of_backend_extensions>
```

A minimal version of the config file passed via `--config_file`, containing only the backend extensions settings, is given below:

```json
{
    "backend_extensions" :
    {
        "shared_library_path" : "path_to_shared_library",
        "config_file_path" : "path_to_config_file"
    }
}
```

## Custom Profile Reader

The qnn-profile-viewer application can accept different readers and
writers. The QNN GPU backend offers the libQnnGpuProfilingReader.so library to output profiling data in a JSON format.

## Op Package Writing Guidelines

Detailed information regarding op package writing will be provided in a future release. In the meantime, please refer to
the op package example which can be found in `${QNN_SDK_ROOT}/examples/QNN/OpPackage/GPU/`.

## QNN Mem API Tutorial for GPU

The QNN GPU backend supports the QnnMem API, which enables the use of user-provided OpenCL buffers for input
and output tensors. Allowing users to provide their own OpenCL buffers eliminates the need to copy data
between the host CPU and the GPU.

- [QNN GPU QnnMem API Tutorial](https://docs.qualcomm.com/doc/80-63442-10/topic/gpu_qnnmem_api_tutorial.html)

## Tuning Mode (Beta)

When the tuning mode is enabled, all the kernels are iteratively profiled and the performance metrics are stored in the
performanceCache. The best performing kernels are then used to generate a contextBinary leading to a faster and
optimized model.

- [QNN GPU Tuning Mode Tutorial](https://docs.qualcomm.com/doc/80-63442-10/topic/gpu_tuning_mode_tutorial.html)
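The idea of profiling candidates and keeping the best performer can be sketched as follows. This is a generic
illustration, not the backend's tuning procedure: `CandidateProfile`, `meanTime`, and `selectBestVariant` are
hypothetical names, and using the mean execution time as the metric is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical profiling record: one candidate kernel variant and the
// execution times measured for it over several iterations.
struct CandidateProfile {
  int variantId;
  std::vector<double> timesMicros;
};

// Average the measured times of one candidate.
double meanTime(const CandidateProfile& candidate) {
  assert(!candidate.timesMicros.empty());
  double total = 0.0;
  for (double t : candidate.timesMicros) total += t;
  return total / static_cast<double>(candidate.timesMicros.size());
}

// Return the variantId with the lowest mean execution time.
int selectBestVariant(const std::vector<CandidateProfile>& candidates) {
  assert(!candidates.empty());
  int best = candidates.front().variantId;
  double bestTime = std::numeric_limits<double>::infinity();
  for (const CandidateProfile& candidate : candidates) {
    const double mean = meanTime(candidate);
    if (mean < bestTime) {
      bestTime = mean;
      best = candidate.variantId;
    }
  }
  return best;
}
```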

## Offline Prepare

The QNN GPU backend supports the offline generation of context binaries on x86-64-linux platforms. This can be
achieved using the `qnn-context-binary-generator` utility and by adding the path to the offline OpenCL compiler
(Adreno Offline Compiler (AOC)) executable to the $PATH environment variable.

Note

The QNN GPU offline prepare feature is used in conjunction with the Adreno Offline Compiler. The
minimum version supported for this tool is 6.3.0 and is limited to partner builds.

Usage is as follows:

GPU Offline Prepare Example

```shell
# Add the path to the offline OpenCL compiler to the $PATH environment variable
$ export PATH=/path/to/offline_opencl_compiler_binary_executable/:$PATH
$ export MODEL_LIB="<qnn_model.so>"
$ export BACKEND_LIB="<path_to_backend_library>/libQnnGpu.so"
$ export BINARY_FILE_NAME="context_cache"
$ export CONFIG_FILE="<path_to_JSON_of_backend_extensions>"
$ export SOC_MODEL="<soc_model_to_prepare_for>"
$ qnn-context-binary-generator --model $MODEL_LIB \
                 --backend $BACKEND_LIB \
                 --binary_file $BINARY_FILE_NAME \
                 --config_file $CONFIG_FILE \
                 --soc_model $SOC_MODEL
```

After qnn-context-binary-generator completes, you should see a directory named “output” containing the
context binary, whose file name is the value passed to the `--binary_file` argument.

**Supported Devices for QNN GPU Offline Prepare**

| Device | SOC Model |
| --- | --- |
| QCS7230 | 51 |
| SM8350 | 30 |
| SM8325 | 34 |
| SM7475 | 54 |
| SM8450 | 36 |
| SXR2230P | 53 |
| SC8380XP | 60 |
| SM7675 | 70 |
| SM8650 | 57 |
| QCS8625 | 90 |
| SM8750 | 69 |
| SM8850 | 87 |
| SM8975 | 103 |
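For scripting, the table can be captured as a lookup from device name to SOC model id. The helper below is a
convenience sketch: `socModelFor` is a hypothetical name, the values are copied from the table above, and using the
returned id as the `--soc_model` argument is an assumption based on the example command. It requires C++17 for
`std::optional`.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Look up the SOC model id for a supported device name. Returns
// std::nullopt for devices that are not listed in the table.
std::optional<int> socModelFor(const std::string& device) {
  static const std::map<std::string, int> kSocModels = {
      {"QCS7230", 51}, {"SM8350", 30},   {"SM8325", 34},   {"SM7475", 54},
      {"SM8450", 36},  {"SXR2230P", 53}, {"SC8380XP", 60}, {"SM7675", 70},
      {"SM8650", 57},  {"QCS8625", 90},  {"SM8750", 69},   {"SM8850", 87},
      {"SM8975", 103}};
  const auto it = kSocModels.find(device);
  if (it == kSocModels.end()) return std::nullopt;
  return it->second;
}
```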

## Other Notes

- Variable input dimensions (e.g. batch) are currently not supported
- Variable output dimensions are currently not supported
- Signed zero values are supported

Last Published: Jul 02, 2026
