# GPU

This section provides information about the QNN GPU backend.

- [API Specializations](https://docs.qualcomm.com/doc/80-63442-50/topic/gpu_backend.html#api-specializations)
- [Operation Limitations](https://docs.qualcomm.com/doc/80-63442-50/topic/gpu_backend.html#operation-limitations)
- [Kernel Persistence](https://docs.qualcomm.com/doc/80-63442-50/topic/gpu_backend.html#kernel-persistence)
- [Precision Mode](https://docs.qualcomm.com/doc/80-63442-50/topic/gpu_backend.html#precision-mode)
- [Performance Hints](https://docs.qualcomm.com/doc/80-63442-50/topic/gpu_backend.html#performance-hints)
- [Context Configs](https://docs.qualcomm.com/doc/80-63442-50/topic/gpu_backend.html#context-configs)
- [Backend Configs](https://docs.qualcomm.com/doc/80-63442-50/topic/gpu_backend.html#backend-configs)
- [Disabling Optimizations](https://docs.qualcomm.com/doc/80-63442-50/topic/gpu_backend.html#disabling-optimizations)
- [QNN GPU Backend Extensions](https://docs.qualcomm.com/doc/80-63442-50/topic/gpu_backend.html#qnn-gpu-backend-extensions)
- [Custom Profile Reader](https://docs.qualcomm.com/doc/80-63442-50/topic/gpu_backend.html#custom-profile-reader)
- [Op Package Writing Guidelines](https://docs.qualcomm.com/doc/80-63442-50/topic/gpu_backend.html#op-package-writing-guidelines)
- [QNN Mem API Tutorial for GPU](https://docs.qualcomm.com/doc/80-63442-50/topic/gpu_backend.html#qnn-mem-api-tutorial-for-gpu)
- [Tuning Mode (Beta)](https://docs.qualcomm.com/doc/80-63442-50/topic/gpu_backend.html#tuning-mode-beta)
- [Other Notes](https://docs.qualcomm.com/doc/80-63442-50/topic/gpu_backend.html#other-notes)

## API Specializations

This section contains information related to API specialization for the GPU backend. All QNN GPU
backend specializations are available under the `${QNN_SDK_ROOT}/include/QNN/GPU/` directory.

The current version of the QNN GPU backend API is:

- QNN\_GPU\_API\_VERSION\_MAJOR 3
- QNN\_GPU\_API\_VERSION\_MINOR 11
- QNN\_GPU\_API\_VERSION\_PATCH 0

## Operation Limitations

QNN GPU operation limitations are documented in [GPU Backend Op Definition Supplement](https://docs.qualcomm.com/doc/80-63442-50/topic/GpuOpDefSupplement.html#gpu-backend-op-definition-supplement).

## Kernel Persistence

The QNN GPU backend supports two kernel persistence strategies held within a QNN Context: in-memory and on-disk.
The in-memory persistence is called the kernel registry, and the on-disk persistence is called the kernel repository.
Both mechanisms re-use kernels to reduce model initialization time. The following simple use case outlines how to use
these features.

A user creates a new QNN GPU Context by calling
[QnnContext\_create](https://docs.qualcomm.com/doc/80-63442-50/topic/function_QnnContext_8h_1a32eb31e802e865f4cb33c5097e367773.html#exhale-function-qnncontext-8h-1a32eb31e802e865f4cb33c5097e367773) with a custom config
setting that provides a valid [kernelRepoDir](https://docs.qualcomm.com/doc/80-63442-50/topic/structQnnGpuContext__CustomConfig__t.html#exhale-struct-structqnngpucontext-customconfig-t). Let's assume this
path is `${QNN_GPU_KERNEL_REPO}` and that no on-disk repo exists at this path yet. Therefore,
no kernels are deserialized and the in-memory registry starts empty. When an on-disk repo does exist, kernels
originating from the built-in qti.aisw op package are deserialized during
[QnnContext\_create](https://docs.qualcomm.com/doc/80-63442-50/topic/function_QnnContext_8h_1a32eb31e802e865f4cb33c5097e367773.html#exhale-function-qnncontext-8h-1a32eb31e802e865f4cb33c5097e367773), and kernels originating from
another op package are deserialized when that op package is registered via
[QnnBackend\_registerOpPackage](https://docs.qualcomm.com/doc/80-63442-50/topic/function_QnnBackend_8h_1a95dd59ad0b59872f3649f7c363c23441.html#exhale-function-qnnbackend-8h-1a95dd59ad0b59872f3649f7c363c23441).

The user creates model A and finalizes it. Suppose that
model A comprises kernels 1, 2, and 3. These kernels are created from scratch and added to the in-memory kernel
registry. The user then creates model B and finalizes it. Suppose that model B comprises kernels 3 and 4. Kernel 3 is
recovered from the in-memory kernel registry, and kernel 4 is created from scratch and added to the registry.

The user now calls [QnnContext\_free](https://docs.qualcomm.com/doc/80-63442-50/topic/function_QnnContext_8h_1ada3a582e9ab571599958c60665c7a2c8.html#exhale-function-qnncontext-8h-1ada3a582e9ab571599958c60665c7a2c8). Since
a valid kernel repo path was provided, the QNN GPU Context will serialize in-memory kernels and, for each op package,
write them to `${QNN_GPU_KERNEL_REPO}/gpukernelcache.${OP_PKG_NAME}` where OP\_PKG\_NAME is the op package
[packageName](https://docs.qualcomm.com/doc/80-63442-50/topic/structQnnOpPackage__Info__t.html#exhale-struct-structqnnoppackage-info-t).

If the user creates another QNN GPU Context specifying the same kernel repo path, these kernels will be deserialized
as outlined above and added to the in-memory kernel registry. If the user now creates model A or B, all kernels will be
ready for creation via the in-memory registry, greatly reducing initialization time.

Note that an op package provides a
[kernelRepoHash](https://docs.qualcomm.com/doc/80-63442-50/topic/structQnnGpuOpPackage__PackageInfo__t.html#exhale-struct-structqnngpuoppackage-packageinfo-t) to the Context. If the QNN
GPU Context detects that an on-disk kernel repository was generated by an op package of the same name, but with a
different kernelRepoHash, the on-disk repository will be automatically invalidated. This ensures that kernel version
mismatches do not occur.
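
The walk-through above, including the hash check, can be modeled with a small toy in Python. This is only a sketch of the described behavior and not the backend's implementation: the class `ToyKernelContext`, the JSON file layout, and the hash strings `"v1"` and `"v2"` are all hypothetical, and "compiling" a kernel is reduced to storing a placeholder string. Only the file naming `gpukernelcache.${OP_PKG_NAME}` and the qti.aisw package name come from the text.

```python
import json
import tempfile
from pathlib import Path


class ToyKernelContext:
    """Toy stand-in for a QNN GPU Context with a kernel registry and repository."""

    def __init__(self, repo_dir, package_name, repo_hash):
        self.repo_file = Path(repo_dir) / f"gpukernelcache.{package_name}"
        self.repo_hash = repo_hash
        self.registry = {}  # in-memory kernel registry: kernel id -> placeholder
        self.compiled = []  # kernels created from scratch in this context
        if self.repo_file.exists():
            stored = json.loads(self.repo_file.read_text())
            # A repository written with a different hash is ignored (invalidated).
            if stored["hash"] == repo_hash:
                self.registry = stored["kernels"]

    def finalize_model(self, kernel_ids):
        for kernel_id in kernel_ids:
            key = str(kernel_id)  # JSON object keys are strings
            if key not in self.registry:
                self.registry[key] = f"binary-{key}"  # stands in for compilation
                self.compiled.append(kernel_id)

    def free(self):
        # Serialize the in-memory registry to the on-disk repository.
        payload = {"hash": self.repo_hash, "kernels": self.registry}
        self.repo_file.write_text(json.dumps(payload))


def run_walkthrough():
    with tempfile.TemporaryDirectory() as repo_dir:
        first = ToyKernelContext(repo_dir, "qti.aisw", repo_hash="v1")
        first.finalize_model([1, 2, 3])  # model A: all kernels created
        first.finalize_model([3, 4])     # model B: kernel 3 re-used, 4 created
        first.free()

        second = ToyKernelContext(repo_dir, "qti.aisw", repo_hash="v1")
        second.finalize_model([1, 2, 3])  # model A again: nothing to create

        changed = ToyKernelContext(repo_dir, "qti.aisw", repo_hash="v2")
        changed.finalize_model([3, 4])    # hash differs: repository ignored
        return first.compiled, second.compiled, changed.compiled


if __name__ == "__main__":
    print(run_walkthrough())
```

In this toy, the first context creates four kernels, the second context creates none because the repository is re-used, and the context with a different hash creates its kernels again.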

Also note that these QNN GPU kernel persistence features are separate from the QNN context cache feature (see
[QnnContext\_getBinary](https://docs.qualcomm.com/doc/80-63442-50/topic/function_QnnContext_8h_1aa1c220389821ddf1e9d0de46b8fba0f9.html#exhale-function-qnncontext-8h-1aa1c220389821ddf1e9d0de46b8fba0f9)). A QNN GPU context cache
will store everything needed to re-create a context, including kernels.

## Precision Mode

The QNN GPU backend offers four precision modes via the QNN graph custom config feature
(see [QnnGpuGraph\_CustomConfig\_t](https://docs.qualcomm.com/doc/80-63442-50/topic/structQnnGpuGraph__CustomConfig__t.html#exhale-struct-structqnngpugraph-customconfig-t) and
[QnnGpu\_Precision\_t](https://docs.qualcomm.com/doc/80-63442-50/topic/enum_QnnGpuGraph_8h_1af5b8531e7f98c28fcf5dc896252a70b9.html#exhale-enum-qnngpugraph-8h-1af5b8531e7f98c28fcf5dc896252a70b9)). These modes are:

- QNN\_GPU\_PRECISION\_FP32 (FP32 mode)
    - FP32 mode will convert NATIVE tensor data types to FP32 and will select kernels that use an FP32 accumulator.
    - FP32 mode offers the best accuracy at the expense of performance.
- QNN\_GPU\_PRECISION\_FP16 (FP16 mode)
    - FP16 mode will convert NATIVE tensor data types to FP16 and will select kernels that use an FP16 accumulator
      where possible.
    - FP16 mode offers the best performance at the expense of accuracy.
- QNN\_GPU\_PRECISION\_HYBRID (Hybrid mode)
    - Hybrid mode will convert NATIVE tensor data types to FP16 and will select kernels that use an FP32 accumulator.
    - Hybrid mode offers a good trade-off between performance and accuracy.
- QNN\_GPU\_PRECISION\_USER\_PROVIDED
    - This is the default precision mode when a custom config has not been provided.
    - The QNN GPU backend will not optimize NATIVE tensor data types.
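
To see why the accumulator type matters, the following sketch emulates the three storage and accumulator combinations on the host using only the Python standard library. It is an illustration under stated assumptions, not a model of the actual GPU kernels: the helper names are hypothetical, and the workload (summing 4096 copies of 0.1) is chosen only because it makes the rounding effects visible.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def to_fp32(value):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def accumulate(values, round_accumulator):
    """Sum values, rounding the running total after every addition."""
    total = 0.0
    for value in values:
        total = round_accumulator(total + value)
    return total


def compare_modes(count=4096, element=0.1):
    fp16_data = [to_fp16(element)] * count  # FP16 storage (FP16 and hybrid)
    fp32_data = [to_fp32(element)] * count  # FP32 storage (FP32 mode)
    return {
        "fp32": accumulate(fp32_data, to_fp32),    # FP32 data, FP32 accumulator
        "fp16": accumulate(fp16_data, to_fp16),    # FP16 data, FP16 accumulator
        "hybrid": accumulate(fp16_data, to_fp32),  # FP16 data, FP32 accumulator
        "exact": element * count,
    }


if __name__ == "__main__":
    results = compare_modes()
    for mode in ("fp32", "hybrid", "fp16"):
        error = abs(results[mode] - results["exact"])
        print(f"{mode:>6}: sum={results[mode]!r} error={error!r}")
```

In this emulation the FP16 accumulator falls far short of the true sum because small addends are rounded away once the running total is large, while the hybrid combination only carries the input rounding error.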

## Performance Hints

The QNN GPU backend offers three performance hints via the QNN context custom config feature
(see [QnnGpuContext\_CustomConfig\_t](https://docs.qualcomm.com/doc/80-63442-50/topic/structQnnGpuContext__CustomConfig__t.html#exhale-struct-structqnngpucontext-customconfig-t) and
[QnnGpuContext\_PerfHint\_t](https://docs.qualcomm.com/doc/80-63442-50/topic/enum_QnnGpuContext_8h_1a3d215d3dbd62c5c5668acacd384578b2.html#exhale-enum-qnngpucontext-8h-1a3d215d3dbd62c5c5668acacd384578b2)). These hints are:

- QNN\_GPU\_CONTEXT\_PERF\_HINT\_HIGH
    - The HIGH perf hint will maximize GPU clock frequencies.
    - HIGH perf hint offers the best inference latency at the expense of power consumption.
    - This is the default.
- QNN\_GPU\_CONTEXT\_PERF\_HINT\_NORMAL
    - The NORMAL perf hint offers balanced performance dependent upon power management.
- QNN\_GPU\_CONTEXT\_PERF\_HINT\_LOW
    - The LOW perf hint will minimize GPU clock frequencies.
    - LOW perf hint offers the lowest power consumption at the expense of inference latency.

Note that performance hints are included in the context cache. However, calls to
[QnnContext\_setConfig](https://docs.qualcomm.com/doc/80-63442-50/topic/function_QnnContext_8h_1a380694614a9167136c744b21da34feb7.html#exhale-function-qnncontext-8h-1a380694614a9167136c744b21da34feb7) can override the
cached performance hint setting.

## Context Configs

QnnContext custom configs ([QnnGpuContext\_CustomConfig\_t](https://docs.qualcomm.com/doc/80-63442-50/topic/structQnnGpuContext__CustomConfig__t.html#exhale-struct-structqnngpucontext-customconfig-t))
and Context Priority (see [Qnn\_Priority\_t](https://docs.qualcomm.com/doc/80-63442-50/topic/enum_QnnTypes_8h_1a4394623faa5580a396f83dac19565e4d.html#exhale-enum-qnntypes-8h-1a4394623faa5580a396f83dac19565e4d)
and [QnnContext\_ConfigOption\_t](https://docs.qualcomm.com/doc/80-63442-50/topic/enum_QnnContext_8h_1a054235316eddc82552593ec91318f90e.html#exhale-enum-qnncontext-8h-1a054235316eddc82552593ec91318f90e)) are supported.

## Backend Configs

QnnBackend custom configs ([QnnGpuBackend\_CustomConfig\_t](https://docs.qualcomm.com/doc/80-63442-50/topic/structQnnGpuBackend__CustomConfig__t.html#exhale-struct-structqnngpubackend-customconfig-t))
and QnnGpuBackend\_ConfigOption\_t are supported.

## Disabling Optimizations

The QNN GPU backend offers three options, each of which disables one optimization. These options are set via the
custom graph config (see [QnnGpuGraph\_CustomConfig\_t](https://docs.qualcomm.com/doc/80-63442-50/topic/structQnnGpuGraph__CustomConfig__t.html#exhale-struct-structqnngpugraph-customconfig-t)).

The QNN GPU backend will share NATIVE tensor memory based upon analysis of the network topology. When
[disableMemoryOptimizations](https://docs.qualcomm.com/doc/80-63442-50/topic/structQnnGpuGraph__CustomConfig__t.html#exhale-struct-structqnngpugraph-customconfig-t) is non-zero, each tensor in the
model will be allocated unique memory and sharing is disabled.

The QNN GPU backend will fuse compatible operations into one operation to improve
[QnnGraph\_execute](https://docs.qualcomm.com/doc/80-63442-50/topic/function_QnnGraph_8h_1a3ea05f42a9295f9a74a2e3a0cdd64228.html#exhale-function-qnngraph-8h-1a3ea05f42a9295f9a74a2e3a0cdd64228) performance. When
[disableNodeOptimizations](https://docs.qualcomm.com/doc/80-63442-50/topic/structQnnGpuGraph__CustomConfig__t.html#exhale-struct-structqnngpugraph-customconfig-t) is non-zero, operations will not be
fused and will be kept separate. The `--debug` option of [qnn-net-run](https://docs.qualcomm.com/doc/80-63442-50/topic/tools.html#qnn-net-run) also disables operation
fusion.

The QNN GPU backend will use queue recording to improve
[QnnGraph\_execute](https://docs.qualcomm.com/doc/80-63442-50/topic/function_QnnGraph_8h_1a3ea05f42a9295f9a74a2e3a0cdd64228.html#exhale-function-qnngraph-8h-1a3ea05f42a9295f9a74a2e3a0cdd64228) performance. When
[disableQueueRecording](https://docs.qualcomm.com/doc/80-63442-50/topic/structQnnGpuGraph__CustomConfig__t.html#exhale-struct-structqnngpugraph-customconfig-t) is non-zero, queue recording is disabled.

## QNN GPU Backend Extensions

The QNN backend extension feature facilitates usage of the backend specific APIs, namely custom configurations.
More documentation on backend extensions can be found under [qnn-net-run](https://docs.qualcomm.com/doc/80-63442-50/topic/tools.html#qnn-net-run).
Note that the scope of QNN backend extensions is limited to qnn-net-run.

GPU backend extensions are an interface for providing custom options to the GPU backend. A list of graph
names is required if graph custom config options are specified, as indicated by the dependencies in the schema below.
The graph custom config options will be applied to each graph. These options can be exercised by providing the extension
shared library libQnnGpuNetRunExtensions.so and, if necessary, a config file. The schema for the GPU backend extensions
config, with the available options, is shown below:

```
{
  "type": "object",
  "properties": {
    // Corresponds to the graph name provided to QnnGraph_create
    "graph_names": {"type": "array", "items": {"type": "string"}},

    // Precision Mode [optional]
    // Corresponds to QnnGpuGraph_CustomConfig_t::precisionMode.
    "precision_mode": {"type": "string", "enum": ["fp16", "fp32", "hybrid"]},

    // Disable Memory Optimizations (e.g. sharing tensor memory) [optional]
    // Corresponds to QnnGpuGraph_CustomConfig_t::disableMemoryOptimizations.
    "disable_memory_optimizations": {"type": "boolean"},

    // Disable Node Optimizations (e.g. node fusion) [optional]
    // Corresponds to QnnGpuGraph_CustomConfig_t::disableNodeOptimizations.
    "disable_node_optimizations": {"type": "boolean"},

    // Kernel Disk Repository Path [optional]
    // Corresponds to QnnGpuContext_CustomConfig_t::kernelRepoDir.
    // Valid values are any valid path having read/write permissions.
    "kernel_repo_path": {"type": "string"},

    // Disable Recordable Command Queue [optional]
    // Corresponds to QnnGpuGraph_CustomConfig_t::disableQueueRecording.
    "disable_queue_recording": {"type": "boolean"},

    // Context custom config performance hint [optional]
    // Corresponds to QnnGpuContext_CustomConfig_t::perfHint.
    "perf_hint": {"type": "string", "enum": ["high", "normal", "low"]},

    // Weight Sharing [optional]
    // Corresponds to QnnGpuGraph_CustomConfig_t::weightSharingEnabled.
    "weight_sharing": {"type": "boolean"}
  },
  "dependencies": {
    "precision_mode": ["graph_names"],
    "disable_memory_optimizations": ["graph_names"],
    "disable_node_optimizations": ["graph_names"],
    "disable_queue_recording": ["graph_names"]
  }
}
```
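
The rules in the schema, in particular the dependency of graph options on `graph_names`, can be checked before a config is handed to qnn-net-run. The following is a minimal sketch in Python that hand-codes the schema's constraints. The function name `check_gpu_extension_config` and its error messages are hypothetical and are not part of the SDK.

```python
# Options that the schema's "dependencies" section ties to graph_names.
GRAPH_OPTIONS = (
    "precision_mode",
    "disable_memory_optimizations",
    "disable_node_optimizations",
    "disable_queue_recording",
)
# String options restricted to an enum in the schema.
ENUMS = {
    "precision_mode": {"fp16", "fp32", "hybrid"},
    "perf_hint": {"high", "normal", "low"},
}
# Options declared as booleans in the schema.
BOOLEANS = (
    "disable_memory_optimizations",
    "disable_node_optimizations",
    "disable_queue_recording",
    "weight_sharing",
)


def check_gpu_extension_config(config):
    """Return a list of problems found in a GPU backend extension config dict."""
    problems = []
    names = config.get("graph_names")
    if names is not None and not (
        isinstance(names, list) and all(isinstance(n, str) for n in names)
    ):
        problems.append("graph_names must be an array of strings")
    for key, allowed in ENUMS.items():
        if key in config:
            value = config[key]
            if not isinstance(value, str) or value not in allowed:
                problems.append(f"{key} must be one of {sorted(allowed)}")
    for key in BOOLEANS:
        if key in config and not isinstance(config[key], bool):
            problems.append(f"{key} must be a boolean")
    path = config.get("kernel_repo_path")
    if path is not None and not isinstance(path, str):
        problems.append("kernel_repo_path must be a string")
    # "dependencies": graph-level options require graph_names to be present.
    for key in GRAPH_OPTIONS:
        if key in config and "graph_names" not in config:
            problems.append(f"{key} requires graph_names")
    return problems


if __name__ == "__main__":
    print(check_gpu_extension_config({"precision_mode": "fp16"}))
```

An empty list means the config satisfies the constraints encoded here. Context-level options such as `perf_hint` and `kernel_repo_path` pass without `graph_names`, matching the schema's dependency list.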

To pass backend extension parameters to qnn-net-run, use the `--config_file` argument with the path to a JSON file:

```
$ qnn-net-run --model <qnn_model_name.so> \
              --backend <path_to_backend_library>/libQnnGpu.so \
              --output_dir <output_dir_for_result> \
              --input_list <path_to_input_list.txt> \
              --perf_profile <performance_mode_to_be_used> \
              --config_file <path_to_JSON_of_backend_extensions>
```

A minimal config file for `--config_file`, specifying only the backend extensions, is shown below:

```
{
    "backend_extensions" :
    {
        "shared_library_path" : "path_to_shared_library",
        "config_file_path" : "path_to_config_file"
    }
}
```
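
The two JSON files and the command line can also be generated from a script. The sketch below assumes that `config_file_path` points at a file following the GPU backend extensions schema and that `shared_library_path` points at libQnnGpuNetRunExtensions.so. The file names `gpu_options.json` and `backend_extensions.json`, the graph name, and the model and input names are hypothetical placeholders. The command is only assembled, not executed.

```python
import json
import tempfile
from pathlib import Path


def write_extension_configs(out_dir, extension_library, gpu_options):
    """Write the GPU options file and the top-level file that points at it."""
    out_dir = Path(out_dir)
    gpu_config = out_dir / "gpu_options.json"         # hypothetical file name
    top_config = out_dir / "backend_extensions.json"  # hypothetical file name
    gpu_config.write_text(json.dumps(gpu_options, indent=2))
    top_level = {
        "backend_extensions": {
            "shared_library_path": str(extension_library),
            "config_file_path": str(gpu_config),
        }
    }
    top_config.write_text(json.dumps(top_level, indent=2))
    return top_config


def net_run_command(model, backend, input_list, output_dir, config_file):
    """Assemble the qnn-net-run argument list without executing it."""
    return [
        "qnn-net-run",
        "--model", str(model),
        "--backend", str(backend),
        "--output_dir", str(output_dir),
        "--input_list", str(input_list),
        "--config_file", str(config_file),
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work_dir:
        config = write_extension_configs(
            work_dir,
            "libQnnGpuNetRunExtensions.so",
            {
                "graph_names": ["my_graph"],  # hypothetical graph name
                "precision_mode": "hybrid",
                "perf_hint": "normal",
            },
        )
        command = net_run_command(
            "libmy_model.so", "libQnnGpu.so", "input_list.txt", "output", config
        )
        print(" ".join(command))
```

Returning the command as a list keeps each argument separate, so it can be passed to a process launcher without shell quoting concerns.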

## Custom Profile Reader

The [qnn-profile-viewer](https://docs.qualcomm.com/doc/80-63442-50/topic/tools.html#qnn-profile-viewer) application can accept different readers and
writers. The QNN GPU backend offers the libQnnGpuProfilingReader.so library to output profiling data in a JSON format.

## Op Package Writing Guidelines

Detailed information regarding op package writing will be provided in a future release. In the meantime, please refer to
the op package example which can be found in `${QNN_SDK_ROOT}/examples/QNN/OpPackage/GPU/`.

## QNN Mem API Tutorial for GPU

The QNN GPU backend supports the QnnMem API, which lets users provide their own OpenCL buffers for input
and output tensors. Using user-provided OpenCL buffers eliminates the need to copy data
between the host CPU and GPU.

- [QNN GPU QnnMem API Tutorial](https://docs.qualcomm.com/doc/80-63442-50/topic/gpu_qnnmem_api_tutorial.html)

## Tuning Mode (Beta)

When the tuning mode is enabled, all the kernels are iteratively profiled and the performance metrics are stored in the
performanceCache. The best performing kernels are then used to generate a contextBinary leading to a faster and
optimized model.

- [QNN GPU Tuning Mode Tutorial](https://docs.qualcomm.com/doc/80-63442-50/topic/gpu_tuning_mode_tutorial.html)

## Other Notes

- Variable input dimensions (e.g. batch) are currently not supported
- Variable output dimensions are currently not supported
- Signed zero values are supported

Last Published: Oct 10, 2025
