# QNN async execution for HTP backend

Clients can call the QnnGraph\_executeAsync API to execute graphs asynchronously. The call queues the execution request and returns immediately.
Execution happens in the background, and the client is notified on completion through the notification function registered in the
QnnGraph\_executeAsync call.

1. The main purpose of the QnnGraph\_executeAsync API is to let a client queue up the same or different graphs from a single thread without being blocked.
2. The overall inference rate can improve when multiple back-to-back inferences are queued.
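
The queue-and-notify behavior described above can be pictured with a small, self-contained C++ model. This is a sketch under stated assumptions, not the QNN implementation: `AsyncExecutorModel`, `submit()`, and the `std::function` job and callback types are hypothetical. `submit()` stands in for QnnGraph\_executeAsync, and a single worker thread stands in for the backend.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical model, not the QNN API: submit() plays the role of
// QnnGraph_executeAsync (queue and return), the worker thread plays the
// role of the background execution engine.
class AsyncExecutorModel {
 public:
  using Job = std::function<int()>;           // an "inference" returning a status
  using NotifyFn = std::function<void(int)>;  // completion callback

  AsyncExecutorModel() : worker_([this] { run(); }) {}

  ~AsyncExecutorModel() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stop_ = true;
    }
    cv_.notify_one();
    worker_.join();  // the worker drains remaining jobs before exiting
  }

  // Queues a request and returns immediately; the caller is not blocked
  // while the job executes.
  void submit(Job job, NotifyFn notify) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.emplace_back(std::move(job), std::move(notify));
    }
    cv_.notify_one();
  }

 private:
  void run() {
    for (;;) {
      std::pair<Job, NotifyFn> item;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return stop_ || !queue_.empty(); });
        if (queue_.empty()) return;  // stop requested and nothing left to run
        item = std::move(queue_.front());
        queue_.pop_front();
      }
      int status = item.first();  // execute outside the lock
      item.second(status);        // notify the client on completion
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::pair<Job, NotifyFn>> queue_;
  bool stop_ = false;
  std::thread worker_;  // declared last so it starts after the other members
};
```

A client thread can call `submit()` several times in a row without waiting; each callback then fires on the worker thread as its job finishes, in queue order in this single-worker model.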

## Notification function

The `notifyFn` callback passed to the QnnGraph\_executeAsync API should be short and should mainly be used to signal other threads to perform any follow-up work.
A long-running callback delays the execution of other inferences in the process.
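
One way to keep the callback short is to let it only record the completion status and wake a waiting client thread, which then does the expensive work. In this sketch, `CompletionSignal`, `onInferenceDone`, and `waitAndPostProcess` are hypothetical names, and the `(void* param, int status)` callback shape is an assumption; the actual notification function type and status type are defined in the QNN headers.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical completion record shared between the notify callback and a
// client thread.
struct CompletionSignal {
  std::mutex mu;
  std::condition_variable cv;
  bool done = false;
  int status = 0;
};

// Assumed callback shape (void* param, int status). Kept deliberately short:
// record the status, wake the waiter, return. No post-processing, I/O, or
// new submissions happen here.
void onInferenceDone(void* param, int status) {
  auto* sig = static_cast<CompletionSignal*>(param);
  {
    std::lock_guard<std::mutex> lock(sig->mu);
    sig->done = true;
    sig->status = status;
  }
  sig->cv.notify_one();
}

// Runs on a client thread: blocks until notified, then does the heavy work
// (reading outputs, logging, queuing follow-up inferences) outside the callback.
int waitAndPostProcess(CompletionSignal& sig) {
  std::unique_lock<std::mutex> lock(sig.mu);
  sig.cv.wait(lock, [&] { return sig.done; });
  int status = sig.status;
  lock.unlock();
  // ... expensive post-processing would go here ...
  return status;
}
```

Because the callback only holds a lock briefly and returns, the thread that invokes it is released quickly, which is the property this section asks for.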

### Queue depth

Clients can configure the async execution queue depth, that is, the number of execution requests that can be queued, by passing the QNN\_CONTEXT\_CONFIG\_ASYNC\_EXECUTION\_QUEUE\_DEPTH configuration to
the QnnContext\_create API for online prepared graphs, or to the QnnContext\_createFromBinary API for offline prepared graphs.

The maximum queue depth is 300. Passing a value higher than 300 results in an error when the context is read.

Note

1. The queue depth can be configured only by passing a configuration to the APIs mentioned above. Once the context has been created, the queue depth cannot be changed.
2. Passing `QNN_CONTEXT_CONFIG_ASYNC_EXECUTION_QUEUE_DEPTH` to QnnContext\_setConfig does not modify the depth.
3. Queueing more executions with QnnGraph\_executeAsync than the configured `QNN_CONTEXT_CONFIG_ASYNC_EXECUTION_QUEUE_DEPTH` value blocks the QnnGraph\_executeAsync call until space is available.
4. For offline prepared graphs, the queue depth configured in the serialized binary is not parsed. The queue depth must be passed explicitly to the QnnContext\_createFromBinary API.
5. If the client does not configure the queue depth, a default queue depth of 10 is used.
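
The depth rules in the notes above can be summarized as a small helper. `effectiveQueueDepth` is hypothetical and only mirrors what this page states (default 10, maximum 300); it does not call QNN. For offline prepared graphs, `requested` would be the value passed to QnnContext\_createFromBinary, since any depth stored in the serialized binary is not parsed. The page does not say how a depth of 0 is handled, so the sketch does not special-case it.

```cpp
#include <cstdint>
#include <optional>

// Values taken from this page: default depth 10, maximum 300.
constexpr std::uint32_t kDefaultAsyncQueueDepth = 10;
constexpr std::uint32_t kMaxAsyncQueueDepth = 300;

// Hypothetical helper. `requested` is the value the client would pass in
// QNN_CONTEXT_CONFIG_ASYNC_EXECUTION_QUEUE_DEPTH, or std::nullopt if the
// option is not passed at all. Returns the effective depth, or std::nullopt
// where the backend is documented to report an error (depth above 300).
std::optional<std::uint32_t> effectiveQueueDepth(std::optional<std::uint32_t> requested) {
  if (!requested) return kDefaultAsyncQueueDepth;              // not configured: default 10
  if (*requested > kMaxAsyncQueueDepth) return std::nullopt;  // above 300: error
  return requested;
}
```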

## Queue depth configuration usage example

Below is an example of creating a QnnContext\_Config\_t that configures a queue depth of 25:

```cpp
#include "QnnContext.h"  // QNN SDK context API and config types

// uint32_t rather than uint8_t, so depths up to the documented maximum of 300 fit.
constexpr uint32_t depth = 25;

QnnContext_AsyncExecutionQueueDepth_t asyncExecQueueDepthCfg;
asyncExecQueueDepthCfg.type  = QNN_CONTEXT_ASYNC_EXECUTION_QUEUE_DEPTH_TYPE_NUMERIC;
asyncExecQueueDepthCfg.depth = depth;

// Wrap the depth setting in a generic context config entry.
QnnContext_Config_t contextQueueDepthConfig;
contextQueueDepthConfig.option             = QNN_CONTEXT_CONFIG_ASYNC_EXECUTION_QUEUE_DEPTH;
contextQueueDepthConfig.asyncExeQueueDepth = asyncExecQueueDepthCfg;

// The config list is a null-terminated array of pointers.
const QnnContext_Config_t* contextConfig[] = {&contextQueueDepthConfig, nullptr};
```

The `contextConfig` array created in the example above can be passed as the `config` parameter of either the QnnContext\_create API (online prepare) or the QnnContext\_createFromBinary API (offline prepare).

### QnnGraph\_executeAsync handles

… in error, but there is a possibility of data being overwritten.

### Platform and chipset support

Async execution for the HTP backend is supported only in the following configurations:

1. All QNX ADAS platforms
2. QNX IVI platforms with Hexagon architecture >= 73
3. RPC stub-skel mode

### Feature or configuration list and support

| Features | Support | Comments |
| --- | --- | --- |
| SSR | Not Supported | It is the client’s<br>responsibility to restore the session if<br>QNN\_COMMON\_ERROR\_SYSTEM or<br>QNN\_COMMON\_ERROR\_SYSTEM\_COMMUNICATION is<br>received. SSR cleanup is not supported;<br>it will be supported in a future release. |
| VTCM sharing | Supported |  |
| Execution Priorities | Supported | Priority cannot be changed after graph execution<br>has started |
| Multiple Device + cores | Supported |  |
| QNN Signal | Supported |  |
| RPC Polling Perf setting | Ignored for async |  |
| Profiling | Supported |  |
| QNN\_GRAPH\_CONFIG\_OPTION\_PROFILE\_HANDLE | Not Supported |  |
| Parallel Graph Execution | Depends on support for the parallel<br>graph execution feature | Currently, a maximum of 2 graphs of the same<br>priority can run in parallel. Async execution<br>evaluates the 3 graphs at the front of<br>the queue to determine whether 2 of these 3<br>graphs can be executed in parallel. |
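
The front-of-queue evaluation described in the Parallel Graph Execution row can be sketched as a window scan. This is an illustrative model only: `QueuedGraph` and `pickParallelPair` are hypothetical, and treating equal priority as the sole condition for a parallel pair is an assumption; the real backend also depends on the parallel graph execution feature and may apply further checks.

```cpp
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical model of a queued request: only the graph priority is tracked.
struct QueuedGraph {
  int priority;
};

// Looks at up to the first three queued requests and returns the indices of
// the first pair with the same priority (the assumed condition for running
// two graphs in parallel). Requests beyond the first three are ignored.
std::optional<std::pair<std::size_t, std::size_t>>
pickParallelPair(const std::vector<QueuedGraph>& queue) {
  const std::size_t window = queue.size() < 3 ? queue.size() : 3;
  for (std::size_t i = 0; i < window; ++i) {
    for (std::size_t j = i + 1; j < window; ++j) {
      if (queue[i].priority == queue[j].priority) return std::make_pair(i, j);
    }
  }
  return std::nullopt;  // no eligible pair: run the front request alone
}
```

For a queue with priorities `{1, 2, 1}` the sketch pairs positions 0 and 2, while for `{1, 2, 3, 1}` it finds no pair because the matching fourth request lies outside the three-request window.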

### Performance

1. A performance benefit is observed only if the memory type of all tensors is QNN\_TENSORMEMTYPE\_MEMHANDLE and the profiling level for the inference is basic.

### Limitations list

1. Graph priority cannot be changed after execution of the graph has started.
2. Queue depth configuration cannot be modified after the context is created.
3. It is the client’s responsibility to pass distinct input, output, profile, and signal handles for inferences that are queued at the same time.
4. No performance benefit is seen when the tensor memory type is not QNN\_TENSORMEMTYPE\_MEMHANDLE or the profiling level is not basic.
5. Executing the same graphHandle synchronously and asynchronously at the same time is not supported.
6. Asynchronous execution is not supported in non-RPC mode.
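
Limitation 3 can be handled on the client side with a pool of per-inference handle sets. The sketch below is hypothetical: `HandleSetPool` only hands out slot indices, and sizing the pool to the configured queue depth is an assumption rather than a QNN requirement. Each index would select one set of input, output, profile, and signal handles, so no two in-flight inferences share a set; `acquire()` blocks while every set is busy, and `release()` would be triggered after the notification for that inference arrives.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical pool of per-inference handle sets, identified by index.
class HandleSetPool {
 public:
  explicit HandleSetPool(std::size_t depth) : inUse_(depth, false) {}

  // Returns the index of a free handle set, waiting if all are in use.
  std::size_t acquire() {
    std::unique_lock<std::mutex> lock(mu_);
    for (;;) {
      for (std::size_t i = 0; i < inUse_.size(); ++i) {
        if (!inUse_[i]) {
          inUse_[i] = true;
          return i;
        }
      }
      cv_.wait(lock);  // all sets busy: wait for a release (loop handles spurious wakeups)
    }
  }

  // Marks a handle set as free again, e.g. after its inference has completed.
  void release(std::size_t slot) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      inUse_[slot] = false;
    }
    cv_.notify_one();
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::vector<bool> inUse_;
};
```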

Last Published: Jun 04, 2026
