# QNN HTP Qmem Graph

QnnGraph currently supports inference with both RAW buffers and MemHandles.
RAW buffers are not accessible from the DSP side, so at graph creation QNN HTP reserves extra RPCMem buffers: inputs are copied into them before inference and outputs are copied out of them after inference.
If the client uses MemHandles (handles to buffers allocated with `rpcmem_alloc` and accessible from the DSP side), no copy is needed and those internal buffers are not used.
Qmem Graph lets the client pass a hint at graph preparation that this use case will use RPCMem buffers, so there is no need to allocate the extra internal RPCMem memory.
If the client uses the Qmem Graph hint at graph creation time and still passes RAW buffers at inference time, QNN HTP allocates the extra buffers at run time, with an expected performance impact on that inference.

## Online Prepare

**Preparation**

With online prepare, the hint (the IO tensor memory type) that tells the QnnHtp backend to reduce memory allocation is embedded in the model.so file.
This information can be passed in through `QnnTensor_createGraphTensor` and `QnnGraph_addNode`.

    // Example graph. For details, please refer to the Sample App.
    Qnn_GraphHandle_t graph;
    // IO tensors
    Qnn_Tensor_t inputTensor;
    Qnn_Tensor_t outputTensor;
    // Set up common settings for tensors ......
    /* There are 2 specific settings for shared buffer:
     *  1. memType should be QNN_TENSORMEMTYPE_MEMHANDLE;
     *  2. union member memHandle should be used instead of clientBuf, and it
     *     should be set to nullptr.
     */
    inputTensor.v1.memType    = QNN_TENSORMEMTYPE_MEMHANDLE;
    inputTensor.v1.memHandle  = nullptr;
    outputTensor.v1.memType   = QNN_TENSORMEMTYPE_MEMHANDLE;
    outputTensor.v1.memHandle = nullptr;
    QnnTensor_createGraphTensor(graph, &inputTensor);
    QnnTensor_createGraphTensor(graph, &outputTensor);

    // Create a Qnn_OpConfig_t with the IO tensors just created
    Qnn_OpConfig_t opConfig;
    QnnGraph_addNode(graph, opConfig);

**Execution**

Please refer to: [QNN HTP Shared Buffer Tutorial](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_shared_buffer_tutorial.html#shared-buffer-tutorial)

## Offline Prepare

When generating serialized.bin, it is recommended to pass the option `--input_output_tensor_mem_type memhandle` to reduce the memory
footprint. With this option, qnn-context-binary-generator changes the IO tensor memory type to memhandle. When the QnnHtp backend loads serialized.bin, it can
skip memory allocation for the IO tensors and knows that the user intends to use shared buffers during execution.
Omitting this option does not impact inference performance.

**Preparation**

    # Prerequisites: model.so, qnn-context-binary-generator, QnnHtp backend .so library

    ./qnn-context-binary-generator --model libqnn_model.so --backend libQnnHtp.so --binary_file qnngraph.serialized --output_dir output --input_output_tensor_mem_type memhandle

    # qnngraph.serialized.bin is generated and saved at output/qnngraph.serialized.bin

**Execution**

Please refer to: [QNN HTP Shared Buffer Tutorial](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_shared_buffer_tutorial.html#shared-buffer-tutorial)

**Mismatched mem\_type between preparation and execution**

| Preparation | Execution | Behavior |
| --- | --- | --- |
| `raw` | `raw` | QNN HTP will allocate memory for IO buffer.<br>HTP will copy input and output at each inference. |
| `raw` | `memhandle` | QNN HTP will allocate memory for IO buffer.<br>Data copy avoided. |
| `memhandle` | `raw` | QNN HTP will not allocate memory for IO buffer during preparation.<br>QNN HTP will allocate memory for IO buffer during the first inference (`raw` passed in during execution), so the first inference time is impacted by memory allocation.<br>HTP will copy input and output at each inference. |
| `memhandle` | `memhandle` | QNN HTP will not allocate memory for IO buffer during preparation.<br>Data copy avoided. |
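
The four rows of the table can also be read as three independent rules: the preparation type decides whether IO buffers are reserved up front, the execution type decides whether data is copied, and only the `memhandle`/`raw` combination triggers a late allocation. The following is a minimal sketch of that reading, not QNN code. `MemType`, `IoBehavior`, `describeBehavior`, and `verifyTable` are hypothetical names introduced for illustration and are not part of the QNN API.

```cpp
#include <cassert>

// Hypothetical stand-in for the IO tensor memory type; not a QNN type.
enum class MemType { Raw, MemHandle };

// Hypothetical summary of what the table says the backend does.
struct IoBehavior {
    bool allocAtPreparation;     // IO buffer memory reserved during preparation
    bool allocAtFirstInference;  // IO buffer memory allocated at the first inference
    bool copyEachInference;      // input and output copied at each inference
};

// Encodes the table rows as three rules.
constexpr IoBehavior describeBehavior(MemType prepared, MemType executed) {
    return IoBehavior{
        prepared == MemType::Raw,
        prepared == MemType::MemHandle && executed == MemType::Raw,
        executed == MemType::Raw,
    };
}

// Checks the sketch against each row of the table.
inline void verifyTable() {
    constexpr IoBehavior rawRaw = describeBehavior(MemType::Raw, MemType::Raw);
    assert(rawRaw.allocAtPreparation && !rawRaw.allocAtFirstInference && rawRaw.copyEachInference);

    constexpr IoBehavior rawMem = describeBehavior(MemType::Raw, MemType::MemHandle);
    assert(rawMem.allocAtPreparation && !rawMem.allocAtFirstInference && !rawMem.copyEachInference);

    constexpr IoBehavior memRaw = describeBehavior(MemType::MemHandle, MemType::Raw);
    assert(!memRaw.allocAtPreparation && memRaw.allocAtFirstInference && memRaw.copyEachInference);

    constexpr IoBehavior memMem = describeBehavior(MemType::MemHandle, MemType::MemHandle);
    assert(!memMem.allocAtPreparation && !memMem.allocAtFirstInference && !memMem.copyEachInference);
}
```

Calling `verifyTable()` from a test or from `main` exercises all four combinations. Read this way, the table suggests that the execution-time memory type alone determines whether the per-inference copy happens, while the preparation-time type only affects when (or whether) the internal IO buffers are allocated.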

Last Published: Jun 04, 2026
