# Tutorial - Running Inference Using Shared Memory

Qualcomm® AI Engine Direct Delegate provides APIs that let users allocate specified tensors, usually graph inputs and outputs, in shared memory. This avoids copying large tensors between the TFLite CPU runtime and Qualcomm® AI Engine Direct, which can speed up inference.

Currently, this feature is supported only with the HTP backend.

Users are responsible for managing shared memory resources themselves.
See `TfLiteQnnDelegateAllocCustomMem` and `TfLiteQnnDelegateFreeCustomMem` for more information.

The TFLite interpreter provides the `SetCustomAllocationForTensor` API to set a custom memory allocation
for a given tensor. Call `AllocateTensors` after setting a custom allocation to
make sure that no buffer is invalid or too small.

Fully delegated models with large graph inputs and outputs benefit the most, because they would otherwise spend the most time copying tensors.

## Workflow of using shared memory

To create an application that uses shared memory, we recommend the following pattern:

1. [Step 1: Try to request enough memory space on shared memory](https://docs.qualcomm.com/doc/80-63442-10/topic/tutorial_htp_shared_memory.html#step-1-try-to-request-enough-memory-space-on-shared-memory)
2. [Step 2: Set the custom allocate tensor info](https://docs.qualcomm.com/doc/80-63442-10/topic/tutorial_htp_shared_memory.html#step-2-set-the-custom-allocate-tensor-info)
3. [Step 3: Assign a custom memory allocation for the given tensor](https://docs.qualcomm.com/doc/80-63442-10/topic/tutorial_htp_shared_memory.html#step-3-assign-a-custom-memory-allocation-for-the-given-tensor)
4. [Step 4: Free the allocated tensor at the end](https://docs.qualcomm.com/doc/80-63442-10/topic/tutorial_htp_shared_memory.html#step-4-free-the-allocated-tensor-at-the-end)

### Step 1: Try to request enough memory space on shared memory

```cpp
void* custom_ptr = TfLiteQnnDelegateAllocCustomMem(num_bytes, tflite::kDefaultTensorAlignment);
```

**num\_bytes**: Size of the buffer in bytes. It must equal, or be at least, the byte size of the target tensor.

**tflite::kDefaultTensorAlignment**: TfLite default alignment.

**custom\_ptr**: Pointer to the shared buffer on success; NULL on failure.
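Step 1 leaves the choice of `num_bytes` to the application. In practice, `interpreter_->tensor(tensor_idx)->bytes` already holds a tensor's exact size. The sketch below shows the arithmetic behind that number and one way to pad it to an "enough" size. Two assumptions apply. The 64-byte constant is meant to mirror `tflite::kDefaultTensorAlignment`. Padding is a conservative choice, not a documented requirement of `TfLiteQnnDelegateAllocCustomMem`.

```cpp
#include <cstddef>
#include <vector>

// Assumption: mirrors tflite::kDefaultTensorAlignment (64 bytes).
constexpr std::size_t kAssumedTensorAlignment = 64;

// Exact byte size of a dense tensor: product of its dimensions times the
// element size (e.g. 4 for float32, 1 for uint8).
std::size_t TensorBytes(const std::vector<int>& dims, std::size_t element_size) {
  std::size_t count = 1;
  for (int d : dims) count *= static_cast<std::size_t>(d);
  return count * element_size;
}

// Rounds a byte count up to the next multiple of `alignment`, giving an
// "enough" size that is never smaller than the exact size.
std::size_t RoundUp(std::size_t bytes, std::size_t alignment) {
  return (bytes + alignment - 1) / alignment * alignment;
}
```

For a `1x224x224x3` uint8 input, `TensorBytes({1, 224, 224, 3}, 1)` is 150528, which is already a multiple of 64. A `1x1000` float32 output needs 4000 bytes, which pads to 4032.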

### Step 2: Set the custom allocate tensor info

```cpp
TfLiteCustomAllocation tensor_alloc = {custom_ptr, num_bytes};
```

Wrap the shared buffer and tensor bytes together as a `TfLiteCustomAllocation`.
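It can help to check the allocation locally before passing it to the interpreter. TFLite expects a custom allocation to be aligned to `tflite::kDefaultTensorAlignment` and to be at least as large as the tensor. The sketch below mirrors those two checks and adds a NULL check. `CustomAllocationSketch` is a hypothetical stand-in with the same `{data, bytes}` fields as `TfLiteCustomAllocation`. It is used here so the example compiles without TFLite headers.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in with the same fields as TfLiteCustomAllocation.
struct CustomAllocationSketch {
  void* data;
  std::size_t bytes;
};

// Returns true when the buffer is non-null, aligned to `alignment`, and at
// least `required_bytes` long, i.e. usable for SetCustomAllocationForTensor.
bool IsUsableAllocation(const CustomAllocationSketch& alloc,
                        std::size_t required_bytes, std::size_t alignment) {
  if (alloc.data == nullptr) return false;
  auto addr = reinterpret_cast<std::uintptr_t>(alloc.data);
  return addr % alignment == 0 && alloc.bytes >= required_bytes;
}
```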

### Step 3: Assign a custom memory allocation for the given tensor

```cpp
interpreter_->SetCustomAllocationForTensor(tensor_idx, tensor_alloc);
```

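**tensor\_idx**: Index of the tensor to back with shared memory, for example an entry of `interpreter_->inputs()` or `interpreter_->outputs()`.

**tensor\_alloc**: The `TfLiteCustomAllocation` created in Step 2.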
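Step 3 configures one tensor. To back several tensors, such as every graph input and output, repeat Steps 1 to 3 for each index. The sketch below allocates one buffer per tensor. If any allocation returns NULL, it frees the buffers already obtained so that nothing leaks. `AllocFn`, `FreeFn`, `TensorBuffer`, and `AllocateAll` are hypothetical names. In an application, `alloc` and `release` would forward to `TfLiteQnnDelegateAllocCustomMem` and `TfLiteQnnDelegateFreeCustomMem`.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical callables standing in for the QNN Delegate alloc/free calls.
using AllocFn = std::function<void*(std::size_t bytes, std::size_t alignment)>;
using FreeFn = std::function<void(void*)>;

struct TensorBuffer {
  int tensor_idx;
  void* data;
  std::size_t bytes;
};

// Allocates one buffer per (tensor_idx, bytes) request. On the first NULL
// result, frees every buffer allocated so far and returns an empty vector.
std::vector<TensorBuffer> AllocateAll(
    const std::vector<std::pair<int, std::size_t>>& requests,
    std::size_t alignment, const AllocFn& alloc, const FreeFn& release) {
  std::vector<TensorBuffer> buffers;
  for (const auto& [idx, bytes] : requests) {
    void* ptr = alloc(bytes, alignment);
    if (ptr == nullptr) {
      for (const auto& b : buffers) release(b.data);
      return {};
    }
    buffers.push_back({idx, ptr, bytes});
  }
  return buffers;
}
```

Each returned `TensorBuffer` would then become a `TfLiteCustomAllocation{data, bytes}` for `SetCustomAllocationForTensor`, followed by a single `AllocateTensors` call.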

### Step 4: Free the allocated tensor at the end

```cpp
TfLiteQnnDelegateFreeCustomMem(custom_ptr);
```

**custom\_ptr**: Allocated shared buffer pointer.
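Because the application owns the shared buffer, an RAII wrapper makes it harder to skip Step 4 or to free the buffer twice. This is a minimal sketch using `std::unique_ptr` with a custom deleter. `AllocFnPtr`, `FreeFnPtr`, `SharedMemDeleter`, and `MakeSharedMem` are hypothetical names. In practice, the function pointers would be thin wrappers around `TfLiteQnnDelegateAllocCustomMem` and `TfLiteQnnDelegateFreeCustomMem`.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical signatures for thin wrappers around the QNN Delegate calls.
using AllocFnPtr = void* (*)(std::size_t bytes, std::size_t alignment);
using FreeFnPtr = void (*)(void* ptr);

// Deleter that releases the buffer with the matching free function.
// std::unique_ptr only invokes it for non-null pointers.
struct SharedMemDeleter {
  FreeFnPtr release;
  void operator()(void* ptr) const { release(ptr); }
};

using SharedMemPtr = std::unique_ptr<void, SharedMemDeleter>;

// Wraps the raw allocation so the buffer is freed exactly once, including on
// early returns. A NULL allocation yields an empty SharedMemPtr.
SharedMemPtr MakeSharedMem(std::size_t bytes, std::size_t alignment,
                           AllocFnPtr alloc, FreeFnPtr release) {
  return SharedMemPtr(alloc(bytes, alignment), SharedMemDeleter{release});
}
```

Declaring the wrapper before the interpreter means it is destroyed after the interpreter, so the buffer outlives every tensor that points into it.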

## A Running Example of using shared memory

The following example runs a model with one tensor backed by shared memory.

```cpp
#include "QNN/TFLiteDelegate/QnnTFLiteDelegate.h"
#include "tensorflow/lite/interpreter.h"

// Setup interpreter with .tflite model. interpreter_ is assumed to be a
// std::unique_ptr<tflite::Interpreter> built with tflite::InterpreterBuilder.

// Create QNN Delegate options structure.
TfLiteQnnDelegateOptions options = TfLiteQnnDelegateOptionsDefault();

// Set the mandatory backend_type option as HTP.
options.backend_type = kHtpBackend;

// Instantiate delegate. Must not be freed until interpreter is freed.
// Please use QNN Delegate interface rather than external delegate interface.
TfLiteDelegate* delegate = TfLiteQnnDelegateCreate(&options);

// Pick the tensor to place in shared memory (here, as an example, the first
// graph input) and request at least as many bytes as the tensor needs.
int tensor_idx = interpreter_->inputs()[0];
size_t num_bytes = interpreter_->tensor(tensor_idx)->bytes;

// Allocate enough memory space on shared memory.
void* custom_ptr = TfLiteQnnDelegateAllocCustomMem(num_bytes, tflite::kDefaultTensorAlignment);
if (custom_ptr == nullptr) {
  // Allocation failed: handle the error instead of continuing.
}

// Assign (or reassign) a custom memory allocation for the given tensor, then re-allocate tensors.
TfLiteCustomAllocation tensor_alloc = {custom_ptr, num_bytes};
if (interpreter_->SetCustomAllocationForTensor(tensor_idx, tensor_alloc) != kTfLiteOk ||
    interpreter_->AllocateTensors() != kTfLiteOk) {
  // The buffer is invalid or too small: handle the error.
}

// Register QNN Delegate with TfLite interpreter to automatically delegate nodes.
interpreter_->ModifyGraphWithDelegate(delegate);

// Write input data into the shared buffer, then perform inference as usual.
interpreter_->Invoke();

// The user is responsible for freeing the allocated memory.
TfLiteQnnDelegateFreeCustomMem(custom_ptr);

// Delete delegate after interpreter no longer needed.
TfLiteQnnDelegateDelete(delegate);
```

The output should look like:

```
INFO: Initialized TensorFlow Lite runtime.
INFO: TfLiteQnnDelegate delegate: 128 nodes delegated out of 128 nodes with 1 partitions.
INFO: Replacing 128 node(s) with delegate (TfLiteQnnDelegate) node, yielding 1 partitions.
INFO: Tensor 0 is successfully registered to shared memory.
INFO: Tensor 319 is successfully registered to shared memory.
```

Last Published: Jun 04, 2026
