# Qualcomm Hexagon Plugin Interface (QHPI)

- [Overview](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#overview)

    - [Key Features](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#key-features)
    - [Key Terms](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#key-terms)
- [API/ABI Compatibility](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#api-abi-compatibility)
- [Quick-Start Checklist for Migrating to QHPI](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#quick-start-checklist-for-migrating-to-qhpi)
- [QHPI-Based Operator Package Details](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#qhpi-based-operator-package-details)

    - [QNN Operator Package Skeletal Generation](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#qnn-operator-package-skeletal-generation)
    - [Operator Implementations](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#operator-implementations)

        - [Operator definition](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#operator-definition)
        - [Kernel Implementation](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#kernel-implementation)
        - [Precomputation](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#precomputation)
        - [Multithreading](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#multithreading)
        - [Source Destructive](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#source-destructive)
        - [Cost Function](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#cost-function)
        - [Optimization Rules](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#optimization-rules)
        - [Tiling](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#tiling)
        - [Predicates](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#predicates)
    - [Operator registration](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#operator-registration)

## [Overview](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#id2)

QHPI provides a set of **well-defined, strongly typed C APIs** that enable operator writers to create and register operators with the **QNN HTP Backend (BE)**. It replaces the legacy `DEF_PACKAGE_OPT`-based operator system in QNN HTP.

### [Key Features](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#id3)

- **API/ABI Compatibility**

    Provides strong API and ABI compatibility for operator packages through versioned op package registration APIs and data structures. The versioned APIs enable customers to develop op packages using a given QHPI API version and continue to use those in future SDKs without any further changes if the customer does not need features from the newer QHPI API versions. A QHPI op package built with an older SDK is expected to continue working on newer SDKs **without** requiring recompilation.
- **Multi-threading Support**

    Allows kernels to run in parallel across multiple hardware threads of the same type, improving performance on the NPU.
- **Smooth Transition**

    Coexists with legacy C++ `DEF_PACKAGE_OPT` based packages to simplify migration. QHPI integrates seamlessly with existing QNN tools and the QNN HTP Backend operator package workflow, requiring minimal changes.

### [Key Terms](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#id4)

The rest of this document uses the following key terms:

- **Operator packages**

    An operator package (also referred to as an *op package*) is a collection of one or more HTP operator implementations.

    - Built as a dynamically linked (shared) library.
    - Provided at graph preparation and execution time to supply custom operator implementations.
    - Must implement the QNN Operator Package Interface API as defined in `QnnOpPackage.h`.
    - Includes a well-defined entry point for the HTP backend to invoke during dynamic loading.
    - For more details, see [Op Packages](https://docs.qualcomm.com/doc/80-63442-10/topic/op_packages.html).
- **Kernel**

    A specific C/C++ function invoked during execution. Kernels are typically associated with specific layouts and element types.
- **Operator**

    A named node in a machine learning graph. Operators are generic with respect to layout, element type, and storage placement. At the end of graph preparation, every operator is associated with a specific kernel.
- **Precomputation**

    The QHPI equivalent of the legacy `COMPILER_FOR` macro: a function that is invoked when a prepared graph is loaded for execution, so that work derived from tensor properties is done once rather than on every inference.
- **Multithreading**

    Operators can be invoked multiple times with distinct slice identifiers, enabling parallel execution across different hardware threads of the same type.

See `<QNN_SDK_ROOT>/include/HTP/core/qhpi.h` for the QHPI API definitions.

## [API/ABI Compatibility](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#id5)

API and ABI compatibility is one of the key capabilities that QHPI offers to operator packages. To achieve it, all APIs and data structures defined in `<QNN_SDK_ROOT>/include/HTP/core/qhpi.h` are **subject to versioning**. Any significant change to a data structure or function is made by creating a **new** version of that data structure or function with a `_vXXX` suffix, where `XXX` is the version number (for example, `_v1`).

The example below shows an updated QHPI operator data structure, `QHPI_OpInfo_v1`, which references the latest QHPI kernel version, `QHPI_Kernel_v1`.

```c
typedef struct {
   const char *name;
   uint32_t num_kernels;
   QHPI_Kernel_v1 *kernels;
   QHPI_RewriteOpFunc early_rewrite;
   QHPI_TileShapeRequired shape_required;
   QHPI_TileShapeLegalized shape_legalized;
   QHPI_BuildTileOfOp build_tile;
   QHPI_RewriteOpFunc late_rewrite;
} QHPI_OpInfo_v1;
```

Here is an example of a versioned API function, `qhpi_register_ops_v1`, introduced to register operators described by the latest operator data structure, `QHPI_OpInfo_v1`:

```c
// Register a collection of v1 operators
uint32_t qhpi_register_ops_v1(uint32_t num_ops, QHPI_OpInfo_v1 *operators, const char *package);
```

The versioned APIs and data structures will enable SDK users to develop op packages using a given SDK and continue using them on future SDKs without any further changes or recompilation.
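
The versioning rule follows a familiar C ABI pattern. A published struct layout is frozen, and a new `_vN` struct and matching registration function are added alongside it. The older entry point keeps working by filling in defaults for fields it never had. The minimal, self-contained sketch below illustrates only that general pattern. All `Demo*`/`demo_*` names are hypothetical and are not the contents of `qhpi.h`. How the HTP backend actually handles older versions internally is an assumption here.

```cpp
#include <cstdint>

// Hypothetical types illustrating the _vXXX convention; NOT the QHPI definitions.
typedef struct {
    const char *name;
    uint32_t num_kernels;
} DemoOpInfo_v0;              // original layout: never modified once published

typedef struct {
    const char *name;
    uint32_t num_kernels;
    uint32_t flags;           // new field exists only in the v1 layout
} DemoOpInfo_v1;

// Unversioned internal record kept by the (hypothetical) host.
struct HostOp {
    const char *name;
    uint32_t num_kernels;
    uint32_t flags;
};

constexpr uint32_t kMaxOps = 16;
static HostOp g_ops[kMaxOps];
static uint32_t g_num_ops = 0;

// Newest entry point: copies every field the v1 layout defines.
extern "C" uint32_t demo_register_ops_v1(uint32_t num_ops, const DemoOpInfo_v1 *ops) {
    uint32_t i = 0;
    for (; i < num_ops && g_num_ops < kMaxOps; ++i)
        g_ops[g_num_ops++] = HostOp{ops[i].name, ops[i].num_kernels, ops[i].flags};
    return i;                 // number of ops actually registered
}

// Older entry point stays exported with unchanged behavior: it upgrades
// v0 records by supplying a default for the field v0 callers never set.
extern "C" uint32_t demo_register_ops_v0(uint32_t num_ops, const DemoOpInfo_v0 *ops) {
    uint32_t i = 0;
    for (; i < num_ops && g_num_ops < kMaxOps; ++i)
        g_ops[g_num_ops++] = HostOp{ops[i].name, ops[i].num_kernels, 0u};
    return i;
}
```

In this sketch, `demo_register_ops_v0` is never removed or changed. A binary built against the v0 layout therefore still links and behaves the same after v1 is introduced.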

## [Quick-Start Checklist for Migrating to QHPI](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#id6)

Follow these steps to migrate an existing legacy operator package to QHPI:

1. **Update XML Configuration**

   - Set `UseQHPI="true"` in the `OpDefCollection` element.
   - Verify that operator definitions in the XML follow the OpDef schema.
   - Reference: [QNN XML Op Def](https://docs.qualcomm.com/doc/80-63442-10/topic/op_def_schema.html).

2. **Generate QHPI Skeleton**

   - Use `qnn-op-package-generator` to create a skeletal QHPI implementation.
   - Command:

     ```shell
     <QNN_SDK_ROOT>/bin/x86_64-linux-clang/qnn-op-package-generator --config <xml_file> --output_path <output_dir>
     ```

   - Reference: [Generating Op Packages](https://docs.qualcomm.com/doc/80-63442-10/topic/generating_op_packages.html).

3. **Implement Op Package Entry Point**

   - Ensure that the generated skeleton includes the **mandatory** function `qhpi_init()`, which is required for the QHPI operator package to load and register successfully.
   - Register QHPI ops with the QNN HTP BE using the appropriate versioned registration API. For example, the initial QHPI release provides the `qhpi_register_ops_v1()` API, which can be invoked from `qhpi_init()` to register operators with the QNN HTP BE.

   **Note:** `qhpi_init()` is the QHPI equivalent of the legacy macro `INIT_PKG_CORE_INIT_FUNC`. QHPI provides versioned registration APIs; pick the API that matches your SDK and your op, kernel, and tensor properties.

4. **Implement Operator Logic**

   - Replace legacy macros (e.g., `REGISTER_OP`) with:
     - `QHPI_Kernel_vxxx` structures for kernel definitions.
     - `QHPI_Tensor_Signature_vxxx` for input/output tensor properties.
     - `QHPI_OpInfo_vxxx` for operator-to-kernel mapping.
   - Ensure that:
     - `function_name` is unique.
     - Attributes such as `resources` and `source_destructive` are examined and initialized appropriately.
     - Kernels are ordered by preference, with the most preferred kernel listed first. By default, QHPI selects kernels in the order they appear in `QHPI_OpInfo_vxxx`; the op writer can override this with predicates. See [Predicates](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#predicates-details) for details.

5. **Handle Kernel Invocation**

   - Implement kernel execution functions for:
     - Default execution.
     - Precomputation (optional).
   - See [Operator Implementations](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#operator-implementations) for details.

6. **Enable Advanced Features**

   - **Multithreading**: Set the `multithreaded` flag and use the slice APIs if the kernel implementation can be parallelized across multiple hardware threads.
   - **Source Destructive**: Set the `source_destructive` flag if the first input and the first output can share memory.
   - **Cost Function**: Optional; used to predict performance, not to select kernels.

7. **Implement Rewrite Rules**

   - Rewrite callbacks can rewrite an operator into a new subgraph containing other QHPI and QNN operators.
   - Replace `DEF_PACKAGE_OPTIMIZATION` op rewrite rules with optimizations implemented through the appropriate QHPI C API rewrite callbacks:
     - `early_rewrite`: Optional. When specified, it is invoked early in graph compilation, before op tiling.
     - `late_rewrite`: Optional. When specified, it is invoked during graph compilation, after op tiling.

8. **Implement Tiling Rules**

   - Tiling callbacks let the op writer customize how HTP splits a QHPI operator in a given graph.
   - Replace any `DEF_PACKAGE_OPTIMIZATION` rules that use `AUTOSPLIT` and `TILING` with implementations based on the appropriate QHPI tiling callbacks:
     - `shape_required`: Optional. When implemented, it enforces requirements on tiling dimensions.
     - `shape_legalized`: Optional. When implemented, it adjusts the tile shape proposed by HTP's tiling heuristics.
     - `build_tile`: Optional. When implemented, it creates a tiled operator that computes a specified output slice.

9. **Validate and Test**

   - Build the operator package using the Makefiles generated in step 2.
   - Verify that the operator package works with the HTP backend by following the workflow in the [Custom op package tutorial](https://docs.qualcomm.com/doc/80-63442-10/topic/tutorial1.html). The steps for building and executing a model using the QNN HTP BE have **not** changed with QHPI.

10. **Review Sample Ports**

    Refer to the following sample legacy-to-QHPI ports in the SDK:

    - `${QNN_SDK_ROOT}/examples/QNN/OpPackage/HTP/QHPI/`
    - `${QNN_SDK_ROOT}/examples/QNN/OpPackageGenerator/generated/HTP/`

The following section provides additional details on these steps.

## [QHPI-Based Operator Package Details](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#id7)

A QNN operator package implementation consists of three components:

- **QNN Operator Package Skeletal Generation**
- **Operator Implementation**
- **Operator Registration**

QHPI does **not** change the QNN op package interface definition. The QNN operator package interface remains the same whether you use legacy APIs or QHPI for operator implementations. However, the operator implementation and registration APIs must be updated to use QHPI.

The following sections show how to create a new QHPI-based operator package or migrate an existing legacy package to QHPI. Several QHPI-based op implementation examples can be found in the SDK at:

```
${QNN_SDK_ROOT}/examples/QNN/OpPackage/HTP/QHPI/
```

### [QNN Operator Package Skeletal Generation](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#id8)

QNN provides a streamlined way to create the required operator package interface implementation and supporting build files for generating QHPI-based operator packages using the `qnn-op-package-generator` tool.

To use the tool, package information and operators must be defined using the XML OpDef schema, as described in [QNN XML Op Def](https://docs.qualcomm.com/doc/80-63442-10/topic/op_def_schema.html).

To enable QHPI, update the XML configuration file by setting the `UseQHPI="true"` attribute in the `OpDefCollection` element, as shown below:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<OpDefCollection
   PackageName="ExampleOpPackage"
   Domain="aisw"
   Version="1.0"
   UseQHPI="true">
   ...
</OpDefCollection>
```

Sample XML configurations can be found in [Example XML Op Def Configs](https://docs.qualcomm.com/doc/80-63442-10/topic/example_op_defs.html) and in the SDK at:

```
${QNN_SDK_ROOT}/examples/QNN/OpPackageGenerator
```

Based on the input/output data types and parameters specified in the XML configuration file, the `qnn-op-package-generator` tool creates a QHPI-based skeletal implementation using definitions provided in the SDK. The kernel tiling, execution, and other functions are stubbed out in the generated skeleton and must be implemented by the developer.

The generated skeletal implementation also includes the mandatory entry-point function `qhpi_init()`, which is required for successful loading and registration of a QHPI operator package.

**Note:** The `qhpi_init()` function is the QHPI equivalent of the legacy macro `INIT_PKG_CORE_INIT_FUNC`.

Given an XML configuration file, a skeletal implementation for the op package can be generated using the following command:

```shell
<QNN_SDK_ROOT>/bin/x86_64-linux-clang/qnn-op-package-generator --config <xml_file> --output_path <output_dir>
```

Further details on op package skeleton generation can be found in [Generating Op Packages](https://docs.qualcomm.com/doc/80-63442-10/topic/generating_op_packages.html).

### [Operator Implementations](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#id9)

The operator implementation is the primary area where QHPI APIs differ from the legacy C++ macro-based approach. The following sections provide details on QHPI operator implementation.

#### [Operator definition](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#id10)

In the legacy interface, the `DEF_PACKAGE_OP` macro and its variants declare kernels and associate them with operators. For example:

```cpp
template <typename Ttype>
GraphStatus asin_opt(Ttype &out, const Ttype &in);

DEF_PACKAGE_OP(asin_opt<QuantUint16Tensor>, "Asin_16");
DEF_PACKAGE_OP(asin_opt<QuantUint16Tensor_TCM>, "Asin_16");
DEF_PACKAGE_OP(asin_opt<QUint16CroutonTensor>, "Asin_16");
DEF_PACKAGE_OP(asin_opt<QUint16CroutonTensor_TCM>, "Asin_16");
```

In this example, four kernels are associated with the operator `"Asin_16"`. They differ in the layout (flat or crouton) and memory placement (DDR or TCM) of their tensors. The macro relies on C++ templates to interpret type signatures and match kernels to tensor types.

In QHPI, kernels are declared using the static data structure `QHPI_Kernel_vxxx`, which defines kernel attributes such as the function name, resources, input/output signatures, and flags.

Example:

```cpp
static QHPI_Kernel_v1 asin16_kernels[] = {{
   .function_name = THIS_PKG_NAME_STR "::" "asin_16_flat",
   .function = asin_16<QuantUint16Tensor>,
   .resources = QHPI_RESOURCE_HVX,
   .source_destructive = true,
   .min_inputs = 1,
   .input_signature = &sig_flat_16,
   .min_outputs = 1,
   .output_signature = &sig_flat_16,
}, ... };
```

**`QHPI_Tensor_Signature_vxxx`**

Captures tensor properties such as element type, layout, storage, and memory placement. The corresponding legacy macro equivalent is `DEF_TENSOR_PROPERTIES`.

Example:

```cpp
static QHPI_Tensor_Signature_v1 sig_flat_16 = {
   .element_type = QHPI_QUInt16,
   .layout = QHPI_Layout_Flat4,
   .storage = QHPI_Storage_Direct,
   .mem_placement = QHPI_MemLoc_DDR_OR_TCM,
};
```

**`QHPI_OpInfo_vxxx`**

Defines an operator and associates it with one or more kernels.

Example:

```cpp
static QHPI_OpInfo_v1 ops[] = {{
   .name = THIS_PKG_NAME_STR "::" "Asin_16",
   .num_kernels = 2,
   .kernels = asin16_kernels,
}, ... };
```

**Note:** Operator names follow the convention `PackageName::OperatorName`. By default, kernels are matched in the order they appear in the operator definition.

#### [Kernel Implementation](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#id11)

Every QHPI kernel defined with `QHPI_Kernel_vxxx` must provide an execution function that runs the kernel during inference.

Example kernel execution signature:

```cpp
template <typename Ttype>
inline GraphStatus asin_opt(Ttype &out, const Ttype &in);

template<typename TensorType>
static uint32_t asin_16(QHPI_RuntimeHandle *,
                        uint32_t num_outputs, QHPI_Tensor **outputs,
                        uint32_t num_inputs, const QHPI_Tensor *const *inputs) {
   return asin_opt<TensorType>(*reinterpret_cast<TensorType *>(outputs[0]),
                               *reinterpret_cast<const TensorType *>(inputs[0]));
}
```
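
The wrapper above bridges QHPI's C-compatible kernel signature, which only sees opaque `QHPI_Tensor` pointers, and a type-specific C++ implementation. The self-contained sketch below reproduces that pattern with hypothetical stand-ins. `DemoTensor`, `DemoRuntime`, `FloatTensor`, and the `kOk`/`kErr` status values are assumptions, not QHPI or HTP types. The sketch also checks the input and output counts before indexing, a defensive addition that is not in the original example.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins: opaque handles (like QHPI_Tensor / QHPI_RuntimeHandle)
// and one concrete tensor class (like QuantUint16Tensor). Not real QHPI types.
struct DemoTensor;
struct DemoRuntime;
struct FloatTensor { std::vector<float> data; };

constexpr uint32_t kOk = 0;    // assumed status codes for this sketch only
constexpr uint32_t kErr = 1;

// Typed implementation, analogous to asin_opt<Ttype>.
template <typename T>
uint32_t asin_impl(T &out, const T &in) {
    out.data.resize(in.data.size());
    for (std::size_t i = 0; i < in.data.size(); ++i)
        out.data[i] = std::asin(in.data[i]);
    return kOk;
}

// C-compatible entry point shaped like the QHPI kernel signature: validate the
// counts, recover the concrete tensor type from the opaque pointer, forward.
template <typename T>
uint32_t asin_kernel(DemoRuntime * /*handle*/, uint32_t num_outputs, DemoTensor **outputs,
                     uint32_t num_inputs, const DemoTensor *const *inputs) {
    if (num_outputs < 1 || num_inputs < 1) return kErr;
    return asin_impl<T>(*reinterpret_cast<T *>(outputs[0]),
                        *reinterpret_cast<const T *>(inputs[0]));
}

// Each instantiation yields one plain function pointer for a kernel table.
using KernelFn = uint32_t (*)(DemoRuntime *, uint32_t, DemoTensor **,
                              uint32_t, const DemoTensor *const *);
static const KernelFn asin_float_kernel = asin_kernel<FloatTensor>;
```

In this sketch, each template instantiation plays the role that `asin_16<QuantUint16Tensor>` plays in the `.function` field of the kernel table shown earlier.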

#### [Precomputation](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#id12)

QHPI supports precomputation through additional function pointers in `QHPI_Kernel_v1`. Work that depends only on tensor properties can then be done once at graph load instead of on every inference:

- **`do_precomputation_function`**

  Called during graph load to initialize a data block. This API replaces the legacy `COMPILER_FOR` macro. The function has access to tensor information such as shape, block table, and quantization parameters. Any computation based on this information can be done here and stored in the data block for later use during graph inference.

- **`function_with_precomputed_data`**

  Called during inference with the runtime handle and the data block filled in by `do_precomputation_function`. It is used as an alternative to the kernel's default execution function specified in `function`.

Example:

```cpp
static QHPI_Kernel_v1 kernels[] = {{
   .function_name = THIS_PKG_NAME_STR "::" "Asin_16",
   ...
   .precomputed_data_size = sizeof(Precompute),
   .do_precomputation_function = asin_do_precomputation,
   .function_with_precomputed_data = asin_use_precomputation,
}};
```
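
The split between `do_precomputation_function` and `function_with_precomputed_data` is a compute-once, reuse-many pattern. The sketch below illustrates it with a requantization step whose scale ratio depends only on quantization parameters, so it can be derived once at load time. Several things here are assumptions for illustration:

- the function names and the `QuantParams`/`Precompute` layouts;
- the simplified signatures;
- the convention `real = (q - offset) * stepsize`.

The real callback signatures are defined in `qhpi.h`.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical quantization parameters; assumes real = (q - offset) * stepsize.
struct QuantParams { float stepsize; int32_t offset; };

// Layout of the package-defined data block (its size would be precomputed_data_size).
struct Precompute {
    float in_to_out_scale;   // input stepsize / output stepsize
    int32_t in_offset;
    int32_t out_offset;
};

// Load-time step: derive constants once from the quantization parameters.
void demo_do_precomputation(void *block, QuantParams in, QuantParams out) {
    const Precompute p{in.stepsize / out.stepsize, in.offset, out.offset};
    std::memcpy(block, &p, sizeof p);
}

// Inference step: reuse the block instead of recomputing the ratio on each call.
void demo_run_with_precomputed(const void *block, const std::vector<int32_t> &in,
                               std::vector<int32_t> &out) {
    Precompute p;
    std::memcpy(&p, block, sizeof p);
    out.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        // (q_in - off_in) * (step_in / step_out) + off_out
        const float scaled = static_cast<float>(in[i] - p.in_offset) * p.in_to_out_scale;
        out[i] = static_cast<int32_t>(std::lround(scaled)) + p.out_offset;
    }
}
```

Here `sizeof(Precompute)` corresponds to the value assigned to `precomputed_data_size` in the kernel table above.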

#### [Multithreading](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#id13)

- Enable it by setting the `multithreaded` flag in `QHPI_Kernel_v1`.
- Each invocation of the kernel then receives a distinct slice, so the work can run in parallel across multiple hardware threads.
- Retrieve the current thread's slice number and the total number of slices with these runtime functions:

```cpp
// fh: the QHPI_RuntimeHandle * passed to the kernel execution function
uint32_t num_slices = qhpi_num_slices(fh);
uint32_t slice_number = qhpi_slice_number(fh);
```
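
The slice number and slice count tell a kernel which share of the work to do, but dividing the work is up to the kernel. A common choice is to split the outer loop into near-equal contiguous ranges. Each slice then writes a disjoint part of the output and needs no synchronization. This scheme is an assumption, not a QHPI requirement. `slice_range` is a hypothetical helper; in a real kernel its last two arguments would come from `qhpi_num_slices()` and `qhpi_slice_number()`.

```cpp
#include <cstdint>

// Half-open [begin, end) range of work items assigned to one slice.
struct SliceRange { uint32_t begin; uint32_t end; };

// Split `total` work items (rows, tiles, ...) as evenly as possible across
// `num_slices`; the first `total % num_slices` slices get one extra item.
SliceRange slice_range(uint32_t total, uint32_t num_slices, uint32_t slice_number) {
    if (num_slices == 0 || slice_number >= num_slices) return {0, 0};
    const uint32_t base = total / num_slices;
    const uint32_t extra = total % num_slices;
    const uint32_t begin = slice_number * base + (slice_number < extra ? slice_number : extra);
    const uint32_t len = base + (slice_number < extra ? 1u : 0u);
    return {begin, begin + len};
}
```

For example, 10 rows over 4 slices gives the ranges [0,3), [3,6), [6,8), and [8,10). When there are fewer items than slices, the surplus slices receive an empty range.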

#### [Source Destructive](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#id14)

Specify `source_destructive = true` in `QHPI_Kernel_v1` if the **first** input and output tensors can share memory. Such a kernel must ensure that it reads the input before writing the corresponding output location.

**Note:** This optimization is opportunistic. The kernel must also run correctly when the tensors do not share the same memory location.
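
For an elementwise kernel, the read-before-write rule is easy to satisfy. The sketch below is a hypothetical int16 ReLU on raw arrays, not a QHPI kernel. It produces the same result whether `out` aliases `in` or points to a separate buffer. A kernel whose output element *i* depends on input elements that an earlier iteration has already overwritten (for example, a neighborhood filter) would need restructuring before it could be flagged as source destructive.

```cpp
#include <cstddef>
#include <cstdint>

// Elementwise ReLU written to be safe under source_destructive: each iteration
// reads in[i] into a local before writing out[i], and never reads an index
// that has already been written. Correct whether or not out == in.
void relu_i16(int16_t *out, const int16_t *in, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const int16_t v = in[i];                       // read first
        out[i] = v > 0 ? v : static_cast<int16_t>(0);  // then write the same position
    }
}
```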

#### [Cost Function](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#id15)

In the legacy APIs, cost functions influenced both kernel selection and execution-time prediction. In QHPI, they are used **only to predict execution times**, not to select kernels. See [Predicates](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#predicates-details) for how kernel selection works.

Example:

```cpp
float cost_func(const uint32_t num_inputs, const QHPI_Tensor *const *inputs) {
   QHPI_Shape shape = qhpi_tensor_shape(inputs[0]);
   // Element count of the first input (assumes a rank-4 tensor)
   unsigned size = shape.dims[0] * shape.dims[1] * shape.dims[2] * shape.dims[3];
   return size * 0.2f + 10.0f;
}
```
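
As a worked example of the linear model in `cost_func`, take a 1×64×64×32 input. It has 1·64·64·32 = 131072 elements, so the predicted cost is 131072 × 0.2 + 10 ≈ 26224.4, in the model's own units. The sketch below computes the same figure for a shape of any rank. `DemoShape` is a hypothetical stand-in that mirrors only the `rank`/`dims` fields used in this page's examples. The coefficients are the example's values, not measured ones.

```cpp
#include <cstdint>

// Hypothetical stand-in mirroring the rank/dims fields of QHPI_Shape used above.
struct DemoShape { uint32_t rank; uint32_t dims[8]; };

// Element count for any rank (the example above assumes exactly rank 4).
uint64_t num_elements(const DemoShape &s) {
    uint64_t n = 1;
    for (uint32_t d = 0; d < s.rank; ++d) n *= s.dims[d];
    return n;
}

// Same linear model as cost_func: a per-element cost plus a fixed overhead.
float linear_cost(const DemoShape &s, float per_element, float overhead) {
    return static_cast<float>(num_elements(s)) * per_element + overhead;
}
```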

#### [Optimization Rules](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#id16)

QHPI replaces the `DEF_OPT` domain-specific language (DSL) with a simplified C API for graph rewrites. Op writers can implement the following **optional** rewrite callbacks, which run before and after tiling:

- **`early_rewrite`**

  Invoked before the compiler performs any tiling. This function can rewrite the operator into a new subgraph of operators, which may be QHPI operators or standard QNN operators.

  Example:

  ```cpp
  // gen_const_scalar_f32() is not defined in this snippet; it is assumed to be a
  // package-level helper that creates a scalar f32 constant. INFINITY is from <cmath>.
  static const QHPI_Op *relu_to_relu_minmax_quant_rewrite(const QHPI_Op *op) {
     QHPI_OpRef input = qhpi_op_input(op, 0);
     QHPI_OutputDef input_output = qhpi_op_output(input.op, input.output_number);

     // Check if input is quantized type; if not, return the op unchanged
     if (input_output.type != QHPI_QUInt8 && input_output.type != QHPI_QUInt16 &&
         input_output.type != QHPI_QInt8 && input_output.type != QHPI_QInt16) {
        return op;
     }

     // Create ReluMinMax with min=0.0f, max=INF
     QHPI_OpRef min_const = gen_const_scalar_f32(op, 0.0f);
     QHPI_OpRef max_const = gen_const_scalar_f32(op, INFINITY);

     QHPI_OpRef inputs[] = {input, min_const, max_const};
     QHPI_OutputDef output = qhpi_op_output(op, 0);

     return qhpi_op_create(op, THIS_PKG_NAME_STR "::ReluMinMax", 3, inputs, 1, &output);
  }
  ```
- **`late_rewrite`**

  Allows the op package to rewrite an operator into a new subgraph after tiling. In this case, the new subgraph should contain only plugin operators and a small set of additional operators such as `Slice_shape`, `Concat`, and `Reshape`. A late rewrite can also introduce scratch space after tiling, in the form of unused outputs.

  Example:

  ```cpp
  static const QHPI_Op *relu_late_rewrite(const QHPI_Op *op) {
     // Use late rewrite to add scratch
     if (qhpi_op_num_outputs(op) > 1)
        return op;   // already has an extra (scratch) output: leave it unchanged
     QHPI_OutputDef outputs[2];
     outputs[0] = qhpi_op_output(op, 0);
     // Second output: an unused int32 buffer that serves as scratch space
     outputs[1] = {.type = QHPI_Int32,
                   .shape = {.rank = 4, .dims = {1, 1, 1, 32}}};
     QHPI_OpRef input = qhpi_op_input(op, 0);
     return qhpi_op_create(op, qhpi_op_name(op), 1, &input, 2, outputs);
  }
  ```

The explicit phase ordering supported by `DEF_OPT` is replaced by these simpler callbacks, which run before and after tiling.

#### [Tiling](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#id17)

QHPI supports direct callbacks into a centralized tiling algorithm that decides how to split operators into smaller instances. By hooking into our central tiler, you can increase parallelism across functional units and reduce peak memory footprint so that data stays in TCM, which improves end-to-end latency. At the same time, the central tiler weighs the costs of over-decomposing operators. It avoids excessive inter-op communication and, where possible, aligns chunking dimensions to minimize concatenation and slicing. The outcome is a chunk size chosen for every operator the central tiler processes.

It is strongly recommended that users opt in to these callbacks by, at a minimum, implementing a `build_tile` function.

During tiling, the following functions can be defined to drive how a QHPI operator is split in the graph:

- **`shape_required`**

  Callback that receives an instance of a plugin operator and returns a `QHPI_Shape` object that forces certain sizes on each tiling dimension. This function is optional; if it is omitted, no restrictions are placed at the start of tiling.

  Example:

  ```cpp
  // RELU_TILE_HEIGHT and RELU_CHANNEL_SPLIT_SIZE are package-defined constants
  static QHPI_Shape relu_shape_required(const QHPI_Op *op) {
     // Tiling requirements: fixed tile sizes on the height and channel dimensions
     static QHPI_Shape required = {
        .rank = 4,
        .dims = {1, RELU_TILE_HEIGHT, 0, RELU_CHANNEL_SPLIT_SIZE}
     };
     return required;
  }
  ```
- **`shape_legalized`**

  Callback that receives an instance of a plugin operator and a candidate tile shape. It returns a "legal" tile shape after considering the shape initially proposed by the central tiler's heuristics. This supports operator-specific requirements on the shape (e.g., some dimension must be a multiple of some value for good performance). This function is optional. If it is omitted, no restrictions are assumed beyond those imposed by a `shape_required` function, if one is present.

  Example:

  ```cpp
  // Per the description above, this callback also receives the candidate tile
  // shape; see QHPI_TileShapeLegalized in qhpi.h for the exact signature.
  static QHPI_Shape relu_shape_legalized(const QHPI_Op *op) {
     static QHPI_Shape legal = {
        .rank = 4,
        .dims = {1, 8, 0, 256}
     };
     ...
     return legal;
  }
  ```
- **`build_tile`**

  Callback that receives an instance of a plugin operator, a starting location, and an extent of its first output. It creates a new instance of the operator that computes that particular output tile. The key task is to determine the new inputs to this operator, which are typically slices of the original operator's inputs. This function is required for the op to be split into smaller chunks; when it is omitted, the op is passed back unsplit.

  Example:

  ```cpp
  static const QHPI_Op *relu_build_tile(const QHPI_Op *op,
                                        const QHPI_Shape *out_start,
                                        const QHPI_Shape *out_extent) {
     // Get input reference
     QHPI_OpRef input_ref = qhpi_op_input(op, 0);

     // For ReLU, input and output have same dimensions, so input slice = output slice
     QHPI_Shape in_start = *out_start;
     QHPI_Shape in_extent = *out_extent;

     // Create input slice
     QHPI_OpRef input_slice = qhpi_op_slice(input_ref, &in_start, &in_extent);

     // Build tiled operator with sliced input
     QHPI_OpRef inputs[] = {input_slice};

     QHPI_OutputDef outputs[] = {
        {.type = qhpi_op_output(op, 0).type,
         .quant_parameters = qhpi_op_output(op, 0).quant_parameters,
         .shape = *out_extent}
     };

     return qhpi_op_create(op, qhpi_op_name(op), 1, inputs, 1, outputs);
  }
  ```

These tiling callbacks can be invoked several times during graph preparation, both to evaluate candidate chunk sizes and to generate the new sub-operations.
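
This page does not describe the central tiler's internals. The sketch below shows only the one-dimensional arithmetic that the tiling callbacks revolve around. First, it rounds a candidate tile size up to a multiple of some value (the case mentioned for `shape_legalized`). Second, it lists the (start, extent) pairs that cover a dimension, which play the role of `out_start`/`out_extent` in `build_tile`. `legalize_tile`, `enumerate_tiles`, and `Tile1D` are hypothetical helpers. The real callbacks operate on a `QHPI_Shape` across all dimensions.

```cpp
#include <cstdint>
#include <vector>

// One tile along a single output dimension (cf. out_start / out_extent).
struct Tile1D { uint32_t start; uint32_t extent; };

// shape_legalized-style adjustment: round a proposed tile size up to a multiple
// of `align` (0 = no alignment constraint), then clamp to the full dimension.
// A zero proposal becomes a single aligned tile.
uint32_t legalize_tile(uint32_t proposed, uint32_t align, uint32_t dim) {
    if (align == 0) return proposed < dim ? proposed : dim;
    uint32_t rounded = ((proposed + align - 1) / align) * align;
    if (rounded == 0) rounded = align;
    return rounded < dim ? rounded : dim;
}

// Tiles covering [0, dim); the last one is shorter when dim % tile != 0.
std::vector<Tile1D> enumerate_tiles(uint32_t dim, uint32_t tile) {
    std::vector<Tile1D> tiles;
    if (tile == 0) return tiles;
    for (uint32_t start = 0; start < dim; start += tile) {
        const uint32_t extent = dim - start < tile ? dim - start : tile;
        tiles.push_back({start, extent});
    }
    return tiles;
}
```

For an elementwise op such as ReLU, each output tile maps to an identical input slice, exactly as `relu_build_tile` does. Ops with halos or reductions would need a wider input slice than their output tile.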

#### [Predicates](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#id18)

By default, QHPI matches kernels based on tensor signatures, in the order in which the kernels were specified at op registration time. The op writer can further influence kernel matching by implementing an optional `predicate` callback. The callback returns non-zero (true) to select the kernel, or zero (false) to skip it and move on to the next kernel.

Example:

```cpp
uint32_t asin_plugin_default_predicate(const QHPI_Op *op, const uint32_t num_inputs, const QHPI_Tensor *const *inputs)
{
   if (num_inputs == 0) {
      return 0u; // false
   }
   for (uint32_t i = 0; i < num_inputs; i++) {
      if (inputs[i] == nullptr) {
         return 0u; // false
      }
      QHPI_Shape shape = qhpi_tensor_shape(inputs[i]);
      for (uint32_t d = 0; d < shape.rank; d++) {
         if (shape.dims[d] == 0) {
            return 0u; // false
         }
      }
      QHPI_Quant_Parameters qp = qhpi_tensor_quant_parameters(inputs[i]);
      if (qp.stepsize == 0.0f) {
         return 0u; // false
      }
   }
   return 1u; // non-zero => "true"
}
```
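
The selection behavior described above can be modeled as a first-match search. The sketch below is a simplified, hypothetical model; `DemoKernel`, `DemoInput`, `large_inputs_only`, and `select_kernel` are not QHPI types or functions. Candidates are tried in registration order, and a kernel is chosen only if its signature matches and its optional predicate returns non-zero. Evaluating the predicate only after the signature check is an assumption about ordering.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified description of one input tensor.
struct DemoInput {
    uint32_t element_type;
    uint32_t layout;
    uint32_t num_elements;
};

// Hypothetical kernel entry: a signature plus an optional predicate.
struct DemoKernel {
    const char *name;
    uint32_t element_type;                      // required element type
    uint32_t layout;                            // required layout
    uint32_t (*predicate)(const DemoInput &);   // nullptr means "always accept"
};

// Example predicate: accept only inputs large enough to amortize setup cost.
static uint32_t large_inputs_only(const DemoInput &in) {
    return in.num_elements >= 1024 ? 1u : 0u;
}

// First-match selection in registration order.
const DemoKernel *select_kernel(const DemoKernel *kernels, std::size_t n,
                                const DemoInput &in) {
    for (std::size_t i = 0; i < n; ++i) {
        const DemoKernel &k = kernels[i];
        if (k.element_type != in.element_type || k.layout != in.layout)
            continue;                           // signature mismatch
        if (k.predicate != nullptr && k.predicate(in) == 0u)
            continue;                           // predicate rejected: try the next kernel
        return &k;
    }
    return nullptr;                             // no kernel matched
}
```

Listing a specialized kernel with a predicate ahead of a generic kernel without one gives a fast path that falls back cleanly, mirroring the "most preferred kernel first" guidance.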

### [Operator registration](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_qhpi.html#id19)

QHPI operators defined with `QHPI_OpInfo_vxxx` are registered with the QNN HTP BE through the corresponding `qhpi_register_ops_vXXX` registration API. This API is called from the op package dynamic library's entry point, `qhpi_init()`.

Example:

```cpp
// OpInfo definitions
static QHPI_OpInfo_v1 ops[] = {
   {
      .name = THIS_PKG_NAME_STR "::Relu",
      .num_kernels = 2,
      .kernels = relu_kernels,
      .early_rewrite = relu_to_relu_minmax_quant_rewrite,
      .shape_required = relu_shape_required,
      .build_tile = relu_build_tile,
   },
   // ...
};

// Registration function for regular ReLU operations
void register_relu_ops()
{
   qhpi_register_ops_v1(sizeof(ops) / sizeof(ops[0]), ops, THIS_PKG_NAME_STR);
}

extern "C" const char *qhpi_init()
{
   // Register the ops with HTP BE
   register_relu_ops();
   return THIS_PKG_NAME_STR;
}
```

The next step after creating and building a QHPI op package is to build and execute a model that uses operators implemented in the op package. The steps for building and executing a model using QNN HTP BE have **not** changed due to QHPI, and are outlined in [Custom op package tutorial](https://docs.qualcomm.com/doc/80-63442-10/topic/tutorial1.html).

Last Published: Jun 04, 2026
