# Implementing Ops

## Overview

This document describes in detail how to implement ops in the context of a
QNN HTP op package. The following sections are covered:

1. [Writing a New Op](https://docs.qualcomm.com/doc/80-63442-10/topic/implementing_ops.html#Writing-a-New-Op)
2. [Steps for Implementing an Op](https://docs.qualcomm.com/doc/80-63442-10/topic/implementing_ops.html#Steps-for-Implementing-an-Op)
3. [Considerations for Transformations](https://docs.qualcomm.com/doc/80-63442-10/topic/implementing_ops.html#Considerations-for-Transformations)
4. [Tiling as Graph Transformations](https://docs.qualcomm.com/doc/80-63442-10/topic/implementing_ops.html#Tiling-as-Graph-Transformations)
5. [Adding scratch memory to an Op](https://docs.qualcomm.com/doc/80-63442-10/topic/implementing_ops.html#Adding-scratch-memory-to-an-Op)
6. [Tips for Optimization](https://docs.qualcomm.com/doc/80-63442-10/topic/implementing_ops.html#Tips-for-Optimization)

## Writing a New Op

The internal representation of a QNN graph is a sequence of `OpDef`
nodes that represent the entire graph. Some `OpDef` nodes represent
graph calculations, with one or more inputs; the operation is determined
by the name given to the `OpDef`. An `OpDef` class expresses the
definition of an operation and lists the inputs to the op (`InputDef`)
and the output characteristics of the op (`OutputDef`).

An `OpDef` can also represent `const` data (for representing parameters,
weights, etc.) or `shapes` (which don’t have data but still need a
means to be expressed as a shape).

The graph preparation phase starts after the graph construction phase.
During the prepare stage, a set of optimization rules is applied based
on priority, which causes transformations to be made to the graph: parts
of the graph are removed and replaced by other arrangements of `OpDef` nodes.
The rules are organized into passes. Each pass attempts to match a
sequence of nodes and applies the transformation on the sequence if the
constraint is satisfied.

The user sets a priority number for each optimization rule to indicate the
order in which optimizations should be applied. The framework loops through
all the `OpDef` nodes in the graph and checks the optimization rules of the
current pass against them. The loop ends when none of the `OpDef` nodes in the
graph can be optimized using the rules from the current pass. Adjustments
and replacements to an `OpDef` are applied only when possible.
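
The pass loop described above can be pictured with a small standalone sketch. It does not use the HTP core API: `Node`, `Rule`, and `run_passes` are hypothetical names, and this toy version matches single nodes, whereas the real framework matches subgraph patterns.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are not HTP core types.
struct Node {
    std::string op_name;
};

struct Rule {
    unsigned priority;                          // smaller value = earlier pass
    std::function<bool(const Node &)> matches;  // match pattern plus constraint
    std::function<void(Node &)> rewrite;        // replacement
};

// Rules with the same priority form one pass. Passes run in ascending
// priority order, and each pass sweeps the graph until nothing changes.
// A rule must stop matching after it has rewritten a node, otherwise the
// sweep would never terminate.
inline void run_passes(std::vector<Node> &graph, std::vector<Rule> rules)
{
    std::stable_sort(rules.begin(), rules.end(),
                     [](const Rule &a, const Rule &b) { return a.priority < b.priority; });
    std::size_t first = 0;
    while (first < rules.size()) {
        std::size_t last = first;
        while (last < rules.size() && rules[last].priority == rules[first].priority) {
            ++last;
        }
        assert(last > first);
        bool changed = true;
        while (changed) {
            changed = false;
            for (Node &node : graph) {
                for (std::size_t k = first; k < last; ++k) {
                    if (rules[k].matches(node)) {
                        rules[k].rewrite(node);
                        changed = true;
                    }
                }
            }
        }
        first = last;
    }
}
```

In this sketch a rule registered with a later priority sees the graph only after all earlier passes have reached a fixed point, which is why the relative priority values matter more than their absolute magnitudes.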

## Steps for Implementing an Op

Writing an op can be broken down into the five steps below; the last two are optional.

### *Step 1*: Op Implementation

The user needs to implement the functionality of the op. Here is a simple
example of element-wise addition:

    GraphStatus elementwise_add(Tensor &out, const Tensor &a, const Tensor &b)
    {
        out.set_dims(a);
        auto [aB,aH,aW,aD] = a.dims();
        // The batch index is named n so that it does not shadow the input tensor b.
        for (Idx n = 0; n < aB; n++) {
            for (Idx h = 0; h < aH; h++) {
                for (Idx w = 0; w < aW; w++) {
                    for (Idx d = 0; d < aD; d++) {
                        out(n,h,w,d) = a(n,h,w,d) + b(n,h,w,d);
                    }
                }
            }
        }
        return GraphStatus::Success;
    }

This is just a basic reference op implementation - more improvements are
provided later in the doc. Let’s look at some of the details.

The parameter list of an op implementation function consists of a series of
HTP core tensors in the following order: `outputs, inputs, parameters`. Input tensors
and parameter tensors shall be marked as const. Note that implementation
functions make no distinction between input tensors and parameters; HTP core
considers both to be inputs. Both QNN scalar and tensor parameters are
converted into HTP core tensors. In addition, HTP core tensors always have
4 dimensions, and the layout is always bhwc. QNN tensors with fewer dimensions
are backfilled into 4-dimensional HTP core tensors. Op implementation functions
shall return GraphStatus, which is an enum defined in
include/HTP/core/graph\_status.h in the QNN SDK.

HTP core has a base tensor type `Tensor` and several `ConcreteTensor` types.
`ConcreteTensor` types are derived from the base `Tensor`, and each `ConcreteTensor`
type has a fixed rank, memory layout and data type. The base `Tensor` can be used
in generic op implementations and serves as a fallback option. `ConcreteTensor`
types can be used to specialize op implementations for better performance.

In this implementation, there are no parameters, and the generic `Tensor` is used for
both inputs and outputs. Users can access elements of these generic tensors using
parentheses. Regardless of the underlying tensor type, the element access
interface type is float.

There is a downside to this generic approach: it is **slow**! Every
element access needs to call some functions to find a location, and
another set of functions to decode/encode the value.

Fortunately, if more visibility is provided to the compiler about the
nature of the tensors, this overhead can be greatly reduced.

    GraphStatus elementwise_add_faster(PlainFloatTensor &out, const PlainFloatTensor &a, const PlainFloatTensor &b)
    {
        out.set_dims(a);
        auto [aB,aH,aW,aD] = a.dims();
        // The batch index is named n so that it does not shadow the input tensor b.
        for (Idx n = 0; n < aB; n++) {
            for (Idx h = 0; h < aH; h++) {
                for (Idx w = 0; w < aW; w++) {
                    for (Idx d = 0; d < aD; d++) {
                        out(n,h,w,d) = a(n,h,w,d) + b(n,h,w,d);
                    }
                }
            }
        }
        return GraphStatus::Success;
    }

The version above is slightly optimized: it uses the ConcreteTensor type
`PlainFloatTensor`, which is a tensor holding `float`
values in a flat memory layout. The compiler now has the visibility to
eliminate the function calls for accessing each element and decoding it,
so this implementation is more efficient than the previous
implementation.
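
The difference between the two access paths can be illustrated with a standalone sketch. The two classes below are hypothetical stand-ins, not the HTP core `Tensor` and `PlainFloatTensor`; in particular, using a virtual call to model the generic lookup is an assumption made only for illustration. The offset arithmetic is the usual row-major formula for a flat bhwc layout.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical generic interface: every element read is an opaque call.
struct GenericTensorSketch {
    virtual ~GenericTensorSketch() = default;
    virtual std::size_t dim(int i) const = 0;
    virtual float get(std::size_t b, std::size_t h, std::size_t w, std::size_t d) const = 0;
};

// Hypothetical concrete type: float data in a flat bhwc layout.
struct FlatFloatTensorSketch final : GenericTensorSketch {
    std::array<std::size_t, 4> dims;
    std::vector<float> data;

    FlatFloatTensorSketch(std::size_t B, std::size_t H, std::size_t W, std::size_t D)
        : dims{{B, H, W, D}}, data(B * H * W * D, 0.0f) {}

    std::size_t dim(int i) const override { return dims[i]; }

    // Plain arithmetic that the compiler can inline and hoist out of loops.
    std::size_t offset(std::size_t b, std::size_t h, std::size_t w, std::size_t d) const
    {
        assert(b < dims[0] && h < dims[1] && w < dims[2] && d < dims[3]);
        return ((b * dims[1] + h) * dims[2] + w) * dims[3] + d;
    }
    float &operator()(std::size_t b, std::size_t h, std::size_t w, std::size_t d)
    {
        return data[offset(b, h, w, d)];
    }
    float operator()(std::size_t b, std::size_t h, std::size_t w, std::size_t d) const
    {
        return data[offset(b, h, w, d)];
    }
    float get(std::size_t b, std::size_t h, std::size_t w, std::size_t d) const override
    {
        return data[offset(b, h, w, d)];
    }
};

// Generic path: one indirect call per element.
inline float sum_generic(const GenericTensorSketch &t)
{
    float total = 0.0f;
    for (std::size_t b = 0; b < t.dim(0); ++b)
        for (std::size_t h = 0; h < t.dim(1); ++h)
            for (std::size_t w = 0; w < t.dim(2); ++w)
                for (std::size_t d = 0; d < t.dim(3); ++d)
                    total += t.get(b, h, w, d);
    return total;
}

// Concrete path: the same loop, but the element access is visible to the compiler.
inline float sum_concrete(const FlatFloatTensorSketch &t)
{
    float total = 0.0f;
    for (std::size_t b = 0; b < t.dims[0]; ++b)
        for (std::size_t h = 0; h < t.dims[1]; ++h)
            for (std::size_t w = 0; w < t.dims[2]; ++w)
                for (std::size_t d = 0; d < t.dims[3]; ++d)
                    total += t(b, h, w, d);
    return total;
}
```

Both functions compute the same result; only the amount of information available to the compiler differs.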

For a list of HTP core tensor types and their accessor functions, please refer
to include/HTP/core/tensor.h in QNN SDK.

More descriptions about HTP memory layouts and tensors can be found in
tensors\_and\_memory\_layout.html.

How does the infrastructure choose which implementation to select? In this
case, if the input and output tensors of a node of this op type are
all `PlainFloatTensor`, then both implementations can work, and the
implementation registered with the lower cost is selected.
If any of the input or output tensors of the node is not a
`PlainFloatTensor`, then the first implementation, with the
generic `Tensor` type, is the fallback option.
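
The selection logic just described can be modelled in a few lines. This is a simplified sketch, not the framework's actual mechanism: `Candidate` and `select_impl` are hypothetical, tensor types are compared as strings, all tensors of a node are assumed to share one type, and the numeric cost values are made up (only their order, FREE < FAST < SNAIL < GLACIAL, follows this document).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Cost names follow this document; the numeric values are arbitrary.
enum class Cost { FREE = 0, FAST = 1, SNAIL = 2, GLACIAL = 3 };

// Hypothetical record of one registered implementation.
struct Candidate {
    std::string impl_name;
    std::string tensor_type;  // "Tensor" stands for the generic base type
    Cost cost;
};

// Returns the cheapest candidate that accepts the node's tensor type,
// or nullptr when nothing matches.
inline const Candidate *select_impl(const std::vector<Candidate> &registered,
                                    const std::string &actual_type)
{
    assert(!actual_type.empty());
    const Candidate *best = nullptr;
    for (const Candidate &c : registered) {
        // The generic implementation accepts any tensor type.
        const bool accepts = (c.tensor_type == "Tensor") || (c.tensor_type == actual_type);
        if (!accepts) {
            continue;
        }
        if (best == nullptr || c.cost < best->cost) {
            best = &c;
        }
    }
    return best;
}
```

With a generic implementation at GLACIAL and a `PlainFloatTensor` implementation at SNAIL, a node with `PlainFloatTensor` operands picks the specialized version, while any other tensor type falls back to the generic one.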

It is important to understand that reference ops can be called when the
operands do not match the optimized implementation. Generic
implementations can be desirable or can be implemented as always failing
to catch problems in an input graph or with the optimization process.

`elementwise_add` and `elementwise_add_faster` implementations are
extremely similar in implementation. These functions could be refactored
to be **one** templatized function:

    template<typename TType>
    GraphStatus elementwise_add(TType &out, const TType &a, const TType &b)
    {
        out.set_dims(a);
        auto [aB,aH,aW,aD] = a.dims();
        // The batch index is named n so that it does not shadow the input tensor b.
        for (Idx n = 0; n < aB; n++) {
            for (Idx h = 0; h < aH; h++) {
                for (Idx w = 0; w < aW; w++) {
                    for (Idx d = 0; d < aD; d++) {
                        out(n,h,w,d) = a(n,h,w,d) + b(n,h,w,d);
                    }
                }
            }
        }
        return GraphStatus::Success;
    }

To further optimize any op, HVX can be used to achieve parallel calculations.

### *Step 2*: Op Registration

Op implementation functions need to be registered with an op name, op cost and flags.
Op registration can be achieved using HTP core macros listed below, and these macros
should be placed in global scope in individual op implementation source files.

#### Method 1

Registration with default cost value (i.e. GLACIAL) and default flag (Flags::RESOURCE\_HVX)

**Syntax**

    /*
     * F  - op implementation function
     *
     * OP - op name
     */
    DEF_PACKAGE_OP(F,OP)

**Example**

    DEF_PACKAGE_OP(elementwise_add<Tensor>, "Add")

#### Method 2

Registration with user specified cost value and flags.

**Syntax**

    /*
     * F    - op implementation function
     *
     * OP   - op name
     *
     * COST - pre-defined cost value names, one of GLACIAL, SNAIL, FAST, FREE
     *        (listed in descending order of value).
     *        Op implementation with relatively lower cost will be chosen given all
     *        other criteria are met.
     *
     * ...  - zero or more flags, available flags include IS_CONST, INHIBIT_CONST_PROP,
     *        RESOURCE_HVX.
     *        IS_CONST is used to mark an op should be treated as a constant op.
     *        INHIBIT_CONST_PROP marks an op should not participate in constant propagation.
     *        RESOURCE_HVX marks this op will use HVX resources.
     */
    DEF_PACKAGE_OP_AND_COST_AND_FLAGS(F,OP,COST,...)

**Example**

    DEF_PACKAGE_OP_AND_COST_AND_FLAGS (
                elementwise_add<PlainFloatTensor>,
                "Add",
                SNAIL,
                RESOURCE_HVX)

#### Method 3

Registration with user specified cost function and flags.

**Syntax**

    /*
     * F      - op implementation function
     *
     * OP     - op name
     *
     * COST_F - user defined cost function
     *          cost function pointer type: typedef float (*cost_function) (const Op * op);
     *          Op implementation with relatively lower cost will be chosen given all
     *          other criteria are met.
     *
     * ...    - zero or more flags, available flags include IS_CONST, INHIBIT_CONST_PROP,
     *          RESOURCE_HVX.
     *          IS_CONST is used to mark an op should be treated as a constant op.
     *          INHIBIT_CONST_PROP marks an op should not participate in constant propagation.
     *          RESOURCE_HVX marks this op will use HVX resources.
     */
    DEF_PACKAGE_OP_AND_COST_F_AND_FLAGS(F,OP,COST_F,...)

**Example**

    float elementAddCost(const Op *op) {
      // can use some properties of an op to determine cost
      return 0.0;
    }

    DEF_PACKAGE_OP_AND_COST_F_AND_FLAGS (
                elementwise_add<PlainFloatTensor>,
                "Add",
                elementAddCost,
                RESOURCE_HVX)

### *Step 3*: Specify the DEF\_TENSOR\_PROPERTIES for operators

Specifying the requirements and constraints for operators in DEF\_TENSOR\_PROPERTIES centralizes the decisions about the layout and memory placement of tensors.
Here is an example:

    DEF_TENSOR_PROPERTIES(Op("Argmax", "in", "axis"),
                         Flat("*", "axis"),
                         MainMemory("..."))

In this example, the first literal, `"Argmax"`, is the name of the operator and the remaining literals provide local names for the tensor inputs.

- `"*"` refers to the (first) output tensor.
- `"in"` identifies the first input tensor; it is not mentioned in any layout constraint and so may have either flat or crouton layout.
- `"axis"` identifies the parameter tensor, which is required to be in flat layout.
- The ellipsis refers to “all tensors not yet constrained”, so in this case all tensors are constrained to be in main memory.

#### Constraint terms:

- Flat: flat layout
- Crouton: crouton layout
- Tcm: in TCM
- MainMemory: in main memory.

These are constraints: if a tensor is not mentioned for some property, the system assigns a value for that property. More precisely, the ellipsis refers to “all tensors not yet constrained on the relevant property”; in the example above, no tensor has a placement constraint before the ellipsis is reached, so all tensors are constrained to be in main memory.
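
The way named constraints and the ellipsis interact can be modelled with a short standalone sketch. It is only an illustration of the rules stated above: `Properties`, `PropertyMap`, and `apply_term` are hypothetical names, and an empty string stands for a property that is left for the system to decide.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of per-tensor properties; "" means "not constrained".
struct Properties {
    std::string layout;     // "Flat", "Crouton", or ""
    std::string placement;  // "Tcm", "MainMemory", or ""
};

using PropertyMap = std::map<std::string, Properties>;

// Applies one constraint term to the listed tensor names. The name "..."
// expands to every tensor whose relevant property is still unconstrained.
inline void apply_term(PropertyMap &tensors, const std::string &term,
                       const std::vector<std::string> &names)
{
    const bool is_layout = (term == "Flat" || term == "Crouton");
    assert(is_layout || term == "Tcm" || term == "MainMemory");
    for (const std::string &name : names) {
        for (auto &entry : tensors) {
            std::string &slot = is_layout ? entry.second.layout : entry.second.placement;
            const bool selected = (name == "...") ? slot.empty() : (entry.first == name);
            if (selected) {
                slot = term;
            }
        }
    }
}
```

Applying `Flat` to `"*"` and `"axis"` and then `MainMemory` to `"..."` reproduces the Argmax example: `"in"` ends up with a main-memory placement but no layout constraint.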

#### Before and after TCM Migration

    // DEF_OPT rules we have in the old way
    DEF_PACKAGE_OPTIMIZATION(LATE+900,
        Op("FastExampleOp", "in_0", "in_1", "in_2"),
        OK,
        Op("FastExampleOp",
            Op(FROM_DEFAULT_PACKAGE("crouton_to_vtcm"), Op(FROM_DEFAULT_PACKAGE("ForceFormat_Crouton"), "in_0")),
            Op(FROM_DEFAULT_PACKAGE("flat_to_vtcm"), Op(FROM_DEFAULT_PACKAGE("ForceFormat_Flat"),"in_1")),
            "in_2"
        )
    )

Now, instead of using flat/crouton\_to\_vtcm, we use DEF\_TENSOR\_PROPERTIES to control layout and placement:

    // The new way
    DEF_TENSOR_PROPERTIES(Op("FastExampleOp","in0","in1","in2"),
        Flat("*","in1","in2"),
        Crouton("in0"),
        MainMemory("in2"),
        Tcm("in0", "in1"))

- in0 should be in Crouton layout and in TCM
- in1 should be in Flat layout and in TCM

To facilitate migration to the new way, a helper script is provided under examples/customer\_migration\_tool/customer\_migration.py in the QNN SDK. More examples can be found under examples/QNN/OpPackage/HTP.

### *Step 4*: Op Parameter Order Specification - *Optional*

At the QNN level, some ops use parameters in addition to inputs, and there
might be more than one parameter. Parameters are constants provided as
part of Qnn\_OpConfig\_t, and each Qnn\_Param\_t has a name field associated with it.
Due to the nature of the HTP op implementation function interface, the
parameters are differentiated based on order rather than names. To allow
QNN users to use op parameter names during QnnGraph\_addNode function calls,
the HTP backend allows op package writers to specify op parameter orders as well
as their default values for any op. The HTP backend rearranges the
op parameters passed to QnnGraph\_addNode based on the order listed. If an op
does not have an op parameter order specification, no rearrangement occurs
in QnnGraph\_addNode.

This is not applicable to the elementwise\_add op mentioned above.

Op parameter order can be specified using the HTP core macro listed below.
This macro should be placed in global scope in individual op implementation
source files.

**Syntax**

    /*
     * OP        - op name
     *
     * PARAM     - parameter name
     *
     * MANDATORY - boolean, whether this parameter is required to be provided at QnnGraph_addNode
     *
     * DEFAULT   - default parameter value as Qnn_Param_t*, is used when MANDATORY is false.
     *             If provided as Qnn_Param_t*, DEFAULT will be used for graph construction
     *             when this parameter is not provided at QnnGraph_addNode.
     *             If provided as nullptr, graph construction will skip this parameter when
     *             this parameter is not provided at QnnGraph_addNode.
     */
    DEF_PACKAGE_PARAM_ORDER(OP,PARAM1,MANDATORY1,DEFAULT1,PARAM2,MANDATORY2,DEFAULT2...)

This macro is used once per op and takes any number of parameters. If an op has
a parameter order definition, any parameter passed into QnnGraph\_addNode with
an unlisted name is discarded. If two or more op packages with the same
package name are registered, they cannot list conflicting parameter orders.

**Example**

    static Qnn_Scalar_t sg_opParamDefault1Scalar{.dataType = QNN_DATATYPE_FLOAT_32, .floatValue = 6.0};
    static Qnn_Param_t sg_opParamDefault1{.paramType   = QNN_PARAMTYPE_SCALAR,
                                          .scalarParam = sg_opParamDefault1Scalar};
    DEF_PACKAGE_PARAM_ORDER("paramOrderDemoOp",
                            "MinusVal",
                            false,
                            &sg_opParamDefault1,
                            "AxisVal",
                            true,
                            nullptr,
                            "AddVal",
                            true,
                            nullptr,
                            "OptionalParam",
                            false,
                            nullptr)

This example defines the op parameter order for op `paramOrderDemoOp` from the current
op package. It expects four parameters in the order “MinusVal”, “AxisVal”, “AddVal”,
“OptionalParam”. “MinusVal” is an optional parameter with a default scalar parameter
value defined in sg\_opParamDefault1. “AxisVal” and “AddVal” are mandatory parameters.
“OptionalParam” is optional and is skipped if not provided in QnnGraph\_addNode.
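
The rearrangement rules can be made concrete with a standalone model. Nothing here is QNN API: `ParamSpec` and `arrange_params` are hypothetical, plain `float` values stand in for `Qnn_Param_t`, and returning `false` for a missing mandatory parameter is an assumption, since this document does not say how that case is reported.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical description of one entry in a parameter order list.
struct ParamSpec {
    std::string name;
    bool mandatory;
    bool has_default;     // false plays the role of a nullptr default
    float default_value;  // only meaningful when has_default is true
};

// Arranges caller-supplied named parameters into the declared order.
// Returns false if a mandatory parameter is missing (assumed behaviour).
inline bool arrange_params(const std::vector<ParamSpec> &order,
                           const std::map<std::string, float> &supplied,
                           std::vector<float> &arranged)
{
    arranged.clear();
    for (const ParamSpec &spec : order) {
        const auto it = supplied.find(spec.name);
        if (it != supplied.end()) {
            arranged.push_back(it->second);
        } else if (spec.mandatory) {
            return false;
        } else if (spec.has_default) {
            arranged.push_back(spec.default_value);
        }
        // Optional parameter without a default: skipped entirely.
    }
    // Names in `supplied` that are not listed in `order` are never visited,
    // which models unlisted parameters being dropped.
    return true;
}
```

For the `paramOrderDemoOp` declaration above, supplying only “AddVal” and “AxisVal” (in either order) would yield the sequence default “MinusVal”, “AxisVal”, “AddVal” in this model, with “OptionalParam” skipped.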

### *Step 5*: Optimization Rule Definition - *Optional*

Once the basic functionality of the op is completed, the user might want
to specify rules for graph-level transformations to implement things
like tiling strategies or manipulating the data to simplify execution.
The transformations should be applied in a way that does not degrade
the output.

Optimization rules can be defined using the HTP core macro listed below.
This macro should be placed in global scope in individual op
implementation source files.

**Syntax**

    /*
     * PRIORITY       - unsigned integer value, used for indicating optimization pass number,
     *                  smaller number indicates earlier optimization pass.
     *                  Predefined values include EARLY(2000), MIDDLE(3000), LATE(4000).
     *
     * MATCHCODE      - subgraph matching pattern which this optimization rule should apply on
     *
     * CONSTRAINTCODE - constraints applied to the match pattern
     *
     * REPLACECODE    - new subgraph pattern which should replace the matching pattern if the
     *                  constraints are met
     */
    DEF_PACKAGE_OPTIMIZATION(PRIORITY,MATCHCODE,CONSTRAINTCODE,REPLACECODE)

**Example**

    DEF_PACKAGE_OPTIMIZATION(
        EARLY,
        Op("Add","X","B"),
        AND(EQ(RANK_OF("X"),4),EQ(RANK_OF("B"),4),EQ(DIM_DEPTH("X"),DIM_DEPTH("B")),
            EQ(DIM_WIDTH("B"),1),EQ(DIM_HEIGHT("B"),1),EQ(DIM_BATCHES("B"),1)),
        Op("BiasAdd","X","B")
    )

This rule has priority `EARLY`, a value that places it early in
the optimization process. `EARLY`, `MIDDLE`, and `LATE` are defined to
help order rules globally. Users may wish to use
`EARLY`, `EARLY+1`, `EARLY+2`, etc. to order optimizations.

This matches the pattern of an op with the operation string `Add` with
two inputs. The constraint ensures that the inputs are `4D`, that the
last dimensions match between the two inputs, and that the other
dimensions in the `B` input are 1.

If the constraint passes, the original op is replaced with a new op,
with the same inputs and same output specifications as the original
output but with the operation string replaced with `BiasAdd`.

Note that the strings supplied as inputs in the match are usable during
constraint and replacement patterns to indicate whatever was matched.
Additionally, there is a special string `"*"` which indicates the
entire match. If a placeholder string occurs more than once in a match,
it must be the same in all places.
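
For readers who prefer to see the constraint spelled out imperatively, the sketch below checks the same conditions on plain dimension lists. It is a standalone illustration, not HTP core code: `Shape` and `is_bias_add_candidate` are hypothetical, and shapes are given in batches, height, width, depth order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dimensions in b, h, w, d order.
using Shape = std::vector<std::size_t>;

// Mirrors the AND(...) constraint of the Add -> BiasAdd rule above.
inline bool is_bias_add_candidate(const Shape &x, const Shape &b)
{
    return x.size() == 4 && b.size() == 4  // EQ(RANK_OF("X"),4), EQ(RANK_OF("B"),4)
           && x[3] == b[3]                 // EQ(DIM_DEPTH("X"),DIM_DEPTH("B"))
           && b[2] == 1                    // EQ(DIM_WIDTH("B"),1)
           && b[1] == 1                    // EQ(DIM_HEIGHT("B"),1)
           && b[0] == 1;                   // EQ(DIM_BATCHES("B"),1)
}
```

An `X` of shape `[1,8,8,32]` with a `B` of shape `[1,1,1,32]` satisfies the constraint; a `B` of shape `[1,1,8,32]` or `[1,1,1,16]` does not, so the `Add` would be left unchanged.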

Op specifiers may be used more than once in a match or replacement
pattern; this will match or generate more than one op in the dependent
manner expected. For example:

    DEF_PACKAGE_OPTIMIZATION(
        EARLY+1,
        Op("BiasAdd",Op("Conv2d_valid","Activations","Weights","Stride"),"Bias"),
        OK,
        Op("ConvLayer_valid","Activations","Weights","Stride","Bias")
    )

This rule will match the sequence of `Conv2d_valid` followed by
`BiasAdd` and replace it with a new op called `ConvLayer_valid` with
four inputs.

**Cross-Package Optimization**

Cross-package optimization is allowed and supported. That means any op package can define optimization
rules which involve ops from other op packages. By default, all the op
names used in matching patterns and replacement patterns are assigned with
a package name associated with the current package. When an op
from a different op package is used in the matching pattern and/or
replacement pattern, users can explicitly use the format `packageName::opName`
wherever an op name is expected. To use any HTP native ops, the
`FROM_DEFAULT_PACKAGE(OPNAME)` macro can be used. For example,

    DEF_PACKAGE_OPTIMIZATION(
        EARLY+1,
        Op("BiasAdd",Op("OpPackageNo2::Conv2d_valid","Activations","Weights","Stride"),"Bias"),
        OK,
        Op(FROM_DEFAULT_PACKAGE("ConvLayer_valid"),"Activations","Weights","Stride","Bias")
    )

This modifies the previous optimization rule to match the op `Conv2d_valid` from
a package named `OpPackageNo2`, and it changes the replacement pattern to use
an HTP native `ConvLayer_valid` op.

For other examples of common default package ops, see [QNN HTP Op Package - Common Default Package Ops Usage Examples](https://docs.qualcomm.com/doc/80-63442-10/topic/common_default_package_ops_usage_examples.html#native-ops-usage).

**More Complex Optimization Rule Example**

The user can take this further and create a more complex optimization rule.
For example:

    DEF_PACKAGE_OPTIMIZATION(
        LATE,
        Op("ConvLayer_valid","Act","Weights","Stride","Bias"),
    
        AND(IS_QUINT8("Act"),       // the constraint
            IS_QUINT8("Weights"),
            EQ(DIM_HEIGHT("Stride"),2),
            EQ(DIM_WIDTH("Stride"),2),
            LT(int(DIM_DEPTH("Act")),4),
            GT(int(DIM_NFILTS("Weights")),31)),
    
        Op("ConvLayer_valid",       // the replacement rule
            WITH_TYPE("Act",
                WITH_SIZE(
                     gen_Shape(
                         DIM_BATCHES("Act"),
                         ADD(1,DIV(SUB(DIM_HEIGHT("Act"),DIM_FILTHEIGHT("Weights")),2)),
                         ADD(1,DIV(SUB(DIM_WIDTH("Act"),DIM_FILTWIDTH("Weights")),2)),
                         ROUNDUP(MUL(DIM_FILTHEIGHT("Weights"),DIM_FILTWIDTH("Weights"),DIM_FILTDEPTH("Weights")),32)
                     ),
                    Op("ConvLayer.opt.im2col_stride2","Act",gen_ShapeOf("Weights"))
                )
            ),
            WITH_TYPE("Weights",
                WITH_SIZE(
                    gen_Shape(
                        1,
                        1,
                        ROUNDUP(MUL(DIM_FILTHEIGHT("Weights"),DIM_FILTWIDTH("Weights"),DIM_FILTDEPTH("Weights")),32),
                        DIM_NFILTS("Weights")),
                    Op("ConvLayer.opt.weights_for_im2col","Weights")
                )
            ),
            gen_Shape(1,1,1,1),
            "Bias"
        )
    )

This has a simple match pattern (`ConvLayer_valid` with `4`
inputs), but there is a constraint which must be met before the
replacement rule is applied:

- `Act` and `Weights` inputs must both be of datatype `Quint8`
- `Stride` must have dimension of `2x2`
- `Act` depth must be `< 4`
- `Weights` must have `DIM_NFILTS > 31` (meaning its output depth > 31)

The replacement pattern for the optimization above generates the op

    Op( "ConvLayer_valid", <<new_act>>, <<new_weights>>, <<new_stride>>, "Bias")

where `<<new_act>>`, `<<new_weights>>`, and `<<new_stride>>` are constructed as
below:

- `<<new_stride>>` is just a `[1x1x1x1]` shape, produced by
  `gen_Shape(1,1,1,1)`.
- `<<new_act>>` is made by applying the original “`Act`” input to the op
  `Op("ConvLayer.opt.im2col_stride2","Act",gen_ShapeOf("Weights"))`.
  In other words, a new op `ConvLayer.opt.im2col_stride2` is inserted
  and “`Act`” becomes its first input. This op rearranges the data so
  that the equivalent convolution can be done with a point-wise
  convolution. It reduces the `height` and `width` dimensions by about
  a factor of 2 and increases the depth. Because of `WITH_TYPE` and
  `WITH_SIZE`:

    - the output type for `<<new_act>>` is the same as the original
      `Act` output type
    - the shape for `<<new_act>>` follows the constructed shape

          gen_Shape(
              DIM_BATCHES("Act"),
              ADD(1,DIV(SUB(DIM_HEIGHT("Act"),DIM_FILTHEIGHT("Weights")),2)),
              ADD(1,DIV(SUB(DIM_WIDTH("Act"),DIM_FILTWIDTH("Weights")),2)),
              ROUNDUP(MUL(DIM_FILTHEIGHT("Weights"),DIM_FILTWIDTH("Weights"),DIM_FILTDEPTH("Weights")),32)
          )

  Note that `gen_ShapeOf("Weights")` looks at the output shape
  of the `Weights` input and creates a constant `shape`
  object of the same shape.
- `<<new_weights>>` is similarly made by applying the original
  `Weights` input to the op
  `Op("ConvLayer.opt.weights_for_im2col", "Weights")`. The output
  shape is calculated as `[1,1,d_in,d_out]`, where `d_in` is the
  same as the output depth of `ConvLayer.opt.im2col_stride2`, and
  `d_out` is the same as the original output depth of the weights.
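
The shape arithmetic in this rule is easy to check by hand. The sketch below evaluates the two `gen_Shape` expressions in plain C++, assuming `DIV` is integer division and `ROUNDUP(x, 32)` rounds up to the next multiple of 32. The function names and the sample sizes (a 224x224x3 activation and 64 filters of size 7x7x3) are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Shape4 = std::array<std::size_t, 4>;  // batches, height, width, depth

inline std::size_t round_up(std::size_t value, std::size_t multiple)
{
    assert(multiple > 0);
    return ((value + multiple - 1) / multiple) * multiple;
}

// Shape given to <<new_act>> by WITH_SIZE in the rule above.
inline Shape4 im2col_stride2_act_shape(const Shape4 &act, std::size_t filt_h,
                                       std::size_t filt_w, std::size_t filt_d)
{
    assert(act[1] >= filt_h && act[2] >= filt_w);
    Shape4 out = {{act[0],
                   1 + (act[1] - filt_h) / 2,
                   1 + (act[2] - filt_w) / 2,
                   round_up(filt_h * filt_w * filt_d, 32)}};
    return out;
}

// Shape given to <<new_weights>>: [1, 1, d_in, d_out].
inline Shape4 im2col_weights_shape(std::size_t filt_h, std::size_t filt_w,
                                   std::size_t filt_d, std::size_t n_filts)
{
    Shape4 out = {{1, 1, round_up(filt_h * filt_w * filt_d, 32), n_filts}};
    return out;
}
```

For the sample sizes, the activation shape becomes `[1,109,109,160]`: 1 + (224 - 7) / 2 = 109, and 7 * 7 * 3 = 147 rounds up to 160. The weights shape becomes `[1,1,160,64]`, so the depth of the rearranged activation matches `d_in` of the rearranged weights.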

Starting from a simple element-wise add, it is possible to generate many
complex ops.

## Considerations for Transformations

By default, replacement ops are created with the same output parameters
(shape and quantization information) as the entire rule. This is not
always appropriate, especially if the user is manipulating part of an
input sequence.

- `WITH_SIZE(Size,Replacement)` generates Replacement with the same
size as `Size`.
- `WITH_TYPE(Type,Replacement)` generates Replacement with the same
type as `Type`.

For example, if someone is trying to manipulate a parameter of an input,
they might have a rule like:

    DEF_PACKAGE_OPTIMIZATION(
            EARLY,
            Op("MyOp","Input0","Input1"),
            OK,
            Op("MyOp.real","Input0",
                WITH_SIZE("Input1",
                    WITH_TYPE("Input1",
                        Op("MyOp.AdjustInput","Input1"))))
    )

This would replace the sequence `MyOp(A,B)` with
`MyOp.real(A,MyOp.AdjustInput(B))`, but would keep the size and type
of the second op the same as the `B` input.

To generate a shape or constant scalar op, there are some helper
replacement patterns:

- `gen_Shape(A,B,C,D)` generates a `4D` shape.
- `gen_ConstScalar_f32(val)` generates a constant float scalar with
value val.
- `gen_ConstScalar_i32(val)` generates a constant integer scalar with
value val.

## Tiling as Graph Transformations

> Turn some ops into smaller ops to make them practical, and those ops
> into still smaller ops, until locality is achieved.

For tiling, it is common to want to replace an op with the concatenation
of a set of smaller ops. To facilitate this, the `AUTOSPLIT`
replacement pattern helper will set this up. `AUTOSPLIT` takes as
parameters:

- The **output dimension to split on**
- **A variable** to hold information about the splitting process for a
replacement
- The **size of the split**
- The **replacement pattern**

For example,

    DEF_PACKAGE_OPTIMIZATION(
                EARLY+4,
                Op("MaxPool_valid","Act","W","S"),
                GT(DIM_DEPTH("*"), 32),
                AUTOSPLIT(3,"I",32,Op("MaxPool_valid",TYPICAL_SLICE("Act","I"),"W","S"))
    )

This will match a `MaxPool_valid` op with three inputs, enforce that
the number of output channels is greater than `32`, and then split
along the output channels into some number of replacements, each with at
most `32` channels. The replacement pattern is a `MaxPool_valid` op
where the input is replaced with a slice of input with the helper
`TYPICAL_SLICE`, which takes a slice of the specified input. All the
replacement ops are then concatenated together along the split dimension
automatically. If our input/output is `64` channels, the replacement
would look like:

    Concat(3,
        MaxPool_valid(
            Slice("Act", /* Slice control values here */),
            "W","S"),
        MaxPool_valid(
            Slice("Act", /* Slice control values here */),
            "W","S"))

If a typical slice doesn’t match what is needed, use
`AUTOSPLIT_SHAPEFN_APPLY` helper to apply a user-specified function to
generate the shape needed.

Breaking ops down into smaller ops helps the framework reduce the memory
footprint (by eliminating temporary results more quickly) and increases
parallelism by enabling ops to run in parallel.

It’s important to note that when the graph transformations are applied,
manipulations happen to the same graph, converting one valid graph to
another.

## Adding scratch memory to an Op

If an op implementation needs intermediate scratch memory, it can be added as
an additional temporary output using a constructor hook. The ‘scratch output’
feature allows additional output tensors to be added after the final input
tensor (and, where applicable, before the special parameters).

These ‘extra’ outputs do not correspond to outputs of the OpDef; when selecting
an Op for conversion from an OpDef, they are not considered. They are required to
be of a ‘concrete’ tensor class. Crouton or ‘flat’ tensor types can be used, in TCM
or non-TCM.

When an Op is generated which has at least one scratch output, an
‘op constructor hook’ is used to set the shape of the scratch outputs. The
type of the tensor is determined by the class of the reference. The ‘lifetime’
of memory allocated to the block is from the time the op starts running until the
time it completes. If the tensor is of a ‘quantized’ type, the output tensor is
always created with step = 1.0 and offset = 0; there is no way to change this.

When the Op is first constructed, the scratch outputs are all constructed with
shape (1,1,1,1). The constructor hook is expected to ‘resize’ them to the desired size.
The function change\_output\_tensor\_shape is given an Op &, the index of the
output to be changed (which must be a tensor of a supported type), and an array of
dimensions. The length of the array must be equal to, or less than, the rank of the
tensor. If it is less, extra ‘1’ dimensions are added on the left.
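
The left-padding rule can be shown with a standalone helper. `pad_dims_left` is a hypothetical function written for this illustration; it is not the implementation of change\_output\_tensor\_shape and assumes a rank-4 tensor.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Expands up to four dimensions to a full rank-4 shape by adding
// dimensions of size 1 on the left.
inline std::array<std::size_t, 4> pad_dims_left(const std::vector<std::size_t> &dims)
{
    assert(dims.size() <= 4);
    std::array<std::size_t, 4> full = {{1, 1, 1, 1}};
    const std::size_t offset = 4 - dims.size();
    for (std::size_t i = 0; i < dims.size(); ++i) {
        full[offset + i] = dims[i];
    }
    return full;
}
```

Passing three dimensions such as `{8, w, 32}` therefore describes the rank-4 shape `[1,8,w,32]`.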

Here is an example; the Op here needs the ‘scratch’ tensor shape to be [1,8,w,32]
where ‘w’ is the width of the output.

    // In the example, this op function is to be registered with
    //    TensType = QUint8CroutonTensor and QUint8CroutonTensor_TCM
    
    template <typename TensType>
    int example_func(TensType &out, TensType const &in, Tensor const &parms, TensType &scratch_out)
    {
        ...
    }
    
    #ifndef PREPARE_DISABLED
    
    namespace {
    class ExampleFuncConstructorHook : public hnnx::OpHookBase {
        // This is called after the output tensors are created, but before allocation.
        virtual GraphStatus pre_allocate(hnnx::OpIoPtrs const &iop, Op &op) const override
        {
            Tensor const *const out0 = op.get_output(0);
            size_t new_dims[3] = {8, out0->dim(2), 32};
            // change the shape of output 1
            GraphStatus result = hnnx::change_output_tensor_shape(op, 1, iop.graph(), 3, new_dims);
            if (result != GraphStatus::Success) {
                errlog("!! change_output_tensor_shape failed");
            }
            return result;
        }
    };
    } // namespace
    // register the constructor hook for all the Ops using example_func
    //
    CTOR_OPHOOK((example_func<QUint8CroutonTensor>), ExampleFuncConstructorHook)
    CTOR_OPHOOK((example_func<QUint8CroutonTensor_TCM>), ExampleFuncConstructorHook)
    #endif

## Tips for Optimization

More fine-grained tiling opens up more opportunities for parallelism and
finer-grained data management; however, it increases the amount of
metadata and per-op overhead.

It is recommended that the graph always remain correct. If there is a
rule that is optional (for example, tiling when the op handles arbitrary
sizes), use the same op name, but if there is a change in the behavior,
it is best to change the name. This way an implementation of the
original name or an implementation of the new op name has well-defined
behavior.

Apart from all this, the core framework performs the following
graph-level optimizations:

- **Constant Propagation**: If there is a chain of data that is all
constant, it is evaluated when the graph is prepared.
- **Dead Code Elimination**: Code that is not used during execution
is deleted.
- **Common Sub-expression Elimination**: Replaces identical expressions
with a single variable holding the computed value.
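
As a rough picture of what constant propagation means, the sketch below folds constant additions in a tiny expression tree. It is a generic illustration of the technique and says nothing about how the core framework implements it; `Expr`, `make_const`, `make_input`, `make_add`, and `fold_constants` are all hypothetical.

```cpp
#include <cassert>
#include <memory>

// Minimal expression tree: constants, runtime inputs, and additions.
struct Expr {
    enum Kind { Const, Input, Add } kind;
    float value;                     // valid when kind == Const
    std::shared_ptr<Expr> lhs, rhs;  // valid when kind == Add
};
using ExprPtr = std::shared_ptr<Expr>;

inline ExprPtr make_const(float v)
{
    return std::make_shared<Expr>(Expr{Expr::Const, v, nullptr, nullptr});
}
inline ExprPtr make_input()
{
    return std::make_shared<Expr>(Expr{Expr::Input, 0.0f, nullptr, nullptr});
}
inline ExprPtr make_add(ExprPtr a, ExprPtr b)
{
    return std::make_shared<Expr>(Expr{Expr::Add, 0.0f, a, b});
}

// Bottom-up folding: an Add whose operands are both constant is replaced
// by a single constant, so it is evaluated once at preparation time.
inline ExprPtr fold_constants(const ExprPtr &e)
{
    assert(e != nullptr);
    if (e->kind != Expr::Add) {
        return e;
    }
    ExprPtr l = fold_constants(e->lhs);
    ExprPtr r = fold_constants(e->rhs);
    if (l->kind == Expr::Const && r->kind == Expr::Const) {
        return make_const(l->value + r->value);
    }
    return make_add(l, r);
}
```

Folding `x + (2 + 3)` leaves an addition of the runtime input `x` and the single constant `5`; the inner addition no longer exists in the prepared tree.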

Last Published: Jun 04, 2026
