# Tensors and Memory Layout

## Details about Memory Layout

A memory layout describes how a tensor's data is arranged in memory in
HTP Core. Several layouts exist: `d32 layout, crouton layout, flat layout`,
and `specific layouts` for weights in convolutions. A memory layout has a
rank, an optional strided order in which dimensions are chunked, and an
order in which the chunks are laid out beside each other.

### Examples

#### *Flat Layout*

`FlatMemoryLayout<4>` can be thought of as
`ChunkedMemoryLayout<4, 0,0, 1,0, 2,0, 3,0>`

- `4`, the zeroth parameter, is the `rank` of the layout
- The rest of the parameters are given in `(dimension,size)` pairs,
  and it’s easiest to explain the pairs right-to-left:
  - `3,0` means “all the rest of dimension `3`”
  - `2,0` means “all the rest of dimension `2`”
  - `1,0` means “all the rest of dimension `1`”
  - `0,0` means “all the rest of dimension `0`”

In other words, dimension `3` is the fastest-moving dimension,
dimension `2` is the second fastest, then `1`, then `0`. If the tensor
dimensions are `2x3x5x30`, the data is laid out as follows:

    (0,0,0,0), (0,0,0,1) ... (0,0,0,29),
    (0,0,1,0), (0,0,1,1) ... (0,0,1,29),
    ...
    (0,0,4,0), (0,0,4,1), ... (0,0,4,29),
    (0,1,0,0), (0,1,0,1), ... (0,1,0,29),
    (0,1,1,0), (0,1,1,1), ... (0,1,1,29),
    ...
    (0,2,1,0), (0,2,1,1), ... (0,2,1,29),
    ...
    (0,2,4,0), (0,2,4,1), ... (0,2,4,29),
    (1,0,0,0), (1,0,0,1), ... (1,0,0,29),
    ...
    (1,2,4,0), (1,2,4,1), ... (1,2,4,29),

Most commonly, rank-`4` tensors in “`NHWC`” format are used:

- dimension `3` is `depth` or `channels`
- dimension `2` is `width`
- dimension `1` is `height`
- dimension `0` is `batches`

So in the above example, the numbers indicate
(`batch index, height index, width index, depth index`). An ellipsis
marks elements omitted from the listing, so the entries on either side
of it are not adjacent in memory. Line breaks are only there for
readability; the data is contiguous across them.
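
As a minimal sketch of the ordering above, the element offset of an index in a rank-4 flat layout can be computed by treating dimension `3` as the fastest-moving one. The names `Shape4` and `flat_offset` are hypothetical and are not part of the HTP Core API; the code assumes C++14 or later.

```cpp
#include <array>
#include <cstddef>

using Shape4 = std::array<std::size_t, 4>;

// Hypothetical helper, not an HTP Core API: element offset of index
// (batch, height, width, depth) in a rank-4 flat layout where
// dimension 3 moves fastest and dimension 0 slowest.
constexpr std::size_t flat_offset(const Shape4 &dims, const Shape4 &idx) {
    std::size_t offset = 0;
    for (std::size_t d = 0; d < 4; ++d) {
        // Each step scales by the size of the next, faster dimension.
        offset = offset * dims[d] + idx[d];
    }
    return offset;
}

// Checks against the 2x3x5x30 listing above.
static_assert(flat_offset({2, 3, 5, 30}, {0, 0, 0, 29}) == 29, "end of first row");
static_assert(flat_offset({2, 3, 5, 30}, {0, 0, 1, 0}) == 30, "next width index");
static_assert(flat_offset({2, 3, 5, 30}, {0, 1, 0, 0}) == 150, "next height index");
static_assert(flat_offset({2, 3, 5, 30}, {1, 0, 0, 0}) == 450, "next batch");
```

The offsets are in elements, not bytes, and say nothing about where the buffer itself starts.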

If the user wants the dimensions to mean “`NHWC`” but wants the data
laid out in memory as “`NCHW`”, the Memory Layout can express this:
`ChunkedMemoryLayout<4, 0,0, 3,0, 1,0, 2,0>` is one way to
represent it.

By changing the `MemoryLayout`, the user can change how data is
organized in memory without changing how ops that use the basic tensor
interfaces work, while the `C++` infrastructure still guarantees type
safety (so that the user doesn’t feed `NHWC` data to an op expecting
`NCHW` format, for example).
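
To illustrate how only the ordering changes, the following sketch computes offsets for the same logical index under two dimension orders. `ordered_offset` and the `k...` constants are hypothetical names, not HTP Core APIs. The `order` array is assumed to mirror the left-to-right `dimension,0` pairs of an unchunked layout. C++14 or later is assumed.

```cpp
#include <array>
#include <cstddef>

using Shape4 = std::array<std::size_t, 4>;

// Hypothetical helper, not an HTP Core API. `order` lists the dimensions
// from slowest-moving to fastest-moving, mirroring the left-to-right
// `dimension,0` pairs of a layout without fixed-size chunks.
constexpr std::size_t ordered_offset(const Shape4 &dims, const Shape4 &idx,
                                     const Shape4 &order) {
    std::size_t offset = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        const std::size_t d = order[i];
        offset = offset * dims[d] + idx[d];
    }
    return offset;
}

constexpr Shape4 kDims{2, 3, 5, 30};  // logical N, H, W, C sizes
constexpr Shape4 kNHWC{0, 1, 2, 3};   // <4, 0,0, 1,0, 2,0, 3,0>
constexpr Shape4 kNCHW{0, 3, 1, 2};   // <4, 0,0, 3,0, 1,0, 2,0>

// The same logical index (n=0, h=0, w=1, c=0) lands at different offsets.
static_assert(ordered_offset(kDims, {0, 0, 1, 0}, kNHWC) == 30, "NHWC: skip 30 channels");
static_assert(ordered_offset(kDims, {0, 0, 1, 0}, kNCHW) == 1, "NCHW: width is fastest");
static_assert(ordered_offset(kDims, {0, 0, 0, 1}, kNCHW) == 15, "NCHW: one 3x5 plane per channel");
```

The index passed in is always in logical `NHWC` terms; only the `order` argument decides where it lands.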

#### *Crouton Layout*

While the flat memory format is good for interaction with other
environments, the user may want memory in a format that is highly
amenable to how the hardware works with it. This requires making the
data more uniform in size and ensuring that data processed together is
contiguous in memory.

Crouton layout `R4CroutonLayout` is
`ChunkedMemoryLayout<4, 0,0, 1,0, 2,0, 3,0, 1,8, 2,8, 3,32>`

- `4`, the zeroth parameter, is the rank of the layout
- The rest of the parameters are given in `(dimension,size)` pairs,
  and it’s easiest to explain the pairs right-to-left:
  - `3,32` means `32` elements in dimension `3`
  - `2,8` means `8` contiguous chunks of everything to the right in
    dimension `2`
  - `1,8` means `8` contiguous chunks of everything to the right in
    dimension `1`
  - `3,0` means “all the rest of dimension `3`”
  - `2,0` means “all the rest of dimension `2`”
  - `1,0` means “all the rest of dimension `1`”
  - `0,0` means “all the rest of dimension `0`”

The numbers `1,8, 2,8, 3,32` mean that the croutons have a chunk size
of `1x8x8x32`. If a dimension is not a multiple of its chunk size (`8`
or `32`), it is **padded** up to the next multiple.
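
The padding rule can be sketched as rounding each dimension up to a multiple of the chunk size. `round_up`, `padded_dims`, and `kCroutonChunk` are hypothetical names used only for illustration, not HTP Core APIs. C++14 or later is assumed.

```cpp
#include <array>
#include <cstddef>

using Shape4 = std::array<std::size_t, 4>;

// Chunk size per dimension, read from `1,8, 2,8, 3,32`. Dimension 0 has
// no fixed-size pair, so its chunk size is taken to be 1.
constexpr Shape4 kCroutonChunk{1, 8, 8, 32};

// Round n up to the next multiple of `multiple`.
constexpr std::size_t round_up(std::size_t n, std::size_t multiple) {
    return (n + multiple - 1) / multiple * multiple;
}

// Hypothetical helper, not an HTP Core API: padded size of each dimension.
constexpr Shape4 padded_dims(const Shape4 &dims, const Shape4 &chunk) {
    return {round_up(dims[0], chunk[0]), round_up(dims[1], chunk[1]),
            round_up(dims[2], chunk[2]), round_up(dims[3], chunk[3])};
}

// A 1x3x5x30 tensor occupies exactly one 1x8x8x32 crouton.
constexpr Shape4 kSmall = padded_dims({1, 3, 5, 30}, kCroutonChunk);
static_assert(kSmall[0] == 1 && kSmall[1] == 8 && kSmall[2] == 8 && kSmall[3] == 32,
              "1x3x5x30 pads to 1x8x8x32");

// A 2x9x20x50 tensor needs several croutons.
constexpr Shape4 kLarge = padded_dims({2, 9, 20, 50}, kCroutonChunk);
static_assert(kLarge[0] == 2 && kLarge[1] == 16 && kLarge[2] == 24 && kLarge[3] == 64,
              "2x9x20x50 pads to 2x16x24x64");
```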

For example, if the tensor dimension is `1x3x5x30`, the data gets
padded to `1x8x8x32`, and then the data is laid out like the
following:

    (0,0,0,0), (0,0,0,1), ... (0,0,0,29),  (0,0,0,30), (0,0,0,31),

> `(0,0,0,0)` to `(0,0,0,29)` is valid data; `(0,0,0,30)` and `(0,0,0,31)`
> are pad introduced by the memory layout.

    (0,0,1,0), (0,0,1,1), ... (0,0,1,29), (0,0,1,30), (0,0,1,31),
    ...
    (0,0,4,0), (0,0,4,1), ... (0,0,4,29), (0,0,4,30), (0,0,4,31),

> `(0,0,4,0)` to `(0,0,4,29)` is valid data;
> `(0,0,4,30), (0,0,4,31), (0,0,5,0) ... (0,0,7,31)` is pad
> introduced by the memory layout.

    (0,0,5,0), (0,0,5,1), ... (0,0,5,31),
    ...
    (0,0,7,0), (0,0,7,1), ... (0,0,7,31),
    (0,1,0,0), (0,1,0,1), ... (0,1,0,31),
    (0,1,1,0), (0,1,1,1), ... (0,1,1,31),
    ...
    (0,2,1,0), (0,2,1,1), ... (0,2,1,31),
    ...
    (0,2,4,0), (0,2,4,1), ... (0,2,4,29), (0,2,4,30), (0,2,4,31)

> `(0,2,4,0)` to `(0,2,4,29)` is valid data;
> `(0,2,4,30) ... (0,2,7,31), (0,3,0,0) ... (0,7,7,31)` is pad
> introduced by the memory layout.
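
The walk through a single crouton above can be summarized in a short sketch. `in_crouton_offset` and `is_pad` are hypothetical helpers, not HTP Core APIs, and the valid region is hard-coded to the `1x3x5x30` example.

```cpp
#include <cstddef>

// Hypothetical helper, not an HTP Core API. Offset of (h, w, c) inside one
// 8x8x32 crouton (`1,8, 2,8, 3,32`): channels move fastest, then width,
// then height.
constexpr std::size_t in_crouton_offset(std::size_t h, std::size_t w, std::size_t c) {
    return (h * 8 + w) * 32 + c;
}

// True when a slot of the padded 1x8x8x32 crouton lies outside the valid
// 1x3x5x30 region of the example tensor.
constexpr bool is_pad(std::size_t h, std::size_t w, std::size_t c) {
    return h >= 3 || w >= 5 || c >= 30;
}

static_assert(in_crouton_offset(0, 0, 29) == 29, "last valid channel of the first row");
static_assert(in_crouton_offset(0, 1, 0) == 32, "rows are 32 slots apart, not 30");
static_assert(in_crouton_offset(1, 0, 0) == 256, "each height step spans 8 * 32 slots");
static_assert(is_pad(0, 0, 30) && is_pad(0, 5, 0) && is_pad(3, 0, 0), "pad slots");
static_assert(!is_pad(2, 4, 29), "last valid element");
```

Of the 2048 slots in the crouton, only `3 * 5 * 30 = 450` hold valid data in this example.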

This explains what `1,8, 2,8, 3,32` means: it describes how data is
laid out within fixed-size chunks. However, the order of those chunks in
memory also needs to be determined.

As with the `FlatMemoryLayout` above, the user can define an arbitrary
order for those chunks. That is what the 0-sized dimensions mean in the
`MemoryLayout`.

So in the example here, where the ordering is `0,0, 1,0, 2,0, 3,0`,
chunks along dimension `3` are placed together first, followed by all
the chunks needed to cover dimension `2`, and so on.

If there is a `2x9x20x50` tensor, for example, it gets padded to
`2x16x24x64`. It is laid out in memory as follows:

    (0,0,0,0) ... (0,0,0,31)
    (0,0,1,0) ... (0,0,1,31)
    ...
    (0,0,7,0) ... (0,0,7,31)
    (0,1,0,0) ... (0,1,7,31)
    ...
    (0,7,7,0) ... (0,7,7,31) > end of chunk

    (0,0,0,32) ... (0,0,0,63) > start of chunk
    (0,0,1,32) ... (0,0,1,63)
    ...
    (0,0,7,32) ... (0,0,7,63)
    (0,1,0,32) ... (0,1,7,63)
    ...
    (0,7,7,32) ... (0,7,7,63) > end of chunk, finished traversing all 64 in dimension 3
    
    (0,0,8,0) ...  (0,0,8,31)
    (0,0,9,0) ...  (0,0,9,31)
    ...
    (0,0,15,0) ... (0,0,15,31)
    (0,1,8,0) ...  (0,1,15,31)
    ...
    (0,7,8,0) ...  (0,7,15,31)
    
    (0,0,8,32) ... (0,0,15,63)
    ...
    (0,7,8,32) ... (0,7,15,63)
    (0,0,16,0) ... (0,0,23,63)
    ...
    (0,7,16,32) ... (0,7,23,63) > finished traversing all 24 in dimension 2
    
    (0,8,0,0) ... (0,8,23,63) (0,15,0,0) ... (0,15,23,63) > finished traversing all 16 in dimension 1
    (1,0,0,0) ... (1,8,23,63) (1,15,0,0) ... (1,15,23,63) > end of memory layout

Note that the `FlatMemoryLayout` is just the special case of
`ChunkedMemoryLayout` where the Chunk Size is the minimal one (1
element in every dimension).
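
Putting the chunk contents and the chunk ordering together, the position of a logical index in the `2x9x20x50` example can be sketched as a pair: which crouton it falls in, and where inside that crouton. `CroutonPos` and `crouton_pos` are hypothetical names, not HTP Core APIs. The two parts are kept separate because this sketch makes no assumption about where each crouton sits in memory.

```cpp
#include <cstddef>

// Hypothetical sketch, not an HTP Core API. Position of logical index
// (b, h, w, c) in R4CroutonLayout for a tensor padded to 2x16x24x64.
struct CroutonPos {
    std::size_t chunk;   // which 8x8x32 crouton, counted in layout order
    std::size_t within;  // element offset inside that crouton
};

constexpr CroutonPos crouton_pos(std::size_t b, std::size_t h,
                                 std::size_t w, std::size_t c) {
    // Croutons per dimension after padding: 16/8 = 2, 24/8 = 3, 64/32 = 2.
    // Chunk order `0,0, 1,0, 2,0, 3,0`: dimension 3 chunks advance first,
    // then dimension 2, then dimension 1, then dimension 0.
    return {((b * 2 + h / 8) * 3 + w / 8) * 2 + c / 32,
            ((h % 8) * 8 + w % 8) * 32 + c % 32};
}

// Checks against the listing above.
static_assert(crouton_pos(0, 7, 7, 31).chunk == 0, "end of the first chunk");
static_assert(crouton_pos(0, 7, 7, 31).within == 2047, "last slot of the first chunk");
static_assert(crouton_pos(0, 0, 0, 32).chunk == 1, "second chunk covers channels 32..63");
static_assert(crouton_pos(0, 0, 8, 0).chunk == 2, "then the next 8 columns");
static_assert(crouton_pos(0, 8, 0, 0).chunk == 6, "then the next 8 rows");
static_assert(crouton_pos(1, 0, 0, 0).chunk == 12, "then the next batch");
```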

### Practical tips for working with Crouton

- Chunks are not consecutive in memory (i.e. there is a gap in memory
  between chunks).
- Usually, use `get_raw(first element's idx in chunk)` to retrieve the
  starting memory location of a chunk.
- Operations such as aligned copy can’t go across chunks.
- Crouton padding is automatic and is `31` (not `0`).
- User padding needs to be explicitly set to quantized `0` (or another
  specified value).

### A More complicated example of Memory Layout

For convolution, the weight layout is
`ChunkedMemoryLayout<4, 3,0, 2,0, 0,0, 1,0, 2,8, 3,32, 2,4>`. In HTP
Core, weight dimension `0` is the filter height, dimension `1` is the
filter width, dimension `2` matches the number of input channels, and
dimension `3` is the number of output channels.

So in this case,
`ChunkedMemoryLayout<4, 3,0, 2,0, 0,0, 1,0, 2,8, 3,32, 2,4>` means:

- `4`, the zeroth parameter, is the rank of the layout
- The rest of the parameters are given in `(dimension,size)` pairs,
  and it’s easiest to explain the pairs right-to-left:
  - `2,4` means `4` contiguous elements in dimension `2` (which
    matches the input depth)
  - `3,32` means `32` contiguous chunks of everything to the right in
    dimension `3` (the output depth)
  - `2,8` means `8` contiguous chunks of everything to the right in
    dimension `2`
  - `1,0` means “all the rest of dimension `1`”
  - `0,0` means “all the rest of dimension `0`”
  - `2,0` means “all the rest of dimension `2`”
  - `3,0` means “all the rest of dimension `3`”

So if there is a `3x3x32x32` filter, the memory is laid out as
follows:

    (0,0,0,0), (0,0,1,0), (0,0,2,0), (0,0,3,0),
    (0,0,0,1), (0,0,1,1), (0,0,2,1), (0,0,3,1),
    (0,0,0,2) ...                    (0,0,3,31),
    (0,0,4,0), (0,0,5,0), (0,0,6,0), (0,0,7,0),
    (0,0,4,1), (0,0,5,1), (0,0,6,1), (0,0,7,1),
    (0,0,4,2) ... (0,0,7,31), ...   (0,0,31,31),
    (0,1,0,0), (0,1,1,0), ...       (0,1,31,31),
    ...
    (0,2,0,0), ...                  (0,2,31,31),
    (1,0,0,0), ...                  (1,0,31,31)

So the rightmost fixed-size pairs give the block size of a
“`chunk`”. Those chunks are then ordered in the way the computation
needs. The `3,0, 2,0, 0,0, 1,0` part means that if there are more
output channels (dimension `3`) or more input channels (dimension `2`)
than fit in one chunk, those chunks come after the whole group of
`blocks x width x height`, and that input channels (dim `2`) are more
contiguous than output channels (dim `3`). “0” here means “not a fixed
size, just all the rest of this dimension”.

So a `3x3x32x50` tensor is just fine. It gets padded to `3x3x32x64`,
and the format then places `(0,0,0,0)...(2,2,31,31)` first, followed by
`(0,0,0,32)...(2,2,31,63)`.

For even more clarity, if there is a `3x3x64x96` tensor, it is laid
out in memory as follows:

    (0,0,0,0)...(2,2,31,31),
    (0,0,32,0)...(2,2,63,31),
    (0,0,0,32)...(2,2,31,63),
    (0,0,32,32)...(2,2,63,63),
    (0,0,0,64)...(2,2,31,95),
    (0,0,32,64)...(2,2,63,95),

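This is because dimension `2` is “more major” than dimension `3`:
`2,0` appears to the right of `3,0`, so all input-channel chunks for one
group of output channels are laid out before the next group of output
channels begins.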
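
The weight ordering can be sketched in the same style: a block number in layout order plus an offset inside the `32x32` input/output-channel block. `WeightPos` and `weight_pos` are hypothetical names, not HTP Core APIs, and the formulas are derived only from the pair list and listings above.

```cpp
#include <cstddef>

// Hypothetical sketch, not an HTP Core API. Position of a weight element
// under <4, 3,0, 2,0, 0,0, 1,0, 2,8, 3,32, 2,4>.
struct WeightPos {
    std::size_t block;   // which 32x32 (input x output channel) block, in layout order
    std::size_t within;  // element offset inside that block
};

// i0 = filter row, i1 = filter column, i2 = input channel, i3 = output channel.
// `height` and `width` are the filter sizes; `in_blocks` is the padded
// input-channel count divided by 32.
constexpr WeightPos weight_pos(std::size_t i0, std::size_t i1,
                               std::size_t i2, std::size_t i3,
                               std::size_t height, std::size_t width,
                               std::size_t in_blocks) {
    // Block order `3,0, 2,0, 0,0, 1,0`: width fastest, then height, then
    // input-channel blocks, then output-channel blocks.
    // Inside a block `2,8, 3,32, 2,4`: 4 input channels fastest, then 32
    // output channels, then 8 groups of input channels.
    return {((i3 / 32 * in_blocks + i2 / 32) * height + i0) * width + i1,
            ((i2 % 32) / 4 * 32 + i3 % 32) * 4 + i2 % 4};
}

// 3x3x32x32 filter (one input-channel block), matching the first listing.
static_assert(weight_pos(0, 0, 1, 0, 3, 3, 1).within == 1, "input channel is fastest");
static_assert(weight_pos(0, 0, 0, 1, 3, 3, 1).within == 4, "then output channel");
static_assert(weight_pos(0, 0, 4, 0, 3, 3, 1).within == 128, "then the next 4 input channels");
static_assert(weight_pos(0, 1, 0, 0, 3, 3, 1).block == 1, "then filter width");
static_assert(weight_pos(1, 0, 0, 0, 3, 3, 1).block == 3, "then filter height");

// 3x3x64x96 filter (two input-channel blocks), matching the second listing.
static_assert(weight_pos(0, 0, 32, 0, 3, 3, 2).block == 9, "input channels 32..63 come next");
static_assert(weight_pos(0, 0, 0, 32, 3, 3, 2).block == 18, "then output channels 32..63");
```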

#### Different memory layouts

Based on the above description, here is a quick summary of how each of
the layouts is defined:

| Type | Memory Layout |
| --- | --- |
| `R4FlatMemoryLayout` | `FlatMemoryLayout<4>` |
| `R4NCHWMemoryLayout` | `ChunkedMemoryLayout<4, 0,0, 3,0, 2,0, 1,0>` |
| `R4Depth32MemoryLayout` | `ChunkedMemoryLayout<4, 0,0, 1,0, 3,0, 2,0, 2,4, 3,32>` |
| `R4CroutonLayout` | `ChunkedMemoryLayout<4, 0,0, 1,0, 2,0, 3,0, 1,8, 2,8, 3,32>` |
| `R4Crouton4x1Layout` | `ChunkedMemoryLayout<4, 0,0, 1,0, 2,0, 3,0, 1,8, 2,2, 3,32, 2,4>` |
| `R4Crouton2x2Layout` | `ChunkedMemoryLayout<4, 0,0, 1,0, 2,0, 3,0, 1,4, 2,4, 3,32, 1,2, 2,2>` |
| `R4Crouton2Layout` | `ChunkedMemoryLayout<4, 0,0, 1,0, 2,0, 3,0, 1,8, 2,2, 3,32, 2,2>` |

## Details about Tensors

Concrete tensors (real ones) have things like an underlying type,
interface, padding, and memory layout.

The underlying type is just that: what kind of data is actually kept in
the tensor.

The interface is the way information is encoded/decoded into the
underlying data type. For example, `PlainInterface` just returns the
value, but `ScaleOffsetInterface` applies the offset and scale value
(for quantized types).
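
As a rough illustration of the difference between the two kinds of interface, the sketch below decodes a stored value either unchanged or through a scale and offset. `PlainDecode` and `ScaleOffsetDecode` are hypothetical stand-ins, not the HTP Core classes, and the convention `real = (stored - offset) * scale` is an assumption; the actual convention should be checked against the SDK headers.

```cpp
#include <cstdint>

// Hypothetical stand-in for a plain interface: the stored value is
// returned as-is.
struct PlainDecode {
    constexpr float operator()(float stored) const { return stored; }
};

// Hypothetical stand-in for a scale/offset interface over 8-bit data.
struct ScaleOffsetDecode {
    float scale;
    std::int32_t offset;

    // Assumed convention: real = (stored - offset) * scale.
    constexpr float operator()(std::uint8_t stored) const {
        return (static_cast<std::int32_t>(stored) - offset) * scale;
    }
};

static_assert(PlainDecode{}(2.5f) == 2.5f, "plain decode is the identity");
static_assert(ScaleOffsetDecode{0.5f, 128}(130) == 1.0f, "(130 - 128) * 0.5");
static_assert(ScaleOffsetDecode{0.5f, 128}(128) == 0.0f, "the offset maps to zero");
```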

One of the more interesting parts of a tensor is the Memory Layout. In
HTP Core, the user can define arbitrary memory layouts.

Memory Layouts have a set of fixed sizes, which define the size of each
chunk. They also define the order in which those chunks are arranged to
cover the entire tensor.

Let’s look at some example Crouton formats:

    ChunkedMemoryLayout<
        /* RANK */ 4,
        /* Least Major: Batch Dim, all the rest */ 0,0,
        /* Next least major: height, all the rest */ 1,0,
        /* Next least major: width, all the rest */ 2,0,
        /* Next least major: depth, all the rest */ 3,0,
        /* 8 rows high */ 1,8,
        /* 8 columns wide */ 2,8,
        /* 32 channels deep */ 3,32> ChannelMajorCrouton;
    
    ChunkedMemoryLayout<
        /* RANK */ 4,
        /* Least Major: Batch Dim, all the rest */ 0,0,
        /* Next least major: height, all the rest */ 1,0,
        /* Next least major: width, all the rest */ 2,0,
        /* Next least major: depth, all the rest */ 3,0,
        /* 4 high */ 1,4,
        /* 4 wide */ 2,4,
        /* 32 channels deep */ 3,32,
        /* 2 rows */ 1,2,
        /* 2 cols */ 2,2> SpatialXYMajor;
    
    ChunkedMemoryLayout<
        /* RANK */ 4,
        /* Least Major: Batch Dim, all the rest */ 0,0,
        /* Next least major: height, all the rest */ 1,0,
        /* Next least major: width, all the rest */ 2,0,
        /* Next least major: depth, all the rest */ 3,0,
        /* 8 high */ 1,8,
        /* 2 wide */ 2,2,
        /* 32 channels deep */ 3,32,
        /* 4 cols */ 2,4> SpatialXMajor;

The infrastructure supports the use of all of these formats. Generic ops
can use any of the formats indicated here.

Last Published: Jun 04, 2026
