# The Halide programming language

Halide is a domain-specific language for vision and image processing algorithms,
which allow you to author efficient image processing pipelines. Halide separates
the program into two conceptual parts:

- **Algorithm** – Defines what is computed at each pixel
- **Schedule** – Defines how the computation should be organized

Every Halide program must consist of both parts, which you are to specify.
Together, they are called the *pipeline*.

Note

This section is only a brief introduction to the Halide programming language.
It is not intended to be a complete guide to the Halide programming language.
More resources for learning the language can be found at
[https://halide-lang.org](https://halide-lang.org)

The following example shows a pipeline that is blurring an image in the
horizontal dimension.

1int main() {
    2    // Image with 8 bits per pixel.
    3    ImageParam input(UInt(8), 2)
    4    Halide::Func f;
    5    // Algorithm.
    6    f(x, y) =  (input(x-1, y) + input(x, y) + input(x+1, y))/3;
    7    // Schedule
    8    f.vectorize(x, 128).parallel(y, 16);
    9}
    Copy to clipboard

The algorithm concisely defines the `blur` computation in terms of the
`x` and `y` coordinates of a pixel. The schedule directs how the
computation should be conducted. This example directs Halide to
vectorize the `blur` algorithm by a factor of 128 pixels, parallelize the
algorithm by a factor of 16 rows, and launch it on the Hexagon cDSP.

The Halide HVX compiler automatically translates your high-level image algorithm
into efficient HVX object code.

## [Algorithm](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id7)

The algorithm defines what is computed at each pixel. Following are the
constructs that Halide provides to write an algorithm.

### [Func](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id8)

A `Func` represents a pipeline stage. It describes what value a pixel should
have.

A `Func` is backed by a memory buffer, which can be thought of as an image
(Halide pipelines pass buffers as `halide_buffer_t` objects; see [Memory](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#memory)). In
the following example, `f` is a `Func` that is used to compute an image
where every pixel is the sum of its `x` and `y` coordinates.

f(x, y) = x + y;
    Copy to clipboard

A `Func` can be an input to another `Func`. In this case, an image
processing pipeline can be constructed by cascading multiple `Funcs` (where
each is a pipeline stage) in producer-consumer relationships.

The following example is a Gaussian Blur filter represented in Halide.

ImageParam input(UInt(8), 2)
    Func bounded_input{"bounded_input"};
    Func input_16{"input_16"};
    Func rows{"rows"};
    Func cols{"cols"};
    Func output{"output"};
    
    bounded_input(x, y) = BoundaryConditions::repeat_edge(input)(x, y);
    input_16(x, y) = cast<int16_t>(bounded_input(x, y));
    rows(x, y) = input_16(x, y-2) + 4*input_16(x, y-1) + 6*input_16(x,y)+ 4*input_16(x,y+1) + input_16(x,y+2);
    cols(x,y) =rows(x-2, y) + 4*rows(x-1, y) + 6*rows(x, y) + 4*rows(x+1, y) + rows(x+2, y);
    output(x, y)= cast<uint8_t> (cols(x, y) >> 8);
    Copy to clipboard

In this example:

- `bounded_input`, `input_16`, `rows`, `cols`, and `output` are all
`Funcs`, each representing a stage of this 5x5 Gaussian Blur filter.
- `output` is the final stage of the pipeline, and the memory buffer that is
associated with `output` contains the blurred image.

### [Var](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id9)

A `Var` is a name to use as a variable when defining a `Func`. `Var`
instances are used to refer to the dimensions of a `Func`, although that
is not their only use. For example, a `Func` can be defined using a `Var`,
as shown in the following example.

blur_x(x , y) = (input(x-1, y) + input(x, y) + input(x+1, y))/3;
    Copy to clipboard

Tip

It is important to understand that `x` and `y` have no meaning on their
own. However, when used with `blur_x` and `input`, they signify the two
dimensions of these two `Funcs`.

### [Expr](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id10)

An `Expr` in Halide is composed of `Funcs`, `Vars`, and other `Exprs`.
Following are some examples.

Expr e = x + y;
    Output(x, y) = 3 * e + x;
    Copy to clipboard

In these examples, `x` and `y` are `Vars`, and `Output` is a `Func`.

### [Pure and Update definitions](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id11)

A `Func` can be defined in multiple passes. The first definition is a simple
definition for all the points in the domain of the `Func`. This is called the
Pure definition of the `Func`

For example, following is a Pure definition of `f`.

Func f{"f"};
    Var x{"x"}, y{"y"};
    f(x, y) = x + y;
    Copy to clipboard

This example is simply a mapping from `Vars` (`x` and `y`) to an `Expr`.
However, later definitions can include computed expressions on both sides of the
assignment.

f(3, 7) = 42;
    Copy to clipboard

These definitions are called Update definitions of a `Func`. A special case of
Update definitions is a class of definitions called Reduction definitions. They
are Update definitions that recursively refer to the value of `Func` at the
same site.

f(x, y) = f(x, y) * 10;
    Copy to clipboard

You can confine the update to a single row or a single column.

// Updates restricted to a row allow us to refer to
    // other values in the same column (x)
    f(x, 1) = f(x, 2) - f(x, 0);
    
    // Updates restricted to a column allow us to refer to
    // other values in the same row (y)
    f(1, y) = f(2, y) - f(0, y);
    Copy to clipboard

Pure definitions and Update definitions can be scheduled independently of each
other. To schedule Update definitions, use `Func::update(int)` to get a
handle to an update step. This is shown in the following example for the Pure
definition and the two update steps in the previous example.

// Schedule pure defintion
    f.vectorize(x, 128);
    // Schedule first update definition that works on one
    // row.
    f.update(0).vectorize(x, 128);
    // Schedule second update definition that works on one
    // column.
    f.update(1).parallel(y);
    Copy to clipboard

#### RDom

An `RDom` is short for a reduction domain, and it is a handy way to write
updates in a loop. Consider the following Pure definition of a `Func`:

f(x, y) = x + y;
    Copy to clipboard

To square the first ten rows requires ten Update definitions, as shown in the following
example.

f(x, 0) = f(x, 0) * f(x, 0);
    f(x, 1) = f(x, 1) * f(x, 1);
    f(x, 2) = f(x, 2) * f(x, 2);
    f(x, 3) = f(x, 3) * f(x, 3);
    f(x, 4) = f(x, 4) * f(x, 4);
    f(x, 5) = f(x, 5) * f(x, 5);
    f(x, 6) = f(x, 6) * f(x, 6);
    f(x, 7) = f(x, 7) * f(x, 7);
    f(x, 8) = f(x, 8) * f(x, 8);
    f(x, 9) = f(x, 9) * f(x, 9);
    Copy to clipboard

However, it would be much easier to put this in a loop that can be generated for
you. Use an `RDom` to rewrite the previous example as follows.

RDom r(0, 10);
    f(x, r) = f(x, r) * f(x, r);
    Copy to clipboard

`RDom` can also be used over multiple dimensions.

ImageParam input(type_of<uint8_t>(), 2);
    Func histogram{"histogram"};
    RDom r(0, input.width(), 0, input.height());
    
    // Pure definition.
    histogram(x, y) = 0;
    
    // Update definition - a reduction definition
    histogram(input(r.x, r.y)) += 1;
    Copy to clipboard

## [Schedule](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id12)

The schedule is the other integral part of a Halide program that specifies how
the computation of the algorithm is to be structured. It entails the
specification of storage allocation for, and the relative order of, computation
of the different stages (`Func`) of the algorithm with each other. It also
specifies how each stage is to be computed.

The following example shows a schedule for the 5x5 Gaussian Blur shown in the
previous example.

1ImageParam input(UInt(8), 2)
     2Func bounded_input{"bounded_input"};
     3Func input_16{"input_16"};
     4Func rows{"rows"};
     5Func cols{"cols"};
     6Func output{"output"};
     7
     8bounded_input(x, y) = BoundaryConditions::repeat_edge(input)(x, y);
     9input_16(x, y) = cast<int16_t>(bounded_input(x, y));
    10input_16(x, y) = cast<int16_t>(input(x, y));
    11rows(x, y) = input_16(x, y-2) + 4*input_16(x, y-1) + 6*input_16(x,y)+ 4*input_16(x,y+1) + input_16(x,y+2);
    12cols(x,y) =rows(x-2, y) + 4*rows(x-1, y) + 6*rows(x, y) + 4*rows(x+1, y) + rows(x+2, y);
    13output(x, y)= cast<uint8_t> (cols(x, y) >> 8);
    14
    15output.compute_root()
    16      .parallel(y, 16)
    17      .vectorize(x, 128);
    18rows.compute_at(op, y);
    Copy to clipboard

The schedule in the example specifies that the output stage (`Output`) is to
be computed by vectorizing the `x` dimension while the rows (`y` dimension)
are executed in parallel on multiple threads (each thread works on 16 rows of
the output). Further, it says that the `Func` `rows` are computed in the
`y` loop of `Output`. The following example shows the equivalent pseudo
C/C++ code for this schedule.

// parallel loop over Output.y
    for<parallel> (Output.y = Output.y.min; y < Output.y.extent/16; ++Output.y) {
    
      // Compute the Func rows here, in the 'y' loop of Output
      for (rows.y = rows.y.min; rows.y < rows.y.extent; ++rows.y) {
        for (rows.x = rows.x.min; rows.x < rows.x.extent; ++rows.x) {
          rows(rows.x, rows.y) = ...;
        }
      }
    
      for (Output.y_inner = 0; y_inner < 16; ++Output.y_inner) {
        for (Output.x = Output.x.min; Output.x < (Output.x.extent/128); ++Output.x) {
          for<vectorized_loop> (Output.x_inner = Output.x*128; Output.x_inner < Output.x*128 + 128; ++Output.x_inner) {
             op(Output.x_inner, Output.y_inner) = ...;
          }
        }
      }
    }
    Copy to clipboard

Note

The other stages of the pipeline are not present by name in the pseudo code.
By default, if a schedule is not specified for a `Func`, it is computed
inline in its consumer.

### [hexagon](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id13)

hexagon()
    Copy to clipboard

Halide can offload parts of a pipeline to the DSP with the `.hexagon`
scheduling directive. This directive is used in the two Offload modes
(see [Halide execution modes](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#executionmodes)). However, this scheduling directive requires
the `hvx` target feature to be used when
compiling the pipeline. In the following example, the `Func`, `output`, and
therefore the entire pipeline, are offloaded to the DSP and vectorized for HVX.

// Algorithm
    bounded_input(x, y) = BoundaryConditions::repeat_edge(input)(x, y);
    input_16(x, y) = cast<int16_t>(bounded_input(x, y));
    input_16(x, y) = cast<int16_t>(input(x, y));
    rows(x, y) = input_16(x, y-2) + 4*input_16(x, y-1) + 6*input_16(x,y)+ 4*input_16(x,y+1) + input_16(x,y+2);
    cols(x,y) =rows(x-2, y) + 4*rows(x-1, y) + 6*rows(x, y) + 4*rows(x+1, y) + rows(x+2, y);
    output(x, y)= cast<uint8_t> (cols(x, y) >> 8);
    
    // Schedule
    output.hexagon().vectorize(x, 128);
    Copy to clipboard

### [reorder](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id14)

reorder(dim_innermost, ... , dim_outermost)
    // dim_innermost : Innermost dimension in the new reordering
    // dim_outermost : Outermost dimension in the new reordering
    Copy to clipboard

The `reorder` directive reorders the dimensions of a `Func` in the order
of the dimensions as specified. This reordering leads to the reordering of the
loops of the loop nest computing that `Func`. The dimensions are specified
from the innermost loop going outwards.

The following example of a `Func` simply adds 10 to each element (or pixel) of
its input.

f(x, y) = input(x, y) + 10;
    Copy to clipboard

Halide generates the corresponding pseudo loop nest.

for (f.y = f.y.min; f.y < f.y.extent; ++f.y) {
      for (f.x = f.x.min; f.x < f.x.extent; ++f.x) {
        f(f.x, f.y) = input(f.x, f.y) + 10;
      }
    }
    Copy to clipboard

Now, reorder it as follows:

f(x, y) = input(x, y) + 10;
    f.reorder(y, x); // inner to outer, left to right.
    Copy to clipboard

Halide generates the corresponding pseudo loop nest.

for (f.x = f.x.min; f.x < f.x.extent; ++f.x) {
      for (f.y = f.y.min; f.y < f.y.extent; ++f.y) {
        f(f.x, f.y) = input(f.x, f.y) + 10;
      }
    }
    Copy to clipboard

### [split](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id15)

split(old_dim, new_outer_dim, new_inner_dim, factor, tail_strategy);
    // old_dim         : dimension to be split
    // new_outer_dim : name of the new outer dimension
    // new_inner_dim : name of the new inner dimension
    // factor          : factor to split old_dim by
    // tail_strategy : strategy to handle tail or remainder loop if factor does not divide old_dim completely
    Copy to clipboard

The `split` directive splits a dimension by a given factor into an inner and
outer dimension.

The inner dimension iterates from 0 to `factor`-1. This is useful because
after the split, the inner and outer dimensions can be used with their own
independent scheduling directives. For example, it is possible to vectorize
the inner dimension while unrolling the outer dimension.

The following example of a `Func` simply adds 10 to each element (or pixel)
of its input.

Var xi{"xi"}, xo{"xo"};
    Expr factor;
    f(x, y) = input(x, y) + 10;
    f.split(x, xo, xi, factor);
    Copy to clipboard

Halide splits dimensions `x` and then generates the corresponding pseudo loop
nest

for (f.y = f.y.min; f.y < f.y.extent; ++f.y) {
      for (f.xo = f.x.min; f.xo < (f.x.extent/factor); ++f.xo) {
        for (f.xi = 0; f.xi < factor; ++f.xi) {
          f.x = (f.xo * factor) + f.xi;
          f(f.x, f.y) = input(f.x, f.y) + 10;
        }
      }
    }
    Copy to clipboard

### [vectorize](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id16)

vectorize(dim)
    vectorize(dim, factor)
    vectorize(dim, factor, tailstrategy)
    // dim           : dimension to be vectorized
    // factor        : factor to split dim by for vectorization. In other words, the vectorization factor (optional)
    // tail_strategy : strategy to handle tail or remainder loop if factor does not divide dim completely (optional)
    Copy to clipboard

The `vectorize` directive vectorizes the dimension, `dim`, by the
vectorization `factor`. `factor` and `tailstrategy` are optional. If both
are not specified, the entire `dim` is vectorized. If at least `factor` is
specified, `dim` is split into two dimensions by `factor`. The outer
dimension retains the name `dim` and the inner dimension remains unnamed.
This unnamed inner dimension is vectorized completely.

### [unroll](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id17)

unroll(dim)
    unroll(dim, factor)
    unroll(dim, factor, tailstrategy)
    // dim           : dimension to be unrolled
    // factor        : factor to split dim by for unrolling. In other words, the unroll factor (optional)
    // tail_strategy : strategy to handle tail or remainder loop if factor does not divide dim completely (optional)
    Copy to clipboard

The `unroll` directive unrolls the dimension, `dim`, by the unroll factor,
`factor`. `factor` and `tailstrategy` are optional. If both are not
specified, the entire `dim` is unrolled. If at least `factor` is specified,
`dim` is split into two dimensions by `factor`. The outer dimension retains
the name `dim` and the inner dimension remains unnamed. This unnamed inner
dimension is unrolled completely.

### [parallel](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id18)

parallel(dim)
    parallel(dim, task_size)
    parallel(dim, task_size, tailstrategy)
    // dim           : dimension whose iterations should be executed in parallel
    // task_size     : The size of each task in terms of number of iterations of the loop in dim (optional)
    // tail_strategy : strategy to handle tail or remainder loop if task_size does not divide dim completely (optional)
    Copy to clipboard

In the loop nest generated for a `Func`, the loop traversing the dimension,
`dim`, is executed in parallel. If `task_size` is specified, `dim` is
split into two dimensions by `task_size`. The outer dimension retains the
name `dim` and the inner dimension remains unnamed.

The iterations of the outer dimension, `dim`, are executed in parallel. If
`task_size` is not specified, each iteration of `dim` is executed in
parallel, effectively resulting in a `task_size` of 1.

### [fuse](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id19)

fuse(inner_dim, outer_dim, fused_dim)
    // inner_dim     : Inner dimension to be fused
    // outer_dim     : Outer dimension to be fused
    // fused_dim     : The name of the resulting fused dimension
    Copy to clipboard

The `fuse` directive merges the two dimensions, `inner_dim` and
`outer_dim`, into one dimension, `fused_dim`. Thus, `fused_dim` is the
product of the two dimensions.

### [tile](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id20)

tile(dim_0, dim_1, new_outer_dim_0, new_outer_dim_1, new_inner_dim_0, new_inner_dim_1, factor_dim_0, factor_dim_1)
    tile(dim_0, dim_1, new_outer_dim_0, new_outer_dim_1, new_inner_dim_0, new_inner_dim_1, factor_dim_0, factor_dim_1, tailstrategy)
    // dim_0           : First (inner) dimension to be split
    // dim_1           : Second (outer) dimension to be split
    // new_outer_dim_0 : The name of the new outer dimension created by splitting dim_0
    // new_outer_dim_1 : The name of the new outer dimension created by splitting dim_1
    // new_inner_dim_0 : The name of the new inner dimension created by splitting dim_0
    // new_inner_dim_1 : The name of the new inner dimension created by splitting dim_1
    // factor_dim_0    : split factor for splitting dim_0
    // factor_dim_1    : split factor for splitting dim_1
    // tailstrategy    : strategy to handle tail or remainder loop if either of factor_dim_0 and factor_dim_1 does not divide the corresponding dimension completely (optional)
    Copy to clipboard

The `tile` directive splits the two dimensions, `dim_0` and `dim_1`, and
reorders their splits to create a tile of computation for the `Func`.

This directive is shorthand for a recurring pattern of combining the `split`
and `reorder` directives as follows:

1. `dim_0` is split into two dimensions, `new_outer_dim_0` and
`new_inner_dim_0`, by factor `factor_dim_0`.
2. `dim_1` is split into two dimensions, `new_outer_dim_1` and
`new_inner_dim_1`, by factor `factor_dim_1`.
3. These dimensions are then reordered in the order (innermost to outermost):
`new_inner_dim_0`, `new_inner_dim_1`, `new_outer_dim_0`,
`new_outer_dim_1`.

These steps lead to a tiled pattern of execution of the loop nest. To
understand this process, consider the same pipeline from the previous example
where `f` is the result of adding 10 to each element of `input`.

Var xi{"xi"}, xo{"xo"};
    Var yi{"yi"}, yo{"yo"};
    Expr xfactor, yfactor;
    f(x, y) = input(x, y) + 10;
    f.tile(x, y, xo, yo, xi, yi, xfactor, yfactor);
    Copy to clipboard

Halide generates the corresponding pseudo loop nest.

for (f.yo = f.y.min; f.yo < f.y.extent/yfactor; ++f.yo) {
      for (f.xo = f.x.min; f.xo < (f.x.extent/xfactor); ++f.xo) {
        for (f.yi = 0; f.yi < yfactor; ++f.yi) {
          f.y = (f.yo * yfactor) + f.yi;
          for (f.xi = 0; f.xi < factor; ++f.xi) {
            f.x = (f.xo * xfactor) + f.xi;
            f(f.x, f.y) = input(f.x, f.y) + 10;
          }
        }
      }
    }
    Copy to clipboard

Note

Remember that a single loop is split, while a loop nest is tiled.

### [prefetch](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id21)

prefetch(func_or_image, dim)
    prefetch(func_or_image, dim, offset)
    prefetch(func_or_image, dim, offset, strategy)
    // func_or_image : Func or ImageParam to prefetch
    // dim           : dimension to prefetch over
    // offset        : iteration offset (optional)
    // strategy      : PrefetechBoundStrategy for generated code (optional)
    Copy to clipboard

The `prefetch` directive is used to prefetch data for a `Func` or ImageParam
(`func_or_image`). The directive inserts a prefetch to the L2 cache in the
generated code. The data is prefetched is from the buffer for `func_or_image`.
The size of the prefetched data is the size of `func_or_image` required in the
`dim` loop of the `Func` on which this directive was used. If `offset` is
specified, the prefetch is issued `offset` iterations of the `dim` loop in
advance. If `strategy` is not specified, the default prefetch is
`PrefetchBoundStrategy::GuardWithIf`.

The available prefetch strategies are:

- `PrefetchBoundStrategy::GuardWithIf` – Guard the prefetch operation with an
if statement to keep the prefetch within bounds.
- `PrefetchBoundStrategy::Clamp` – Clamp the computed bounds using minimum
and maximum values to keep the prefetch within bounds.
- `PrefetchBoundStrategy::NonFaulting` – Do not guard or clamp the region to
prefetch.

`PrefetchBoundStrategy::NonFaulting` assumes that prefetching out of bounds
does not cause a fault. A fault can be avoided if out-of-bound reads do not
cross a page boundary, or if the memory buffer was allocated with sufficient
padding to keep all accesses within allocated bounds.

Using the same pipeline as an example, wherein the `f` is the result of
adding 10 to each element of `input`:

Func f{"f"};
    Var x{"x"}, y{"y"};
    f(x, y) = input(x, y) + 10;
    f.prefetch(input, y, 2);
    Copy to clipboard

The prefetch directive will insert a prefetch to load data from `input` in the
`y` loop in the loop nest computing `f`. It will prefetch into the L2 cache
only as much data of `input` as is required in the `y` loop. Lastly, the
data prefetched is the data required in iteration `y` + 2.

Halide generates the corresponding pseudo loop nest.

for (f.y = f.y.min; f.y < f.y.extent; ++f.y) {
      if (y+2 < input.y.extent) {
        // Prefetch only one row of input, because
        // for one row of f only one row of input
        // is needed
        prefetch(&input(0, y+2))
      }
      for (f.x = f.x.min; f.x < f.x.extent; ++f.x) {
        f(f.x, f.y) = input(f.x, f.y) + 10;
      }
    }
    Copy to clipboard

### [compute_root](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id22)

compute_root()
    Copy to clipboard

The `compute_root` directive indicates that the `Func` it was called on
should be computed before all of its consumers.

Following is an example of the producer-consumer pipeline.

g(x, y) = x*y;
    f(x, y) = g(x, y) + g(x, y+1) + g(x+1, y) + g(x+1, y+1);
    Copy to clipboard

If no schedule is specified, Halide generates the corresponding pseudo loop
nest.

int height = f.y.extent;
    int width = f.x.extent;
    int f[height][width];
    for (int y = 0; y < height; ++y) {
      for (int x = 0; x < width; ++x) {
        f[y][x] = x*y + x*(y+1) + (x+1)*y + (x+1)*(y+1);
      }
    }
    Copy to clipboard

You do not see the computation of `g` in this example because `g` is now
considered to have been computed *inline*.

Now, schedule `g` as shown in the following example.

g(x, y) = x*y;
    f(x, y) = g(x, y) + g(x, y+1) + g(x+1, y) + g(x+1, y+1);
    g.compute_root();
    Copy to clipboard

Following is the loop nest that Halide will generate.

// Compute all of g before f
    int height = f.y.extent;
    int width = f.x.extent;
    int g[height+1][width+1];
    for (int y = 0; y < height+1; ++y) {
      for (int x = 0; x < width+1; ++x) {
        g[y][x] = x*y;
      }
    }
    for(int y = 0; y < height; ++y) {
      for (int x = 0; x < width; ++x) {
        f[y][x] = g[y][x] + g[y+1][x] + g[y][x+1] + g[y+1][x+1];
      }
    }
    Copy to clipboard

In the two Offload modes (see [Halide execution modes](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#executionmodes)), the `compute_root`
directive can be used to schedule some `Funcs` (stages) on the CPU.

In the following example, the same pipeline has a slight change to use the
offload directive, `hexagon`.

g(x, y) = x*y;
    f(x, y) = g(x, y) + g(x, y+1) + g(x+1, y) + g(x+1, y+1);
    g.compute_root();
    Copy to clipboard

Following is the loop nest that Halide will generate.

// Compute all of g before f
    // so, compute g before going to Hexagon
    // to compute f
    int height = f.y.extent;
    int width = f.x.extent;
    int g[height+1][width+1];
    for (int y = 0; y < height+1; ++y) {
      for (int x = 0; x < width+1; ++x) {
        g[y][x] = x*y;
      }
    }
    // compute f on Hexagon.
    for<Hexagon>(int y = 0; y < height; ++y) {
      for (int x = 0; x < width; ++x) {
        f[y][x] = g[y][x] + g[y+1][x] + g[y][x+1] + g[y+1][x+1];
      }
    }
    Copy to clipboard

In this example, `g` is computed before moving on to the DSP; that is, `g`
is computed on the CPU (host processor). Halide generates the code required to
copy the data of `g` to the DSP for it to be used in the computation of `f`.

### [compute_at](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id23)

compute_at(consumer, dim)
    // consumer  : consumer Func
    // dim       : dim of the consumer where this Func should be computed
    Copy to clipboard

The `compute_at` directive is used on a `Func` that produces data to be
consumed by the `consumer` (not necessarily the only consumer). `compute_at` indicates
that the producer should be computed at the loop traversing the dimension,
`dim`, of `consumer`.

The memory allocated for the producer and the computed elements are both
adjusted to match the requirements of `consumer` at `dim`. That is, the
`compute_at` directive is used to specify *when* a producer should be
computed relative to its consumer. The Halide compiler infers how much of
the producer is to be computed.

The following example schedules `g`.

g(x, y) = x*y;
    f(x, y) = g(x, y) + g(x, y+1) + g(x+1, y) + g(x+1, y+1);
    g.compute_at(f, y);
    f.hexagon();
    Copy to clipboard

Halide generates the corresponding pseudo loop nest.

int height = f.y.extent;
    int width = f.x.extent;
    int f[height][width];
    for<Hexagon> (int y = 0; y < height; ++y) {
      // Only two rows of g are needed
      // per row of f. So allocate only
      // two rows to store g. Number of
      // columns of g needed for one row
      // of f is width+1.
      int g[2][width+1];
      for (int x = 0; x < width; ++x) {
        g[0][x] = x*y;
        g[1][x] = x*(y+1);
      }
      for (int x = 0; x < width; ++x) {
        f[y][x] = g[0][x] + g[1][x] + g[0][x+1] + g[1][x+1];
      }
    }
    Copy to clipboard

Because this example scheduled `g` in the `y` loop over `f`, schedule `g` will be computed on the Hexagon DSP.

### [store_root](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id24)

store_root()
    Copy to clipboard

This directive says that the memory for the `Func` it was called on
should be allocated at the outermost level.

Consider the following producer-consumer pipeline that shows how to use
`store_root` and `compute_at` together.

g(x, y) = x*y;
    f(x, y) = g(x, y) + g(x, y+1) + g(x+1, y) + g(x+1, y+1);
    // Tell Halide to allocate storage for all of g at the
    // outermost loop level, but still compute it per row
    // (y) of f.
    g.store_root().compute_at(f, y);
    f.hexagon();
    Copy to clipboard

Halide generates the corresponding pseudo C/C++ code.

int height = f.y.extent;
    int width = f.x.extent;
    int f[height][width];
    int g[height+1][width+1];
    for (int y = 0; y < height; ++y) {
       if(y==0)
         g[y][x] = x*y;
       g[y+1][x+1] = (x+1)*(y+1);
       for (int x = 0; x < width; ++x) {
         f[y][x] = g[y][x] + g[y][x+1] + g[y+1][x] + g[y+1][x+1];
       }
     }
    Copy to clipboard

As shown in this example, storage for all of `g` is allocated at the top
level. However, even if two rows of `g` are required per row (y coordinate)
of `f`, only one row of `g` is calculated (except at the top edge when
`y` is 0). At every row of `f`, one row of `g` is reused because storage
for all of `g` has been allocated. This way, redundant computations are
reduced at the cost of an increased memory footprint.

### [store_at](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id25)

store_at(consumer, dim)
    // consumer  : consumer Func
    // dim       : dim of the consumer where storage for this Func should be
    //             allocated
    Copy to clipboard

The `store_at` directive is used on a `Func` that produces data (producer)
that is consumed by the `consumer` (not necessarily the only consumer). This directive
indicates that storage for the producer should be allocated at the loop
traversing the dimension, `dim`, of `consumer`. The amount of memory
allocated for the producer is adjusted to match the requirements of
`consumer` at `dim`.

That is, this directive is used to specify *when* memory for a producer should
be allocated relative to its consumer. The Halide compiler infers how much memory is
to be allocated for the producer.

The following producer-consumer pipeline shows how to use `store_at` and
`compute_at` together.

g(x, y) = x*y;
    f(x, y) = g(x, y) + g(x, y+1) + g(x+1, y) + g(x+1, y+1);
    // Tell Halide to allocate storage for g at the y loop
    // over f, but compute it per x coordinate of f.
    g.store_at(f, y).compute_at(f, x);
    f.hexagon();
    Copy to clipboard

Halide generates the corresponding pseudo C/C++ code.

int height = f.y.extent;
    int width = f.x.extent;
    int f[height][width];
    int g[height+1][width+1];
    for (int y = 0; y < height; ++y) {
      // Memory is allocated to hold only 2 rows of g
      // because 2 rows of g are needed per row (y coordinate)
      // of f.
      int g[2][width+1];
      for (int x = 0; x < width; ++x) {
        g[0][x] = x*y;
        g[0][x+1] = (x+1)*y;
        g[1][x] = x*(y+1);
        g[1][x+1] = (x+1)*(y+1);
        g[2][x+1] = (x+1)*(y+1);
        f[y][x] = g[y][x] + g[y][x+1] + g[y+1][x] + g[y+1][x+1];
      }
    }
    Copy to clipboard

### [hexagon_user_dma](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id26)

New in version 2.3.0.

hexagon_user_dma()
    Copy to clipboard

The `hexagon_user_dma` directive is used on a `Func` that is a simple copy.
When used on such a stage, the copy is performed using User DMA, first
introduced in Hexagon ISA v68. Following is a typical example of this scheduling
directive.

Func f{"f"};
    Input<Buffer<uint8_t>> input{"input", 2};
    Var x{"x"}, y{"y"};
    
    f(x, y) = input(x, y);
    f.hexagon_user_dma();
    Copy to clipboard

`hexagon_user_dma` is a specialization of `halide_buffer_copy` (see
`Halide/include/HalideRuntime.h`).

The source and destination buffers can have different shapes when the shape of
the destination buffer is completely contained inside the source buffer. This
means the `min` of each dimension of the destination buffer must be greater than
or equal to the `min` of the corresponding dimension of the source buffer.

Also, `min + extent` of each dimension of the destination buffer must be less
than or equal to the `min + extent` of the corresponding dimension of the
source buffer.

To use `hexagon_user_dma`, the `hvx_v68` target feature must be specified
in the target for the generator ([Ahead-of-time compilation](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#aheadoftimecompilation)).

### [align_storage](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id27)

align_storage(dim, alignment)
    // dim       : dimension to be aligned
    // alignment : alignment of dimension in number of elements
    Copy to clipboard

Performance on the DSP, especially HVX, is sensitive to the alignment of data.
The `align_storage` directive pads up the extent of storage of the dimension,
`dim`, to be an exact multiple of the alignment. This means the dimensions
stored outside of `dim` have strides that are multiples of `alignment`.
Strides and alignment are measured in the number of elements.

For example, the following schedule ensures that each row (scanline) of `g`
will start at an offset that is a multiple of 16 elements from the base address
of the buffer for `g`.

g(x, y) = x*y;
    f(x, y) = g(x, y) + g(x, y+1) + g(x+1, y) + g(x+1, y+1);
    // Dimensions outside of x, that is the y dimension of
    // g will start at an offset that is a multiple of 16
    // elements from the start of g.
    g.align_storage(x, 16);
    Copy to clipboard

### [TailStrategy](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id28)

`TailStrategy` is not a scheduling directive. However, several scheduling
directives, either explicitly or implicitly, split a dimension in two: an inner
and an outer dimension. For example, `split` explicitly splits a dimension,
while `vectorize` and `parallel` implicitly split a dimension. In terms of
C/C++ code, this means splitting the loop that traverses a dimension into two
nested loops.

An important aspect of any such splitting is what should be done when the split
factor does not fully divide the extent of the dimension. The so-called
remainder loop can be handled in several different ways, and all such
directives (such as `vectorize`, `parallel`, and `split`) take one last
optional parameter called `TailStrategy`, which defines how the remainder
loop should be handled. Following are the different types of `TailStrategy`:

- `TailStrategy::RoundUp`
    - Rounds up the extent to be a multiple of the split factor.

The benefit is that when vectorizing, Halide ensures that even the remainder
loop is vectorized.

The drawback, however, is that when this strategy is used to split the
dimension of a stage that reads or writes to an external buffer, it assumes
the input or output image size is a multiple of the split factor. If the
size is not a multiple of the split factor, the image is accessed out of
bounds. This case causes the application to fault unless the image
allocation is sufficiently padded to allow for the out-of-bounds access.

- `TailStrategy::GuardWithIf`
    - Guards the inner loop with an if condition and thereby ensures that Halide
never evaluates beyond the extent of the dimension.

This strategy is always legal and does not constrain the size of the input
or the output. However, the drawback is that it inserts a conditional into
the inner loop, and, in the remainder case, vectorization generates scalar
code.

- `TailStrategy::ShiftInWards`
    - Ensures that Halide does not evaluate beyond the extent of the original
dimension by shifting the remainder case inwards.

This strategy is always legal. If it is used on a stage that reads or
writes to an external buffer, it only requires that the extent of the
buffer be at least as large as the split factor.

Like `TailStrategy::RoundUp`, this strategy also supports vectorization,
because the inner loop will always be split-factor wide with no
conditionals in it. However, some values are redundantly computed near the
end of the dimension.

- `TailStrategy::Auto`
    - For pure definitions, this strategy implies using `TailStrategy::ShiftInwards`.

For pure `Vars` in update definitions, it implies
`TailStrategy::RoundUp`. For `RVars` in update definitions, it implies
`TailStrategy::GuardWithIf`.

## [Memory](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id29)

Halide pipelines work on memory buffers. However, they do not accept raw
pointers to data. Instead, all Halide pipelines accept and operate on instances
of `halide_buffer_t`.

`halide_buffer_t` is a C-style structure. `Halide::Runtime::Buffer` is a
utility C++ class that allows you to easily initialize and set up a
`halide_buffer_t` object. The rest of this section refers to them
interchangeably as buffers. Depending on the scope and lifetime of buffers,
they are primarily classified into two types: external and internal.

### [External Buffers](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id30)

When a Halide pipeline is compiled, the generated interface accepts buffers
(pointers to `halide_buffer_t` objects) as input and output data. Such
buffers are called external buffers because you or the caller of the pipeline
allocate and initialize their memory outside of the pipeline.

The utility class, `Halide::Runtime::Buffer`, encapsulates a
`halide_buffer_t` object and provides convenient methods for managing the
`halide_buffer_t` object inside.

### [Internal Buffers](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id31)

There are buffers to store data of the various `Funcs` inside a pipeline.
These buffers are created and used only inside the pipeline. The Halide runtime
manages the memory for these buffers in a manner that is transparent to you.

The size of these buffers is determined by the pipeline schedule. For example,
if a `Func` scheduled `compute_at` instead of `compute_root`, only part
of the entire domain of the `Func` might be required in memory at any time.
This means that where the `Func` is scheduled in the pipeline
(`compute_at`/`compute_root`) plays a factor in how much memory is
allocated.

### [Device Interface](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id32)

An object of `halide_buffer_t` is a raw representation of an image. It is
the data that a Halide pipeline operates on and produces. It holds information
such as the location of the data held by the buffer in main memory, the number
of dimensions of the buffer, and the size of each dimension. It also holds
information that tracks whether the buffer is in main memory or on a peripheral
device (like a GPU or the Hexagon DSP).

Each device (peripheral) target provides an interface
(`halide_device_interface_t`) that manages device allocations. It provides an
API to allocate buffer memory on the device, and to copy data from the host/CPU
to the device and back.

The `halide_device_interface_t` provided by the Hexagon DSP can be retrieved
by calling the function, `halide_hexagon_device_interface()`.
Examples are in the following chapter.

## [Ahead-of-time compilation](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id33)

Halide pipelines can be executed in just-in-time (JIT) compilation mode as well
as compiled in the more traditional ahead-of-time (AOT) mode before execution.
Because we recommend AOT compilation for Halide on HVX, this section discusses
only ahead-of-time compilation.

Typically, AOT compilation in Halide is a two-step process:

1. The generator is compiled into a generator executable using the native
compiler (for example, x86 compiler).
2. The generator executable is like a cross-compiler: when run, it invokes the
Halide compiler and compiles your Halide pipeline into an object file or a
static library, depending on the options passed to the generator.

### [Generators](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#id34)

Generators in Halide are a more structured way to do AOT compilation of Halide
pipelines. A `Generator` is a C++ class that is used to define a Halide
pipeline. By having your generator inherit from `Halide::Generator`, you get
a uniform interface to define and schedule the pipeline.

When a generator is compiled with `GenGen.cpp`, it is provided as part of the
Halide installation. You get a generator executable that accepts various
command line options and produces object code for the pipeline defined in the
generator.

In the following example, a pipeline replaces each pixel with the maximum pixel
in the 3x3 block around it.

#include "Halide.h"
    
    using namespace Halide;
    
    class Dilate3x3 : public Generator<Dilate3x3> {
    public:
        // Takes an 8 bit image; one channel.
        Input<Buffer<uint8_t>> input{"input", 2};
        // Outputs an 8 bit image; one channel.
        Output<Buffer<uint8_t>> output{"output", 2};
    
        GeneratorParam<bool> use_parallel_sched{"use_parallel_sched", true};
        GeneratorParam<bool> use_prefetch_sched{"use_prefetch_sched", true};
    
        void generate() {
          Expr height = input.height();
          bounded_input(x, y) = BoundaryConditions::repeat_edge(input)(x, y);
          max_y(x, y) = max(bounded_input(x, clamp(y-1, 0, height-1)),
          output(x, y) = max(max_y(x-1, y), max_y(x, y), max_y(x+1, y));
        }
    
        void schedule() {
          if (get_target().has_feature(Target::HVX)) {
            Expr ht = output.dim(1).extent();
            bounded_input
               .compute_at(Func(output), y)
               .align_storage(x, 128)
               .vectorize(x, vector_size*2, TailStrategy::RoundUp);
            output
               .hexagon()
               .split(y, yo, y, ht/2)
               .tile(x, y, xi, yi, vector_size*2, 8, TailStrategy::RoundUp)
               .vectorize(xi)
               .unroll(yi);
            if (use_prefetch_sched) {
              output.prefetch(input, y, 2);
            }
            if (use_parallel_sched) {
              output.parallel(yo);
            }
          }
        }
    private:
         Var x{"x"}, y{"y"}, yo{"yo"};
         Func max_y{"max_y"};
         Func bounded_input{"bounded_input"};
    };
    HALIDE_REGISTER_GENERATOR(Dilate3x3, dilate3x3)
    Copy to clipboard

The inputs and the outputs of the pipeline are declared as public members of
the pipeline and are instances of `Input` and `Output`. They will appear in
the signature of the generated function in the order in which they are declared.

To construct the pipeline, the `generate` method must be defined. To schedule
the constructed pipeline, the `schedule` method must be defined. Compile them
with `GenGen.cpp` provided with the Halide installation. `GenGen.cpp`
provides boilerplate code that accepts options on the command line to compile
your pipeline. Register the pipeline with this boilerplate code by adding the
following line:

HALIDE_REGISTER_GENERATOR(Dilate3x3, dilate3x3)
    Copy to clipboard

This line tells the boilerplate code that `Dilate3x3` defines the class that
is a Generator. Once the class is compiled, it can be identified on the command
line as `dilate3x3`. After compiling this line with `GenGen.cpp`, run the
resulting executable to generate an object file that you can link with your
application code.

`use_prefetch_sched` and `use_parallel_sched` are parameters that can be
passed to the generator on the command line. The following example shows how to
compile and run a generator.

$(HOST_CXX) -std=c++11 -I $(HALIDE_ROOT)/include -stdlib=libc++ -O3 -g -fno-rtti -rdynamic dilate3x3_generator.cpp $(HALIDE_ROOT)/lib/libHalide.a $(HALIDE_ROOT)/tools/GenGen.cpp
      -o dilate3x3.generator  -lz -lrt -ldl -lpthread -lm
      ./dilate3x3.generator -g dilate3x3 -o ./output_dir -e o,h -f dilate3x3 target=arm-64-android-hvx use_parallel_sched=true use_prefetch_sched=true
    Copy to clipboard

Once this code is run, you will see `dilate3x3.o` and `dilate3x3.h` in the
output directory, `output_dir`.

Following are the command line options that a generator accepts:

- `-g generator_name`
    - A `Generator` file can have multiple generators, and this option selects
which one to run (identified by the second argument to
`HALIDE_REGISTER_GENERATOR`).

- `-o directory`
    - Specifies the directory in which to create the outputs.

- `-e csv-list-of-outputs`
    - A list of comma-separated values that specify the types of output to create.

Acceptable values are `static_library`, `object`, `c_header`,
`assembly`, `bitcode`, `stmt`, and `stmt_html`, where:

- `assembly` generates assembly equivalent to the generated object file.
- `bitcode` generates LLVM bitcode for the pipeline.
- `stmt` generates human-readable pseudo code for the pipeline.
- `stmt_html` generates an HTML version of the pseudo code, which can be
better to read than the raw STMT file.

- `target=<target>`
    - The target to compile for.

Typically, the target is a hyphenated string of the form,
`arch-bits-OS-feature1-feature2..`, where:

- `arch` can be one of the following: `arch_unknown`, `arm`,
`hexagon`, `mips`, `powerpc`, `riscv`, `wasm`, or `x86`.
- `bits` is either 32 or 64.
- `OS` can be one of the following: `android`, `fuchsia`, `ios`,
`linux`, `noos`, `os_unknown`, `osx`, `qurt`, `wasmrt`, or
`windows`.
- If `arch`, `bits`, or `OS` is omitted, the default target is to the host.

The Halide compiler provides several features to consider from an HVX
perspective:

- `hexagon_dma`, `hvx`, `hvx_shared_object`,
`hvx_sysmon`, `hvx_v66`, and `hvx_v68`.
- Also, `debug` and `profile` are useful.
- `no_asserts` can be used to disable the injection of runtime asserts in the
pipeline code. However, we do not recommend this use because asserts are
lightweight and add negligible overhead in terms of execution time.
- The target can begin with `host`, which sets the host’s architecture,
operating system, and feature set. For example, `host-hvx` is a valid
target that will ensure execution in Simulator Offload mode.

The target string implies the execution mode for a Halide-for-HVX
pipeline ([Halide execution modes](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#executionmodes)). The following table provides sample
Halide-for-HVX target strings.

| Target string | Implied mode |
| --- | --- |
| `arm-64-android-hvx` | Device Offload mode with HVX support |
| `arm-64-linux-hvx` | Device Offload mode with HVX support on LE devices |
| `arm-64-android-hvx-hvx_v66` | Device Offload mode with HVX support and Hexagon V66 ISA |
| `arm-64-android-hvx-debug` | Device Offload mode with HVX support and debug logs |
| `host-hvx` | Simulator Offload mode with  HVX support |
| `hexagon-32-noos-hvx` | Simulator Standalone mode with HVX support |
| `hexagon-32-qurt-hvx` | Device Standalone mode with HVX support |

For more information about target features relevant to the Hexagon DSP, see
[Relevant target features for HVX](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#relevanttargetfeaturesforhvx).

Last Published: Jul 08, 2024

[Previous Topic
Getting started](https://docs.qualcomm.com/bundle/publicresource/80-PD002-1/topics/getting_started.md) [Next Topic
Halide for HVX](https://docs.qualcomm.com/bundle/publicresource/80-PD002-1/topics/halide_for_hvx.md)