# Code optimization

The LLVM compilers provide many tools and features for improving the
size or speed of the generated object code.

Note

We strongly recommend that you try using the various code
optimizations to improve the performance of your program. Using only
the default optimization settings might result in suboptimal
performance.

## Optimize for performance

LLVM currently generates the fastest code when compiling for Arm
mode.

**Table 5-1 Options to use for optimizing code performance**

| **Core** | **Options** |
| --- | --- |
| Armv7 | -Ofast -mcpu=krait |
| Armv8 (AArch32) | -Ofast -mcpu=cortex-a57 |
| Armv8 (AArch64) |  |

For more information on -Ofast, see [Parallelization](https://docs.qualcomm.com/doc/80-VB419-99/topic/use_the_compilers.html#sec-parallelization).

## Optimize for code size

LLVM currently generates the smallest code when compiling for Thumb-2 mode.

Note

Thumb-2 is available only on Armv6-T2 core, Armv7 core, and AArch32.

**Table 5-2 Options to use for optimizing code size**

| **Core** | **Options** |
| --- | --- |
| Armv7 | -Osize -mcpu=krait |
| Armv8 (AArch32) | -Os -mcpu=cortex-a57 |
| Armv8 (AArch64) |  |

For the Armv7 core, the -Osize option is preferred over -Os because
it enables additional optimizations for code size. For more
information on -Osize, see [Optimization](https://docs.qualcomm.com/doc/80-VB419-99/topic/use_the_compilers.html#sec-optimization).

## Automatic vectorization

LLVM includes support for automatic code vectorization. By default,
the vectorizer is enabled at code optimization level -O2 or higher.
To enable it at lower optimization levels, use the `-fvectorize-loops`
option ([Vectorization](https://docs.qualcomm.com/doc/80-VB419-99/topic/use_the_compilers.html#sec-vectorization)).

Vectorization can be used at any code optimization level higher than
-O0. To see which loops in a program get vectorized, use the
following option:

`-fvectorize-loops-debug`

Vectorization works only with the Armv7 or Armv8 processor
architecture with the NEON extension. NEON is enabled either
implicitly (by specifying a CPU that has this extension with the
-mcpu flag) or explicitly with -mfpu=neon.

The following example is a loop that can be vectorized with
`-fvectorize-loops`:

    void foo(int * restrict A, int N) {
        for (int i = 0; i < N; i++)
            A[i] = A[i] + 1;
    }

For vectorization of floating point computation, the GCC option
`-ffast-math` should be specified. Because floating point
vectorizations (reductions in particular) are not IEEE compliant, the
fast math option is required to ensure maximum vectorization of
floating point computations.
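The reduction case can be pictured with a small sketch. The function names `sum_array` and `sum_array_reassociated` are hypothetical and are not part of the toolchain. The first function adds the elements strictly in source order. The second keeps four partial sums, which is roughly the shape a vectorized reduction takes, and it reorders the additions. That reordering is what strict IEEE semantics forbid and what `-ffast-math` permits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: a floating-point sum reduction in source order. */
float sum_array(const float *a, size_t n) {
    assert(a != NULL || n == 0);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Hand-written illustration of a reassociated reduction: four partial
 * sums are accumulated independently and combined at the end. This is a
 * sketch of the idea, not the code the compiler actually emits. */
float sum_array_reassociated(const float *a, size_t n) {
    assert(a != NULL || n == 0);
    float partial[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        partial[0] += a[i];
        partial[1] += a[i + 1];
        partial[2] += a[i + 2];
        partial[3] += a[i + 3];
    }
    float sum = (partial[0] + partial[1]) + (partial[2] + partial[3]);
    for (; i < n; i++)      /* remainder iterations */
        sum += a[i];
    return sum;
}
```

For inputs whose sums are exactly representable the two functions agree. For general inputs the rounding can differ, which is why the reordered form is only generated when fast math is requested.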

The vectorizer can also be enabled using the -ftree-vectorize option,
which is an alias for `-fvectorize-loops`. The vectorizer currently
operates only on the innermost loop of a nested loop.

Note

The GCC option -ftree-vectorizer-verbose (for printing out
verbose information on a vectorized loop) is not supported.
Instead, use `-fvectorize-loops-debug`.

## Automatic parallelization

The Qualcomm LLVM compilers include support for automatic code
parallelization. By default, parallelization is disabled; to enable
it, use the -fparallel option ([Parallelization](https://docs.qualcomm.com/doc/80-VB419-99/topic/use_the_compilers.html#sec-parallelization)).

Parallelization can be used only with code optimization level -O2,
-O3, -O4, or -Ofast.

Automatic code parallelization enables selected loops to be executed
in parallel for faster performance. During parallelization, if a loop
is determined to be free of any data, control, or memory
dependencies, it is then split into multiple loops, each of which
performs part of the work from the original loop. The resulting loops
are dispatched to work queues on separate cores so they can be
executed in parallel.
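As a rough illustration of the dependency requirement, the sketch below contrasts a loop whose iterations are independent with one that carries a dependency between iterations. All function names are hypothetical. `scale_split` shows the conceptual result of splitting the loop into two parts. Here the parts run one after the other, whereas the parallelizer would dispatch them to work queues on separate cores.

```c
#include <assert.h>
#include <stddef.h>

/* Independent iterations: out[i] depends only on in[i]. */
void scale(int *restrict out, const int *restrict in, size_t n, int k) {
    assert(out != NULL && in != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * k;
}

/* Conceptual split of scale() into two sub-loops, each doing part of the
 * original work. This sketch runs them sequentially. */
void scale_split(int *restrict out, const int *restrict in, size_t n, int k) {
    assert(out != NULL && in != NULL);
    size_t half = n / 2;
    for (size_t i = 0; i < half; i++)   /* first part of the work */
        out[i] = in[i] * k;
    for (size_t i = half; i < n; i++)   /* second part of the work */
        out[i] = in[i] * k;
}

/* Loop-carried dependency: iteration i reads the value written by
 * iteration i - 1, so the iterations cannot simply be split. */
void prefix_sum(int *a, size_t n) {
    assert(a != NULL || n == 0);
    for (size_t i = 1; i < n; i++)
        a[i] += a[i - 1];
}
```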

Parallelization requires a runtime component that is linked into the
final executable image. The purpose of the component is to initialize
a new thread at program initialization time, and subsequently to
manage the work queues during parallel execution.

While automatic code parallelization can significantly improve
overall performance by distributing work across multiple cores, it
accomplishes this by putting otherwise underutilized cores to use.
Because other cores are used, performance becomes a function of the
entire system, and is not fully determinable at compile time. Thus it
is possible for performance to improve, but also for the net
performance to decline. Although the threads maintain the cores in a
power-saving mode when they are not working, the additional work that
is done in parallel can increase the overall power usage. For this
reason, automatic code parallelization is not enabled by default in
the compiler, and its use must be evaluated on a case-by-case basis.

## Merge functions

LLVM includes support for function merging. By default, this
optimization is disabled; to enable it, use the -fmerge-functions
option ([Code generation](https://docs.qualcomm.com/doc/80-VB419-99/topic/use_the_compilers.html#sec-code-generation)).

Function merging attempts to improve code size by merging functions
that are equivalent or differ in only a few instructions. The
optimization uses a number of heuristics to determine whether it is
worthwhile to merge a pair of functions. For instance, very small
functions or functions with significant differences are usually not
merged.

The following example shows how function merging works:

    int f1(int a, int b) {
       int x;
       x = a + 4;
       return x * b;
    }

    int f2(int a, int b) {
       int x;
       x = a + 10;
       return x * b;
    }

Function merging determines that functions f1 and f2 are similar, and
replaces them with the following functions:

    int f1_merged(int a, int b, int choice) {
       int x;
       if (choice)
          x = a + 10;
       else
          x = a + 4;
       return x * b;
    }
    int f1(int a, int b) {
       return f1_merged(a, b, 0);
    }
    int f2(int a, int b) {
       return f1_merged(a, b, 1);
    }

This example is for illustration purposes only. In practice, the
optimizer would determine that functions f1 and f2 are too small to
be worth merging.

Note

Because function merging might have a negative impact on
program performance, it is disabled by default, and becomes enabled
only when it is specified explicitly.

## Link-time optimization

Link-time optimization (LTO) comprises a set of powerful
inter-modular optimizations that are performed during the linking
stage of compilation.

LTO expands the scope of optimizations from individual modules to the
entire program (or at least to all the modules visible at link time).
This enables deeper compiler analysis (such as better alias analysis)
and more effective code transformations (such as function inlining),
which can result in improved performance and code size.

When used with -c, the -flto option produces a file containing the
LLVM compiler’s intermediate representation (also known as
*bitcode*). This file can be subsequently used in a final link step
that then performs inter-module code optimizations on the file
contents.

LTO comprises the following elements:

- The link-time optimizer, a compiler feature (controlled with -flto)
that performs the inter-modular optimizations while linking the files
together.
- The LTO-specific attribute lto\_preserve, which, when applied to a C
or C++ function or variable, prevents it from being discarded by the
link-time optimizer.
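The following sketch shows one way the `lto_preserve` attribute might be applied. The GNU-style spelling `__attribute__((lto_preserve))` is an assumption, because the text above names the attribute but not its syntax. The `LTO_PRESERVE` macro, `plugin_entry`, and `plugin_version` are hypothetical names. The guard makes the macro expand to nothing on compilers that do not recognize the attribute.

```c
#include <assert.h>

/* Assumption: the attribute uses the GNU __attribute__ spelling.
 * The guard keeps the code portable when the attribute is unknown. */
#if defined(__has_attribute)
#  if __has_attribute(lto_preserve)
#    define LTO_PRESERVE __attribute__((lto_preserve))
#  endif
#endif
#ifndef LTO_PRESERVE
#  define LTO_PRESERVE /* attribute not available: expands to nothing */
#endif

/* Hypothetical entry point with no caller visible at link time (for
 * example, one that is looked up by name at run time). Marking it asks
 * the link-time optimizer not to discard it. */
LTO_PRESERVE int plugin_entry(int x) {
    assert(x >= 0);   /* hypothetical precondition */
    return x + 1;
}

/* The attribute can likewise be applied to a variable. */
LTO_PRESERVE int plugin_version = 3;
```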

The Snapdragon LLVM Arm linker has been verified to support LTO on
Armv7 and Armv8 targets, and Linux and Windows hosts. The GNU Gold
linker might support LTO for Armv8 targets, depending on the GCC
toolchain/sysroot version used. LTO is not supported on Windows using
the Gold linker.

For more information on the Snapdragon LLVM Arm linker, see the
*Qualcomm Snapdragon LLVM Arm Linker User Guide* (80-VB419-10).

For more information on the Gold linker, go to:
[llvm.org/docs/GoldPlugin.html](http://llvm.org/docs/GoldPlugin.html)

### Link-time optimizer

The link-time optimizer is invoked with the following command:

`clang -flto <input_files...>`

The optimizer takes several LLVM bitcode files or archives as input. It then
links the specified files together, performs the specified
inter-modular optimizations on them as a whole, and finally generates
a single assembly file containing the optimized result.

An important optimization that the optimizer performs is the
aggressive removal of any functions that it determines are not used.
To provide the optimizer with a larger context for determining if a
function is used, the list of filenames can include additional
non-bitcode objects and archives. The optimizer will use the symbol
information in these files to determine if a function should be
preserved.

Note

The optimizer requires archives to be homogeneous. The
members of a given archive must be either all bitcode files or all
object files.

## Profile-guided optimization

<abbr title="Profile-guided optimization">PGO</abbr> is a two-step process:

1. A program is first executed to collect profile information on it.

2. The program is then recompiled, this time using the collected profile
information to improve the code optimization that can be performed on
the program.

The availability of accurate source code profile information enables
the compiler to generate better optimized code. The compiler can
apply high-performance optimizations that are costly in terms of code
size or compile time at the profile-identified hot spots, while limiting
adverse code generation trade-offs to pathways that are relatively
cold.
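To make the hot and cold distinction concrete, consider a sketch like the following. The function names are hypothetical. Nothing in the source tells the compiler how often the negative branch is taken. A profile collected on representative input would show that the loop is hot and, for typical data, that the negative branch is rarely executed.

```c
#include <assert.h>
#include <stddef.h>

/* Rarely executed path for typical input (cold). */
static long handle_negative(int v) {
    return -(long)v;
}

/* Frequently executed loop (hot). Which branch dominates is only known
 * from profile data, not from the source code alone. */
long sum_magnitudes(const int *data, size_t n) {
    assert(data != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] < 0)
            total += handle_negative(data[i]);  /* cold for typical input */
        else
            total += data[i];                   /* hot for typical input */
    }
    return total;
}
```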

PGO can use two different kinds of profile information:
instrumentation-based profiling and sampling-based profiling.

Each method offers distinct advantages and disadvantages when
performing PGO. However, both provide the compiler with useful
information for improving code optimization.

PGO uses the same compile options that are described at:
[clang.llvm.org/docs/UsersManual.html#profile-guided-optimization](http://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization)

### Instrumentation-based PGO

The instrumentation-based approach to PGO relies on a special build
of your code, which inserts instrumentation that generates the
appropriate profile information. The resulting information can be
used for PGO during a subsequent build.

An instrumented binary has extra runtime overhead and executes more
slowly than normal, but the generated profile information still
accurately reflects the code’s uninstrumented execution.

The following procedure explains how to perform instrumentation-based
PGO.

1. Build the instrumented application.

    Use the compile option, -fprofile-instr-generate, to compile and link
the application code. For example:

        clang++ -O2 -fprofile-instr-generate source.cc -o application

    `-fprofile-instr-generate` optionally accepts a filename argument that
specifies the name and location of the raw profile data file to be
created. Otherwise, the file will be created with the default name
and location.
2. Generate profile information.

    Run the built application on your device to generate the profile
information. For example, to run this application on Android, enter
the following commands:

        HOST$ adb push application /data/local/tmp
        HOST$ adb shell
        DEVICE$ cd /data/local/tmp
        DEVICE$ ./application

    This command sequence creates the raw profile data file, */sdcard/default.profraw*.
3. Convert the profile information.

    Generate profile information in one of two ways:

    - Run the instrumented program once, which results in a single set of
profile information
    - Run the instrumented program several times with different input data,
which results in several sets of profile information.

    In either case, the collected raw profiles must be converted to a
file format profile that is compatible with the Snapdragon LLVM
version of PGO. To do this, use the LLVM tool llvm-profdata and its
*merge* functionality. For example:

        llvm-profdata merge -output=application.profile dataset-1.profraw dataset-2.profraw

    This example inputs two raw profile files (dataset-1.profraw and dataset-2.profraw),
merges their contents, converts the merged profiles to a format usable in PGO, and
writes the merged data to the file, application.profile. The raw profile data file
can be merged with an existing merged profile data file or with multiple profile
data files that have already been merged.

Note

The merge step is required even if you only have a single profile file.
4. Rebuild the application using PGO.

    Enable PGO in your application builds by using the profile data
generated in the previous step. For example:

        clang++ -O3 -fprofile-instr-use=application.profile source.cc -o application

Note

PGO profiles can be used at any code optimization level,
and with any other compile option ([Profile resiliency](https://docs.qualcomm.com/doc/80-VB419-99/topic/code_optimization.html#sec-profile-resiliency)).

### Sampling-based PGO

The sampling-based approach to PGO requires two external tools to set
up the profile information:

- Profile generator: Linux perf profiler ([perf.wiki.kernel.org](https://perf.wiki.kernel.org/))
- Profile converter: autofdo ([github.com/google/autofdo](http://github.com/google/autofdo))

The file format for sample-based profile information is described at:
[clang.llvm.org/docs/UsersManual.html#sample-profile-format](http://clang.llvm.org/docs/UsersManual.html#sample-profile-format)

Any profile generator or converter tool that can work with this file
format can be used instead of these tools.

Sample-based profiling has less runtime overhead than
instrumentation-based profiling. However, its effectiveness tends to
be directly proportional to the number of samples collected. Thus,
obtaining more accurate sampled profile information requires
collecting larger amounts of sampled profile data.

The following steps explain how to use Linux perf and autofdo to
perform sampling-based PGO.

1. Build the application.

    Use the compile option, -gline-tables-only. For example:

        clang++ -gline-tables-only -O2 source.cc -o application

    The application must be compiled with `-gline-tables-only` (or `-g`) to
ensure that the profile information maps accurately back to the
source code.
2. Generate profile information.

    Use the profile generator, perf, to collect the profile information.
For example:

        perf record -e cycles -c 10000 ./application

    This command generates a profile data file named perf.data.

Note

On most commercial devices, installing perf requires root access.
3. Convert the profile information.

    1. Install the autofdo tool.
    2. Convert the raw profiles into the required sample profile format. For
example:

            create_llvm_prof --binary=./application --out=application.profile
4. Rebuild the application using PGO.

    Enable PGO in the application build by using the profile data
generated in the previous step. For example:

        clang++ -O3 -gline-tables-only -fprofile-sample-use=application.profile source.cc -o application

    The application must be compiled with `-gline-tables-only` to ensure
that the profile information maps accurately back to the source code.

Note

Sample-based profile information can be used even as the
user code changes over time ([Profile resiliency](https://docs.qualcomm.com/doc/80-VB419-99/topic/code_optimization.html#sec-profile-resiliency)).

### Sampling-based PGO on Snapdragon MDP

Snapdragon Mobile Development Platform (MDP) devices are targeted for
application developers, and contain the latest Snapdragon processors
and mobile features. MDP devices additionally include hardware and
software features that specifically support application development.
For detailed information on Snapdragon MDP, go to:

[developer.qualcomm.com/mobile-development/development-devices/mobile-development-platform-mdp](https://developer.qualcomm.com/mobile-development/development-devices/mobile-development-platform-mdp)

One of the MDP developer features is the collection of sample-based
profiles. Typically, a device must be rooted to collect sample data.
However, MDP is preconfigured for this approach, and thus makes
profile collection easy to perform using production applications.

Note

The only additional step necessary is to add the location
of perf to the PATH before using it.

The following procedure explains how to perform sampling-based PGO on
a Snapdragon MDP.

1. Build the application.

    1. Use the compile option `-gline-tables-only`:

            clang++ -gline-tables-only -O2 source.cc -o application
    2. Move the resulting binary file to the MDP.
2. Generate profile information.

    1. Add perf to PATH (perf is pre-installed on a Snapdragon MDP):

            export PATH=/data/data/com.qualcomm.qview/:$PATH
            perf record -e cycles -c 10000 ./application
    2. Move the generated profile data files back to the host.
3. Convert the profile information.

    1. Install the autofdo tool on the host.
    2. Convert the raw profiles into the required sample profile format. For example:

            create_llvm_prof --binary=./application --out=application.profile
4. Rebuild the application using PGO.

    Enable PGO in the application build by using the profile data
generated in step 3:

        clang++ -O3 -gline-tables-only -fprofile-sample-use=application.profile source.cc -o application

### Profile resiliency

Profile information collected for PGO is associated back to your
source code, and then used to perform PGO. As the user source code
changes over time, LLVM will associate as much of the profile
information with the code as it can. In cases where LLVM cannot
associate the profiles back to source code, a warning message is
generated and the unmappable profile information is ignored. The
compiler then continues associating the profiles for the remaining
parts of the user code.

LLVM profiles are thus quite resilient to changes in the source code.
You can reuse the collected application profiles over time, without
needing to re-profile the application every time. LLVM will continue
using the profiles as best as it can. Over time, as the user code
evolves, the utility of these application profiles will degrade, and
they will need to be refreshed. However, these profile refreshes are
usually proportional to the scale of evolution of the application
code.

### PGO tips

- The benefits of using PGO are closely tied to the quality of the
profiles collected. The profiles should reflect the workloads and
user experience that you are trying to optimize performance for.

    Often, collecting profiles while running automated *correctness*
tests for an application does not adequately exercise the hot loops.
In this case, consider creating tests that specifically target what
you are optimizing for. Improved performance of the final
LLVM-generated binary is usually proportional to how relevant the input
profiles are.
- Ensure that the profiles collected cover the different use cases and
are collected over multiple runs of the same input data set
(especially when using sampling-based PGO). The accuracy of
sampling-based profilers tends to improve as the sample coverage
increases.
- PGO has a greater impact on application performance when compiling at
higher optimization levels, especially if PGO is combined with
link-time optimization (LTO).

    With LTO, profile-guided inlining is more powerful because it operates
across module boundaries. LTO also enables profile-guided indirect call
promotion. This optimization resolves the frequent targets
for indirect or virtual calls, and thus improves the performance of
applications with indirect or virtual calls.
- Sampling-based profiling requires using the -g or -gline-tables-only
options. These options help LLVM accurately associate the generated
profiles to source code.
- PGO is resilient to changes in the code. The profiles generated can
be reused over time even as the application code changes. LLVM
adjusts and uses the still-relevant profiles, while ignoring the
profiles it deems outdated.
- When using instrumented PGO, the -static linker option, which is used
to build static executables, is not supported.
- When a program is compiled with -fprofile-instr-generate, errno might
not be initially set to zero at the instrumented executable’s
startup.
- On Android targets, PGO dumps data to logcat and requires the log
library to be either statically (-llog is passed to the linker flags)
or dynamically linked.
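The indirect call promotion mentioned in the tips above can be pictured with a hand-written equivalent. This is only a sketch of the idea under the assumption that the profile showed `increment` to be the most frequent target. All names are hypothetical, and the compiler performs the transformation internally rather than at source level.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*op_fn)(int);

int increment(int x) { return x + 1; }
int negate(int x) { return -x; }

/* Original form: every call goes through the function pointer. */
int apply(op_fn op, int x) {
    assert(op != NULL);
    return op(x);
}

/* Promoted form: the frequent target is tested for explicitly, so the
 * common case becomes a direct call that can also be inlined. Other
 * targets still go through the pointer. */
int apply_promoted(op_fn op, int x) {
    assert(op != NULL);
    if (op == increment)
        return increment(x);   /* direct call on the frequent path */
    return op(x);              /* fallback indirect call */
}
```

Both functions return the same result for every target. Only the cost of the common case changes.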

## Loop optimization pragmas

The compiler supports auto-vectorization pragmas that can be used to
selectively enable and disable auto-vectorization for individual loops.

Note

The compiler always verifies the correctness of any
transformation, and it will not vectorize a loop unless it can prove
it is safe to do so.

### Pragma syntax

The syntax used for loop pragmas follows the conventions used by the
LLVM community. To add a pragma to a loop, specify the pragma
immediately before the target loop, using the following syntax:

    #pragma clang loop <pragma> [...<pragma>]

For detailed descriptions of the supported loop pragmas listed in the
following table, see [Vectorization pragmas](https://docs.qualcomm.com/doc/80-VB419-99/topic/code_optimization.html#sec-vectorization-pragmas) and
[Reporting](https://docs.qualcomm.com/doc/80-VB419-99/topic/code_optimization.html#sec-reporting).

**Table 5-3 Supported vectorization loop pragmas**

| **Name** | **Description** |
| --- | --- |
| vectorize(enable) | Enable auto-vectorization for a loop. |
| vectorize(disable) | Disable auto-vectorization for a loop. |
| vectorize(assume\_safety) | Enable auto-vectorization for a loop<br>without verifying auto-vectorization<br>correctness. |
| vectorize\_width(N) | Enable auto-vectorization for a loop with<br>the specified vector factor N.<br><br><br>The vector factor is the number of<br>iterations that are executed in parallel.<br><br><br>**NOTE:** The value N must be a power of 2. |

### Compile options

The loop pragmas for auto-vectorization take effect whenever the
auto-vectorization transformations are enabled. These transformations
can be enabled explicitly with a compile option (such as
-fvectorize-loops) or implicitly with an optimization level (for
example, auto-vectorization is enabled at -O3).

As long as the corresponding transformation is enabled, no extra
compile options are necessary to cause loop pragmas to take effect.
To have a loop pragma take effect without enabling the transformation
in general, specify the -floop-pragma option. For example, to
vectorize only a specific loop, add the following pragma to the loop
and compile the file with -floop-pragma:

    #pragma clang loop vectorize(enable)

**Table 5-4 Loop pragma options that enable auto-vectorization**

| **Name** | **Description** |
| --- | --- |
| -fvectorize-loops | Enable auto-vectorization for all eligible<br>loops. |
| -floop-pragma | Enable auto-vectorization for loops<br>specified with an enable pragma. |

The -floop-pragma option enables the compiler to vectorize loops with
enable pragmas.

Currently, -floop-pragma must be used to respect the enable pragmas
when auto-vectorization is not otherwise enabled.

Note

This restriction is expected to be lifted in the future so
enable pragmas can be supported without requiring an additional
compile option.

Following are the command option combinations that can enable
auto-vectorization.

**Table 5-5 Loop pragma option combinations**

| **Combination** | **Description** |
| --- | --- |
| -fvectorize-loops<br>-floop-pragma | Enable auto-vectorization for all<br>eligible loops. |

### Vectorization pragmas

The Snapdragon LLVM compiler supports the following vectorization pragmas:

    #pragma clang loop vectorize(enable)
    #pragma clang loop vectorize(disable)
    #pragma clang loop vectorize(assume_safety)
    #pragma clang loop vectorize_width(N)

Note

These are the same vectorization pragmas that are supported
by the LLVM community compiler.

#### #pragma clang loop vectorize(enable)

Enable vectorization for a loop.

This pragma is useful for vectorizing specific loops that would
otherwise not be considered profitable by the compiler.

##### Example 1: Potential code bloat

The following loop is not profitable because of potential large code
bloat. The pragma overrides the compiler profitability heuristic and
enables auto-vectorization.

    void foo(char *A, char *B, char *C, int n) {
       #pragma clang loop vectorize(enable)
       for (int i = 0; i < n; i++) {
          A[5*i] += B[i] * C[i];
          A[5*i+1] += B[i] + C[i];
          A[5*i+2] += B[i] - C[i];
          A[5*i+3] += B[i] * C[i];
       }
    }

##### Example 2: Force auto-vectorization

This pragma can be useful for forcing auto-vectorization of a
specific loop in the loop nest. If you know which loop level is hot,
you can override the default compiler heuristic.

| Specify an outer loop (where n is known to be very large) | Specify an inner loop (where m is known to be very large) |
| --- | --- |
| int A[5000][5000];<br>int B[5000][5000];<br>int C[5000];<br>void foo (int n, int m) {<br>    #pragma clang loop vectorize(enable)<br>    for (int i = 0; i < n; i++)<br>       for (int j = 0; j < m; j++)<br>          A[i][j] = B[j][i];<br>} | int A[5000][5000];<br>int B[5000][5000];<br>int C[5000];<br>void foo (int n, int m) {<br>   for (int i = 0; i < n; i++)<br>      #pragma clang loop vectorize(enable)<br>      for (int j = 0; j < m; j++)<br>         A[i][j] = B[j][i];<br>} |

##### Example 3: Ignored loop

If a loop is annotated with this pragma and the compiler proves that
this annotation is illegal, the compiler ignores the loop and sends a
warning.

    int foo(char *A) {
       int e = A[0], s = 0, i = 0;
       #pragma clang loop vectorize(enable)
       while (i != e) {
          e = A[i]; i++;
          s += e;
       }
       return s;
    }

The following warning is issued for this example:

warning: loop not vectorized: failed explicitly specified loop
vectorization [-Wpass-failed=loop-vectorize]

#### #pragma clang loop vectorize(disable)

Disable vectorization for a loop.

This pragma is used to disable vectorization for a specific loop. It
can be used to avoid vectorizing loops that are not profitable, or to
work around bugs in the vectorizer by not vectorizing loops that are
incorrectly vectorized.

#### #pragma clang loop vectorize(assume\_safety)

Assume that a loop is safe for vectorization.

The compiler typically generates runtime legality checks to ensure
safety. The compiler does not generate checks for the loop specified
by this pragma. It is your responsibility to ensure that the loop can
be legally vectorized.

##### Example 1: Pointer aliasing

For vectorization to be done safely, the compiler generates aliasing
checks to ensure that the pointers do not alias each other. If you
know the pointers can never alias, you can specify a restrict keyword
to the pointers or use this pragma. The checks can be expensive if
there are many pointers in the loop.

    void foo(char *A, char *B, char *C, int n) {
       #pragma clang loop vectorize(assume_safety)
       for (int i = 0; i < n; i++) {
          A[i] += B[i] * C[i];
       }
    }

Without the pragma, the compiler generates runtime checks in this code
example to make sure that A does not alias B or C before it runs the
vectorized loop. With the pragma, no checks are generated.
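A runtime aliasing check of this kind can be sketched by hand as follows. The helper `ranges_overlap` and the function `mul_add` are hypothetical, and the code only illustrates the general idea of guarding a fast loop with an overlap test. It is not the code the compiler generates. The two loop bodies are identical at source level. In compiled code, the guarded copy is the one that would be vectorized.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero when the byte ranges [p, p+n) and [q, q+n) overlap.
 * Addresses are compared as integers to avoid comparing pointers into
 * unrelated objects. */
static int ranges_overlap(const char *p, const char *q, size_t n) {
    uintptr_t a = (uintptr_t)p;
    uintptr_t b = (uintptr_t)q;
    return a < b + n && b < a + n;
}

void mul_add(char *A, const char *B, const char *C, int n) {
    assert(n >= 0);
    size_t len = (size_t)n;
    if (!ranges_overlap(A, B, len) && !ranges_overlap(A, C, len)) {
        /* No aliasing: iterations are independent, so this copy of the
         * loop is the one that could be vectorized. */
        for (int i = 0; i < n; i++)
            A[i] += B[i] * C[i];
    } else {
        /* Possible aliasing: keep the original scalar order. */
        for (int i = 0; i < n; i++)
            A[i] += B[i] * C[i];
    }
}
```

Each additional pointer in the loop adds more range comparisons, which is why the checks become expensive for loops that use many pointers.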

##### Example 2: Data dependence checks

In the following code, the compiler normally generates checks to ensure
that m is larger than the vector width. With the pragma, no checks are
generated.

    void foo1(char *A, int n, int m) {
       #pragma clang loop vectorize(assume_safety)
       for (int i = 0; i < n; i++) {
          A[i+m] = A[i];
       }
    }
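The reason the distance m matters can be demonstrated with a small emulation. The sketch below assumes a vector factor of 4 (`VF`) and uses hypothetical function names. `copy_vector_emulated` mimics vector execution by loading `VF` elements before storing any of them. When m is smaller than `VF`, a store from one iteration should feed a load in the same group, so the emulated vector loop produces a different result from the scalar loop.

```c
#include <assert.h>
#include <string.h>

#define VF 4  /* illustrative vector factor (an assumption) */

/* Reference behavior: one element at a time, in source order. */
void copy_scalar(char *A, int n, int m) {
    for (int i = 0; i < n; i++)
        A[i + m] = A[i];
}

/* Emulated vector execution: each step loads VF elements first and
 * only then stores them. n must be a multiple of VF in this sketch. */
void copy_vector_emulated(char *A, int n, int m) {
    assert(n % VF == 0);
    for (int i = 0; i < n; i += VF) {
        char tmp[VF];
        memcpy(tmp, &A[i], VF);       /* "vector load"  */
        memcpy(&A[i + m], tmp, VF);   /* "vector store" */
    }
}
```

With m equal to 4 both functions leave the array in the same state. With m equal to 1 they do not, which is the situation the runtime check (or your own guarantee when using the pragma) must rule out.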

##### Example 3: Ignored loop

If a loop is annotated with this pragma and the compiler proves that
this annotation is illegal, the compiler ignores the loop and sends a
warning.

    int foo(char *A) {
       int e = A[0], s = 0, i = 0;
       #pragma clang loop vectorize(assume_safety)
       while (i != e) {
          e = A[i]; i++;
          s += e;
       }
       return s;
    }

The following warning is issued for this example:

warning: loop not vectorized: failed explicitly specified loop
vectorization [-Wpass-failed=loop-vectorize]

#### #pragma clang loop vectorize\_width(N)

Set the vector factor used to vectorize a loop.

The vector factor determines how many iterations of a loop are done
in parallel. The vector width must be a power of 2. Invalid vector
widths are ignored. If the vector width is greater than the size of
the vector register, the loop is unrolled until the specified vector
width is reached. For example, if the vector width is set to 16 and
the vector register holds 4 elements, the loop is unrolled 4 times to
achieve the requested vector width.

Setting the vector width to a value greater than 1 adds an implicit
vectorize(enable) pragma to the loop. Setting the vector width to 1
is equivalent to using a vectorize(disable) pragma.
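The arithmetic in the example above can be written out as a short sketch. The helpers `is_power_of_two` and `unroll_count` are hypothetical and only mirror the description in the text. They are not part of the compiler. `add_one` shows the pragma applied to a loop. The pragma is guarded so that compilers other than Clang simply skip it.

```c
#include <assert.h>

/* Nonzero when n is a power of 2 (1, 2, 4, 8, ...). */
int is_power_of_two(unsigned n) {
    return n != 0 && (n & (n - 1)) == 0;
}

/* Hypothetical helper: number of times the loop is unrolled so that a
 * register holding lanes_per_register elements reaches the requested
 * vector width. */
unsigned unroll_count(unsigned requested_width, unsigned lanes_per_register) {
    assert(is_power_of_two(requested_width));
    assert(is_power_of_two(lanes_per_register));
    if (requested_width <= lanes_per_register)
        return 1;
    return requested_width / lanes_per_register;
}

/* Request a vector factor of 4 for this loop. */
void add_one(int *A, int n) {
#ifdef __clang__
    #pragma clang loop vectorize_width(4)
#endif
    for (int i = 0; i < n; i++)
        A[i] += 1;
}
```

For the case described in the text, `unroll_count(16, 4)` evaluates to 4.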

### Reporting

The presence of a loop pragma can affect which reports are
generated for a loop. The compile option -floop-pragma has no impact
on the reports generated by the auto-vectorizer when
auto-vectorization is enabled. When auto-vectorization is disabled,
-floop-pragma triggers reporting only for loops that have pragmas.

The following table shows the interaction between reporting, options,
and loop pragmas. Y indicates that the option is enabled (either from
the command line or implicitly by the optimization level),
while an X indicates that the option is disabled (either explicitly on
the command line or by not appearing).

**Table 5-6 Loop optimization reporting**

| -fvectorize-loops | -floop-pragma | Report content |
| --- | --- | --- |
| X | X | No reporting |
| X | Y | Report vectorization results only<br>for loops with enable pragmas |
| Y | X | Report vectorization results for all loops |
| Y | Y | Report vectorization results for all loops |

This table assumes that all report data is requested
(-fopt-reporter=all). The reports can be further filtered using the
usual mechanism of passing a specific transformation to the `-fopt-reporter` option.

A new report code has been added for loops that are explicitly
disabled by a loop pragma. If the loop would otherwise be vectorized
but has been disabled by a loop pragma, a *loop failed* report is
generated with a *loop pragma disable* reason code.

### Examples

This section presents a number of examples showing how to use pragmas
and command options to perform loop vectorization. The examples are
not exhaustive; they are intended to show how to achieve specific
results.

#### Vectorize only a specific loop

This example demonstrates how to restrict auto-vectorization to only
act on a specific loop.

##### Command line

    clang -Os -floop-pragma

##### Pragma

    #pragma clang loop vectorize(enable)

##### Example

Typically, vectorization is disabled at `-Os`, but the pragma and
`-floop-pragma` option ensure that the loop is vectorized.

    void foo(int *A, int N) {
       #pragma clang loop vectorize(enable)
       for (int i = 0; i < N; ++i)
          A[i] += 1;
    }
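To make the "specific loop" part concrete, the following sketch places two loops in one source file. Assuming the behavior described above, compiling it with `clang -Os -floop-pragma` would leave only the annotated loop eligible for vectorization. The function names `scale_hot` and `init_cold` and the file name `loops.c` are hypothetical.

```c
/* Build as described above: clang -Os -floop-pragma -c loops.c
   (loops.c is a hypothetical file name). */

/* Hot loop: the pragma requests vectorization even at -Os. */
void scale_hot(int *A, int N) {
    #pragma clang loop vectorize(enable)
    for (int i = 0; i < N; ++i)
        A[i] *= 2;
}

/* Cold loop: no pragma, so it is left to the -Os defaults. */
void init_cold(int *A, int N) {
    for (int i = 0; i < N; ++i)
        A[i] = i;
}
```

The pragma changes only how the compiler is asked to treat the loop; both functions compute the same results whether or not vectorization takes place.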

#### Disable vectorization of a specific loop

This example demonstrates how to disable auto-vectorization of a
specific loop.

##### Command line

    clang -mfpu=neon -mcpu=cortex-a57 -Ofast -fvectorize-loops

##### Pragma

    #pragma clang loop vectorize(disable)

##### Example

The pragma ensures that the loop is not vectorized even though the
`-fvectorize-loops` option is specified on the command line.

    void foo(int *A, int N) {
       #pragma clang loop vectorize(disable)
       for (int i = 0; i < N; ++i)
          A[i] += 1;
    }

#### Vectorize a non-profitable loop

The auto-vectorizer might decide that a loop is not profitable to
vectorize and disable vectorization of the loop. In this case, a loop
pragma can be used to specifically enable vectorization of the loop.

##### Command line

    clang -mfpu=neon -mcpu=cortex-a57 -Ofast -fvectorize-loops

##### Pragma

    #pragma clang loop vectorize(enable)

##### Example

Enable vectorization for the inner loop. Without the pragma, the
auto-vectorizer could decide that the loop is not profitable to
vectorize.

    void foo(int *A, int n) {
       for (int j = 0; j < n; j++) {
          int *p = A + 4*j;
          #pragma clang loop vectorize(enable)
          for (int i = 0; i < 4; i++)
             p[i] += 1;
       }
    }

#### Vectorize a loop with a different vector factor

The auto-vectorizer chooses a vector factor for the loop based on an
internal heuristic. This can be overridden by using a loop pragma.

##### Command line

    clang -mfpu=neon -mcpu=cortex-a57 -Ofast -fvectorize-loops

##### Pragma

    #pragma clang loop vectorize_width(16)

##### Example

Auto-vectorize the loop in function foo, and enforce a vector factor
of 16. Without the pragma, the vectorizer could choose a different
vector factor.

    void foo(int *A, int n) {
       #pragma clang loop vectorize_width(16)
       for (int i = 0; i < n; i++)
          A[i] += 1;
    }

## Optimization reports

Optimization reports are a new compiler reporting mode that can be
used to obtain information on why a loop is not auto-vectorized or
auto-parallelized.

**NOTE:** This feature is under development and is subject to change
in future releases. We encourage you to experiment with this feature
and provide feedback on its usefulness.

The optimization report is a performance tool whose main purpose is
to provide feedback on why a loop could not be vectorized or
parallelized. It is particularly useful when you have a loop you want
to optimize, but the compiler optimizations are not working on the
loop. Using optimization reports, you can learn why the compiler
could not optimize the loop, and possibly take action to enable the
specified optimization.

Using optimization reports to analyze a loop is an iterative process.
There might be multiple reasons why a loop cannot be transformed. The
compiler will only report the first problem it finds with the loop.
After fixing the initial problem, there might be additional problems
with the loop that will be reported (by recompiling the modified
source code), and will need to be fixed before the loop is finally
optimized.

The optimization report extends the community’s LLVM optimization
report for auto-vectorization and auto-parallelization optimizations. The
standard LLVM options for enabling community optimization reports are
described at:

[clang.llvm.org/docs/UsersManual.html#options-to-emit-optimization-reports](http://clang.llvm.org/docs/UsersManual.html#options-to-emit-optimization-reports)

To enable loop optimization reporting output from the compiler,
specify the pass name as loop-opt. Two options are used to output the
compiler remarks:

- `-Rpass=loop-opt` outputs the line number of the loops that were
auto-parallelized or vectorized, depending on what optimization is
enabled by the compile options.
- `-Rpass-missed=loop-opt` outputs the line number and the reason why the
loop was not optimized.

### Example output

The following example of an optimization report shows the messages
you will see when a loop is successfully vectorized.

    $ cat t.c
    void v1(int *A, int *B, int N) {
       for (int i = 0; i < N; ++i)
          A[i] += B[i];
    }

    $ clang -mfpu=neon -mcpu=cortex-a57 -Ofast -c -g -Rpass=loop-opt t.c
    t.c:2:3: remark: Vectorized loop. [-Rpass=loop-opt]
       for (int i = 0; i < N; ++i)
       ^

### Optimization report message details

This section describes the most common messages produced by the
compiler. Each message description includes an example of what code
triggers the message, along with potential actions you can take to
avoid the problem and vectorize the loop.

#### Unsupported control flow

The unsupported control flow message indicates that a loop contains
control flow and cannot be vectorized. This is the most common
message you are likely to encounter. Every outer loop in a loop nest
is marked as invalid for this reason, because the inner loop it
contains counts as control flow. In many cases the control flow in
an inner loop is unavoidable, but sometimes you can rewrite the code
slightly to make it friendlier for the vectorizer.

    void foo(int *A, int *B, int N, int c, int d, int e) {
       for (int i = 0; i < N; ++i) {
          if (A[i] < c)
             B[i] += d;
          else if (A[i] > c)
             B[i] += e;
       }
    }

    t.c:2:8: remark: Loop body contains unsupported control flow [-Rpass-missed=loop-opt]
       for (int i = 0; i < N; ++i) {

The control flow could be eliminated by the compiler if there was a
store to B[i] in all cases. In this example, an else clause can be
added, which enables the compiler to remove the control flow and
vectorize the loop:

    void foo(int *A, int *B, int N, int c, int d, int e) {
       for (int i = 0; i < N; ++i) {
          if (A[i] < c)
             B[i] += d;
          else if (A[i] > c)
             B[i] += e;
          else
             B[i] = B[i];
       }
    }

    t.c:2:3: remark: Vectorized loop. [-Rpass=loop-opt]
       for (int i = 0; i < N; ++i) {
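Another way to give every iteration an unconditional store to `B[i]` is to compute the increment with a conditional expression. The following is a minimal sketch, not a guaranteed fix: whether the loop is actually vectorized depends on the compiler version and target, so confirm it with `-Rpass=loop-opt`. The name `foo_select` is hypothetical.

```c
/* Same result as the if / else-if version: add d when A[i] < c,
   add e when A[i] > c, and add 0 when A[i] == c. */
void foo_select(const int *A, int *B, int N, int c, int d, int e) {
    for (int i = 0; i < N; ++i) {
        int delta = (A[i] < c) ? d : ((A[i] > c) ? e : 0);
        B[i] += delta;   /* one unconditional store per iteration */
    }
}
```

This form keeps the loop body free of branches in the source, which makes the intent (select a value, then store) explicit.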

#### Non-affine loop bound

The loop optimizer requires all loop bounds to be affine, that is, a
linear function of the loop induction variable. If the loop bound is
not affine (meaning that the number of iterations of the loop cannot
be analyzed), the loop is marked as invalid for optimization.

    typedef struct S {
       int a;
       struct S *next;
    } S;

    int foo(S *s) {
       while (s->next != 0) {
          s->a += 1;
          s = s->next;
       }
       return 0;
    }

    t.c:8:5: remark: Failed to derive an affine function from the loop bounds. [-Rpass-missed=loop-opt]
          s->a += 1;
          ^

The loop bound is non-affine because the compiler cannot determine
ahead of time how many iterations the loop will execute: the count
depends on the length of the list of S structures. Contrast this case
with a standard for loop (such as `for (int i = 0; i < N; ++i) {...}`),
where it is known that the loop will execute N times.
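If the values can be held in contiguous storage, the pointer-chasing walk can be separated from the arithmetic so that the arithmetic runs in a counted loop. The following is a sketch under that assumption and is not a drop-in replacement for the list loop above (it processes every node, including the last). The helper names `list_to_array` and `add_one` are hypothetical.

```c
#include <stddef.h>

typedef struct S {
    int a;
    struct S *next;
} S;

/* Walk the list once and copy up to cap values into a contiguous
   buffer. Returns the number of values copied. This loop still
   chases pointers, so its trip count is not known on entry. */
int list_to_array(const S *s, int *out, int cap) {
    int n = 0;
    for (; s != NULL && n < cap; s = s->next)
        out[n++] = s->a;
    return n;
}

/* Counted loop over contiguous data: it executes exactly n times. */
void add_one(int *vals, int n) {
    for (int i = 0; i < n; ++i)
        vals[i] += 1;
}
```

Only `add_one` has an affine loop bound; the copy is worthwhile only when the work done per element is large enough to pay for it.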

    void foo(int *A, unsigned int N) {
       for (unsigned i = 0; i < N; i+=2) {
          A[i] += 1;
       }
    }

    t.c:3:5: remark: Failed to derive an affine function from the loop bounds. [-Rpass-missed=loop-opt]
          A[i] += 1;
          ^

This example shows the problem of using unsigned variables for the
loop index, with a non-unit step. On each iteration, the loop
induction variable increases by two. Because the variable is
unsigned, the C language requires that the value wrap if it reaches
the max unsigned integer value. Because the variable might wrap, it
is impossible for the compiler to compute how many iterations the
loop might execute.

This problem can be fixed by using an int for the loop variable.
Unlike unsigned arithmetic, overflow of a signed int is undefined
behavior in C. The compiler can exploit this fact to assume that the
value does not wrap, and compute how many times the loop executes
(N/2, rounded up, in this case).

    void foo(int *A, unsigned int N) {
       for (int i = 0; i < N; i+=2) {
          A[i] += 1;
       }
    }

    t.c:2:3: remark: Vectorized loop. [-Rpass=loop-opt]
       for (int i = 0; i < N; i+=2) {
       ^
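The wrapping argument can be checked with a little arithmetic. With an unsigned index and a step of two, a bound of `UINT_MAX` is never reached: the index climbs through the even values to `UINT_MAX - 1` and the next step wraps it to 0, so `i < N` stays true. The helpers below are a hypothetical sketch that makes the wrap and the signed trip count explicit.

```c
#include <limits.h>

/* One step of an unsigned induction variable: i += 2.
   Unsigned arithmetic is defined to wrap modulo UINT_MAX + 1. */
unsigned next_index_unsigned(unsigned i) {
    return i + 2u;
}

/* Nonzero when the step i += 2 would wrap around to a small value. */
int index_wraps(unsigned i) {
    return i > UINT_MAX - 2u;
}

/* Trip count of: for (int i = 0; i < n; i += 2), for a signed bound n.
   Equal to n / 2 rounded up when n is positive, otherwise 0. */
int trip_count_step2(int n) {
    return (n <= 0) ? 0 : (n - 1) / 2 + 1;
}
```

A closed-form count like `trip_count_step2` exists only because the signed index is assumed not to overflow.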

#### Unspecified error

This message is generated in cases where a problem cannot be easily
described in terms of actionable error messages. One example of when
this message is generated is from the complex control flow
surrounding a loop.

    int bar();
    void foo(int *A, int N) {
       while(1) {
          while (*A < 10) {
             if (bar())
                (*A++) += 1;
             else
                break;
          }
          if (*A == 100)
             break;
       }
    }

    t.c:3:3: remark: Unspecified error. [-Rpass-missed=loop-opt]
       while(1) {
       ^
    t.c:4:5: remark: Unspecified error. [-Rpass-missed=loop-opt]
          while (*A < 10) {
          ^

#### Non-loop-invariant loop bound

This message is generated when the compiler cannot prove that the
loop bound does not change during execution of the loop. You can fix
the problem by hoisting the loop bound computation out of the loop.

    int bar(int);
    void n3(int *A, int *B, int N) {
       for (int i = 0; i < bar(N); ++i)
          A[i] += B[i];
    }

    t.c:4:5: remark: Loop bound may change between two different loop iterations. [-Rpass-missed=loop-opt]
          A[i] += B[i];
          ^

In this example, the loop bound is computed as the return value from
the bar() function. The compiler cannot see the definition of bar(),
so it assumes that it must be computed on each loop iteration. The
fix is to hoist the call out of the loop.

    int bar(int);
    void n3(int *A, int *B, int N) {
       int Bound = bar(N);
       for (int i = 0; i < Bound; ++i)
          A[i] += B[i];
    }

    t.c:4:3: remark: Vectorized loop. [-Rpass=loop-opt]
       for (int i = 0; i < Bound; ++i)
       ^
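Hoisting is a source-level change in behavior, so it is safe only when the bound function returns the same value on every call and has no side effects the program relies on. The sketch below illustrates the difference by counting calls; the counter `bar_calls`, this definition of `bar`, and the names `n3_unhoisted` and `n3_hoisted` are assumptions made for illustration.

```c
int bar_calls = 0;   /* hypothetical call counter, for illustration only */

/* Stand-in for the externally defined bound function. */
int bar(int n) {
    ++bar_calls;
    return n;
}

/* Bound re-evaluated before every iteration: N + 1 calls for N >= 0. */
void n3_unhoisted(int *A, int *B, int N) {
    for (int i = 0; i < bar(N); ++i)
        A[i] += B[i];
}

/* Bound evaluated once before the loop: exactly one call. */
void n3_hoisted(int *A, int *B, int N) {
    int Bound = bar(N);
    for (int i = 0; i < Bound; ++i)
        A[i] += B[i];
}
```

Both versions produce the same array contents here because this `bar` is a pure function of its argument.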

#### Inst\_FuncCall

This message is generated when the loop body contains a function
call. You can work around the problem by inlining the function call
into the loop body (if possible).

    int inc(int);
    void n5(int *A, int *B, int N) {
       for (int i = 0; i < N; ++i)
          A[i] = inc(B[i]);
    }

    t.c:4:12: remark: This function call cannot be handled. Try to inline it. [-Rpass-missed=loop-opt]
          A[i] = inc(B[i]);
                 ^

If the function body is known, you can either inline the definition
into the loop by hand, or add `__attribute__((always_inline))` to the
function definition. This example assumes that `inc()` is a simple
function that increments its argument.

    void n5(int *A, int *B, int N) {
       for (int i = 0; i < N; ++i)
          A[i] = B[i] + 1;
    }

    t.c:3:3: remark: Vectorized loop. [-Rpass=loop-opt]
       for (int i = 0; i < N; ++i)
       ^
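The attribute-based alternative keeps the call in the source and asks the compiler to inline it. This is a minimal sketch, assuming the definition of `inc()` can be made visible in the same translation unit as the loop; the body shown for `inc()` is the assumed one described above.

```c
/* The definition must be visible where the loop is compiled.
   always_inline asks the compiler to inline every call to inc(). */
static inline __attribute__((always_inline)) int inc(int x) {
    return x + 1;
}

void n5(int *A, int *B, int N) {
    for (int i = 0; i < N; ++i)
        A[i] = inc(B[i]);   /* after inlining, the body is A[i] = B[i] + 1 */
}
```

Keeping the helper as a function avoids duplicating its body at every call site while still removing the call from the loop.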

#### Base pointer not loop invariant

This message indicates that a pointer used to access memory might
change during execution of the loop. To successfully vectorize a
loop, the compiler depends on having base values that do not move
during the loop. The problem might not always be obvious when
examining the source code, because it could be caused by potential
aliasing of values in the loop.

    typedef struct { int **b; } S;
    void foo(S *A, int N) {
       for (int i = 0; i < N; ++i)
          A->b[i] = 0;
    }

    t.c:6:5: remark: The base address of this array is not invariant inside the loop [-Rpass-missed=loop-opt]
          A->b[i] = 0;
          ^

In this example, the base value is loaded from the A structure at
each iteration of the loop. The loop can be vectorized if the load of
the base pointer is hoisted out of the loop.

    typedef struct { int **b; } S;
    void foo(S *A, int N) {
       int **b = A->b;
       for (int i = 0; i < N; ++i)
          b[i] = 0;
    }

    t.c:6:3: remark: Vectorized loop. [-Rpass=loop-opt]
       for (int i = 0; i < N; ++i)
       ^

#### Non-affine memory access

This message indicates that a memory access in the loop is
non-affine, meaning that it is not a linear function of the loop
induction variable. Often, these accesses are the result of double
indirections in the memory access, but they can also arise from
non-linear arithmetic (for example, `A[i*i]`, `A[i%n]`).

    void n4(int *A, int *B, int N) {
       for (int i = 0; i < N; ++i)
          A[B[i]] += 1;
    }

    t.c:3:5: remark: The array subscript of "A" is not affine [-Rpass-missed=loop-opt]
          A[B[i]] += 1;
          ^

In this example, the double indirection is the problem. The memory
location accessed in the A array is read from the B array, which
makes the access to A non-affine. If possible, try to remove the
double indirection in order to vectorize the loop.
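For the non-linear arithmetic case, a subscript such as `i % n` can sometimes be made affine by restructuring the loop. The sketch below walks the input in blocks of `n` so that the inner loop uses only linear subscripts. It is an illustration under the assumption that `n > 0`, not a statement about what the compiler reports for it; the names `wrap_add` and `wrap_add_blocked` are hypothetical.

```c
/* Original form: the subscript i % n is not affine in i. Requires n > 0. */
void wrap_add(int *A, const int *B, int N, int n) {
    for (int i = 0; i < N; ++i)
        A[i % n] += B[i];
}

/* Rewritten form: process B in blocks of n elements, so the inner
   subscripts (j and base + j) are linear in j. Elements are visited
   in the same order as in wrap_add. Requires n > 0. */
void wrap_add_blocked(int *A, const int *B, int N, int n) {
    for (int base = 0; base < N; base += n) {
        int len = (N - base < n) ? (N - base) : n;   /* last block may be short */
        for (int j = 0; j < len; ++j)
            A[j] += B[base + j];
    }
}
```

The outer loop still contains an inner loop, so only the inner loop is a candidate for vectorization.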

#### Memory alias

This message indicates that the compiler was unable to vectorize the
loop because of aliasing problems with pointers in the loop.
Normally, the compiler will insert runtime checks to disambiguate the
pointers to enable vectorization. However, if there are too many
pointers the runtime checks will not be inserted because the checks
themselves might be more costly than the benefit gained from
vectorizing the loop.

To fix this error, increase the number of allowed runtime checks by
using the `-mllvm -polly-max-pointer-aliasing-checks` option, or add
`restrict` to the pointer parameters of the function.

    void n4(int *A, int *B, int *C, int *D, int *E, int N) {
       for (int i = 0; i < N; ++i)
          A[i] = B[i] + C[i] + D[i] + E[i] + 1;
    }

    t.c:3:5: remark: Accesses to the arrays "B", "C", "D", "E", "A" may access the same memory. [-Rpass-missed=loop-opt]
          A[i] = B[i] + C[i] + D[i] + E[i] + 1;
          ^

The compiler reports an aliasing issue with the pointers in the loop.
In this case, the number of runtime checks can be increased by using
the following option to vectorize the loop:

    -mllvm -polly-max-pointer-aliasing-checks=5

Alternatively, restrict can be added to the function parameters to
tell the compiler that the pointers do not alias. Adding restrict is
the preferred fix in this case because it avoids the overhead of
runtime checks and leads to more efficient code.

    void n4(int * restrict A, int * restrict B, int * restrict C, int * restrict D, int * restrict E, int N) {
       for (int i = 0; i < N; ++i)
          A[i] = B[i] + C[i] + D[i] + E[i] + 1;
    }

    t.c:2:3: remark: Vectorized loop. [-Rpass=loop-opt]
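Because `restrict` removes the runtime checks, the no-alias guarantee becomes the caller's responsibility. The following sketch uses a smaller, hypothetical function (`add_arrays`) to show a call that honors the promise and, in a comment, one that would not.

```c
/* restrict is a promise made by the caller: during the call, the
   elements written through dst are not accessed through x or y. */
void add_arrays(int * restrict dst, const int * restrict x,
                const int * restrict y, int N) {
    for (int i = 0; i < N; ++i)
        dst[i] = x[i] + y[i] + 1;
}

int example_call(void) {
    int x[4] = {1, 2, 3, 4};
    int y[4] = {10, 20, 30, 40};
    int dst[4];

    add_arrays(dst, x, y, 4);       /* fine: three distinct arrays */
    /* add_arrays(x + 1, x, y, 3);     undefined behavior: dst overlaps x */
    return dst[3];
}
```

If a caller cannot guarantee that the arrays are distinct, the runtime-check option is the safer of the two fixes.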

Last Published: May 10, 2024
