2.2.1 uses for auto-vectorization

It is the job of a compiler to create the best machine code from a given source. The executable code should make use of the target's features (in terms of instruction set and micro-architecture) as much as is allowed under the "as-if" rule of C++ [48, §1.9 p1]. Therefore, auto-vectorization is an important optimization step. Many codes can benefit from it with little or no extra effort on the developer's side. This is important for low-budget development or for developers without experience in parallel programming. Auto-vectorization can also be used for highly optimized codes, though in this case explicit loop vectorization should be preferred, because the implicit expression of parallelism is fragile in terms of maintainability.

2.2.2 inhibitors to auto-vectorization

Intel documented the capabilities and limitations of the ICC5 auto-vectorizer in detail [4]. They point out several obstacles and inhibitors to vectorization:

Non-contiguous memory accesses: Non-contiguous memory accesses lead to inefficient vector loads/stores. Thus the compiler has to decide whether the computational speedup outweighs the inefficient loads/stores.
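The contrast can be sketched with two simple loops; the function names and the stride of 4 are illustrative assumptions, not taken from the cited documentation:

```cpp
#include <cstddef>

// Unit-stride access: contiguous vector loads/stores, easily vectorized.
void scale_contiguous(float *v, std::size_t n, float f) {
  for (std::size_t i = 0; i < n; ++i)
    v[i] *= f;
}

// Stride-4 access (e.g. one channel of interleaved data): the vectorizer
// must emit gather/scatter or strided loads, whose cost may negate the
// computational speedup and cause the compiler to keep the loop scalar.
void scale_strided(float *v, std::size_t n, float f) {
  for (std::size_t i = 0; i < n; i += 4)
    v[i] *= f;
}
```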

Data dependencies: Data dependencies between the iteration steps make it hard or impossible for the compiler to execute loop steps in parallel.
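A minimal example of such a loop-carried dependency, contrasted with a dependency-free loop (the functions are illustrative, not from the cited documentation):

```cpp
#include <cstddef>

// Loop-carried dependency: each iteration reads the result of the previous
// one (an in-place prefix sum), so iterations cannot simply be executed in
// parallel SIMD lanes.
void prefix_sum(float *v, std::size_t n) {
  for (std::size_t i = 1; i < n; ++i)
    v[i] += v[i - 1];
}

// No cross-iteration dependency: each output element depends only on the
// inputs at the same index, so the loop vectorizes trivially.
void add_arrays(const float *a, const float *b, float *out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    out[i] = a[i] + b[i];
}
```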

Countable: If the loop is not countable (i.e. its trip count is not known when the loop is entered), the compiler will not vectorize it.

Limited branching: Branching inside a loop leads to masked assignment. Instead of actual branches in the machine code, the auto-vectorizer has to emit code that executes all branches and blends their results according to the condition on each SIMD lane. If the compiler determines that the amount of branching negates the improvement of vectorization, it will skip the loop.
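As an illustrative sketch of such a loop (the function is a made-up example):

```cpp
#include <cstddef>

// A branch in the loop body. In the vectorized code there is no jump per
// element: the condition is evaluated for all SIMD lanes at once and the
// assignment is applied only to the lanes where it holds (a masked
// assignment / blend).
void clamp_negative(float *v, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    if (v[i] < 0.f)
      v[i] = 0.f;  // becomes a per-lane blend in the vectorized loop
  }
}
```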

Outer loops: Outer loops (loops containing other loops) are only vectorized after certain code transformations have succeeded.
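For example, in a loop nest like the following (an illustrative sketch), only the inner loop is a direct candidate; vectorizing over the outer loop would first require a transformation such as loop interchange or collapsing:

```cpp
#include <cstddef>

// Only the inner loop over columns iterates over contiguous memory and is
// a direct vectorization candidate; the outer loop over rows is not.
void scale_matrix(float *m, std::size_t rows, std::size_t cols, float f) {
  for (std::size_t r = 0; r < rows; ++r)    // outer loop: not vectorized directly
    for (std::size_t c = 0; c < cols; ++c)  // inner loop: vectorized
      m[r * cols + c] *= f;
}
```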

Function calls: Function calls inhibit vectorization. However, functions that can be inlined and intrinsic math functions are exceptions to this rule.

5 Intel C++ Compiler

Thread interaction: Any calls to mutexes or atomics inhibit auto-vectorization.

Auto-vectorization in GCC [5] additionally documents that for some SIMD transformations the order of arithmetic operations must be modified. Since this kind of optimization can change the result when applied to floating-point variables, it would deviate from the C++ standard. The standard specifies that operators of equal precedence are evaluated from left to right. Consequently, Auto-vectorization in GCC [5] recommends the -ffast-math or -fassociative-math flags for best auto-vectorization results.
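Why reordering matters for floating-point can be seen in a small sketch: a left-to-right sum as the standard requires, versus a reassociated sum as a two-lane SIMD reduction might compute it (the functions and input values are illustrative assumptions):

```cpp
#include <cstddef>

// Sum strictly left to right, as the C++ evaluation rules require.
float sum_ltr(const float *v, std::size_t n) {
  float s = 0.f;
  for (std::size_t i = 0; i < n; ++i)
    s += v[i];
  return s;
}

// Reassociated sum: two partial sums, as a 2-lane SIMD reduction might do.
float sum_pairwise(const float *v, std::size_t n) {
  float even = 0.f, odd = 0.f;
  for (std::size_t i = 0; i + 1 < n; i += 2) {
    even += v[i];
    odd += v[i + 1];
  }
  return even + odd;
}

// With v = {1.0e8f, 1.f, -1.0e8f, 1.f}:
//   sum_ltr      yields 1.f (the 1.f added to 1.0e8f is lost to rounding),
//   sum_pairwise yields 2.f (reordering keeps both 1.f contributions).
```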

2.2.3 limitations of auto-vectorization

In most cases, auto-vectorization cannot instantly yield the best vectorization of a given algorithm (if at all). This is in part due to the limited freedom the compiler has in transforming the code, such as aliasing issues, fixed data structures, pointer alignment, function signatures, and conditional statements. Another important part is that the user did not express the inherent data-parallelism explicitly. The compiler has to infer information about the problem that got lost in translation to source code. Finally, the user has to limit himself to a subset of the language, as listed in Section 2.2.2.

If a user wants to vectorize his/her code via auto-vectorization, it is necessary to let the compiler report on its auto-vectorization progress and use this information to adjust the code for the needs of the vectorizer. Sometimes this can require larger structural changes to the code, especially because data storage must be transformed from arrays of structures to structures of arrays. Additionally, users should add annotations about the alignment of pointers to improve vectorization results.
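The array-of-structures to structure-of-arrays transformation can be sketched as follows (the point types and functions are illustrative assumptions):

```cpp
#include <cstddef>

// Array of structures (AoS): the x and y of one point are adjacent in
// memory, so loading four x values into a vector register needs a
// strided gather.
struct PointAoS { float x, y; };

// Structure of arrays (SoA): all x values are contiguous, so vector
// loads and stores are unit-stride and the loop vectorizes well.
struct PointsSoA { float *x; float *y; };

void translate_aos(PointAoS *p, std::size_t n, float dx, float dy) {
  for (std::size_t i = 0; i < n; ++i) {
    p[i].x += dx;
    p[i].y += dy;
  }
}

void translate_soa(PointsSoA p, std::size_t n, float dx, float dy) {
  for (std::size_t i = 0; i < n; ++i) {
    p.x[i] += dx;
    p.y[i] += dy;
  }
}
```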

2.2.3.1 aliasing

Auto-vectorization relies on iterations over arrays to express the per-iteration input and output data. Since structures of arrays work best for the vectorizer, the loop body typically dereferences several pointers to the same fundamental data type. The compiler must account for the possibility that these pointers are equal or point to overlapping arrays. In that case, any assignment to such an array potentially alters input values and might create cross-iteration dependencies.
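A common way to communicate non-aliasing is the `__restrict__` qualifier, a compiler extension available in GCC, Clang, and ICC (standard C has `restrict`; standard C++ has no equivalent keyword). The following sketch contrasts the two cases:

```cpp
#include <cstddef>

// Without further information the compiler must assume that `out` may
// overlap `a` or `b`; a store to out[i] could then change a later input
// element, forcing conservative code or a runtime overlap check.
void add_may_alias(const float *a, const float *b, float *out,
                   std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    out[i] = a[i] + b[i];
}

// __restrict__ promises that the arrays do not overlap, removing the
// assumed cross-iteration dependency.
void add_no_alias(const float *__restrict__ a, const float *__restrict__ b,
                  float *__restrict__ out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    out[i] = a[i] + b[i];
}
```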

float add_one(float in) {
  return in + 1.f;
}

Listing 2.2: A simple function.

float *data = ...
for (int i = 0; i < N; ++i) {
  data[i] = add_one(data[i]);
}

Listing 2.3: A function call in a loop.

2.2.3.2 fixed data structures

A major factor for the efficiency of a vectorized algorithm is how the conversion from 𝒲_T scalar objects in memory to a single vector register and back is done.

The compiler has no freedom to improve the data layout, which is fully defined by the types used by the algorithm. At the same time, the scalar expression of the algorithm conceals the problem from the developer, who is often oblivious to the limitations of the vector load and store operations of the hardware.

2.2.3.3 pointer alignment

In most cases the compiler cannot deduce whether a given pointer uses the necessary over-alignment for more efficient vector load and store instructions. Without extra effort, and an understanding of hardware details, the user will therefore create a more complex vectorization of the algorithm than necessary.
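One way to provide this information is to over-align the storage with `alignas` and, for pointers whose definition the compiler cannot trace, to add a hint such as `__builtin_assume_aligned` (a GCC/Clang extension, not standard C++). The buffer, function, and 32-byte AVX-register alignment below are illustrative assumptions:

```cpp
#include <cstddef>

// Over-align the storage to a 32-byte boundary (the size of one AVX
// register), so aligned vector loads/stores are possible.
alignas(32) static float buffer[1024];

void scale_aligned(float *p, std::size_t n, float f) {
  // GCC/Clang extension: tell the optimizer that p is 32-byte aligned.
  float *v = static_cast<float *>(__builtin_assume_aligned(p, 32));
  for (std::size_t i = 0; i < n; ++i)
    v[i] *= f;
}
```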

2.2.3.4 function signatures

Consider a simple function like the one in Listing 2.2. A function defines an interface for data input and/or output. There are fixed rules for how to translate such code to a given ABI6. For example, the in parameter has to be stored in bits 0–31 of register xmm0 on x86_64 Linux. Unless the compiler is allowed/able to inline the function, the function itself cannot be vectorized. Neither can a calling loop, such as the one in Listing 2.3, be vectorized in this case.

In theory, the above limitation can be solved with LTO7. With LTO the compiler has more opportunity to inline functions since it still has access to the abstract syntax tree of the callee when the optimizer vectorizes the caller loop.

6 Application Binary Interface
7 Link Time Optimization