
3.5 OTHER WORK ON SIMD TYPES

Since the first public release of the Vc library in 2009 there have been several new projects that abstract SIMD types in a way very similar to the basic idea discussed here. The main difference between the interface abstraction of Vc and the other libraries is that Vc abstracts data-parallel programming as independently from 𝒲T as possible. The other libraries encode 𝒲T in the type name or encode the target SIMD instruction set in the type name.6

3.5.1 boost.simd

The Boost.SIMD library7 abstracts SIMD objects via the pack<T, N> class template [18, 17]. N may be omitted to compile for the native 𝒲T of the target. This makes pack<T> almost equal to Vc::Vector<T>, except for the ABI incompatibility issue of pack<T> discussed in Chapter 6. The Vc mask type is available in Boost.SIMD as pack<logical<T>, N>, thus tying the mask API to the value-vector API, which I specifically chose differently for the design of Vc (Section 5.2). Comparison operators of pack<T> return bool instead of the mask type, requiring predicate functions for vectorized compares instead. The implementation of write-masking is not as generic as in the Vc API: Boost.SIMD provides the if_else function that implements a vector blend, and special functions like selinc and seldec to execute write-masked increment and decrement. The Vc API builds these expressions with standard C++ operators instead (Section 5.3).
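The contrast between the two write-masking styles can be sketched with a toy 4-wide type. This is an illustration only: the class names, the add helper, and the where helper below are hypothetical stand-ins, not the actual Boost.SIMD or Vc API.

```cpp
#include <array>
#include <cstddef>

// Toy 4-wide vector and mask types, standing in for pack<T> / Vector<T>.
struct vec4 {
    std::array<float, 4> d{};
    float& operator[](std::size_t i) { return d[i]; }
    float operator[](std::size_t i) const { return d[i]; }
};
struct mask4 {
    std::array<bool, 4> d{};
};

// Blend-function style, as in Boost.SIMD's if_else: the blend is a function,
// and every write-masked operation needs its own named function.
vec4 if_else(const mask4& m, const vec4& a, const vec4& b) {
    vec4 r;
    for (std::size_t i = 0; i < 4; ++i) r[i] = m.d[i] ? a[i] : b[i];
    return r;
}
vec4 add(const vec4& a, float s) {
    vec4 r;
    for (std::size_t i = 0; i < 4; ++i) r[i] = a[i] + s;
    return r;
}

// Operator style, the direction Vc takes: a write-masked view of the vector
// on which the ordinary compound-assignment operators apply.
struct masked_ref {
    vec4& v;
    const mask4& m;
    masked_ref& operator+=(float s) {
        for (std::size_t i = 0; i < 4; ++i)
            if (m.d[i]) v[i] += s;
        return *this;
    }
};
masked_ref where(const mask4& m, vec4& v) { return {v, m}; }
```

With this sketch, `where(m, v) += 1.f;` expresses a write-masked increment with a standard operator, whereas the blend style needs `if_else(m, add(v, 1.f), v)` or a dedicated function per operation.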

In contrast to the Vc operators, Boost.SIMD uses expression templates for the operators of pack. This enables the library to do code transformations on longer expressions, such as fusing multiplications and additions/subtractions or reordering elementwise operations. Vc instead relies on the compiler optimizer to do these kinds of optimizations, because expression templates can increase compile times significantly and make diagnostic output from ill-formed programs unreadable.
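The expression-template mechanism can be illustrated with a minimal sketch (not Boost.SIMD's actual implementation): operators return lazy nodes, and the assignment operator evaluates the whole tree in one loop, which is where a library could apply transformations.

```cpp
#include <array>
#include <cstddef>

// Minimal expression-template sketch for a toy 4-wide float vector: operator+
// returns a lazy node instead of a result, so the whole expression tree is
// visible to the library when the assignment finally evaluates it.
struct evec;

template <class L, class R>
struct add_expr {
    const L& l;
    const R& r;
    float operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct evec {
    std::array<float, 4> d{};
    float operator[](std::size_t i) const { return d[i]; }
    template <class L, class R>
    evec& operator=(const add_expr<L, R>& e) {
        // One fused loop over the whole tree -- the point where a real
        // library could also pattern-match, e.g., a multiply-add.
        for (std::size_t i = 0; i < 4; ++i) d[i] = e[i];
        return *this;
    }
};

add_expr<evec, evec> operator+(const evec& l, const evec& r) { return {l, r}; }
template <class L, class R>
add_expr<add_expr<L, R>, evec> operator+(const add_expr<L, R>& l,
                                         const evec& r) {
    return {l, r};
}
// The temporary-capture gotcha mentioned below: `auto t = a + b + c;` would
// store a reference to a temporary node and dangle -- unlike plain value
// types, where capturing a subexpression in a variable is always safe.
```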

Also, from experience with Vc, a good optimizing compiler nowadays generates optimal machine code in almost all uses. The few remaining issues are solvable "missed-optimization" issues in the compiler. The important benefit of the Vc approach is that the vectorization quality does not depend on how many temporary values are captured in variables, which is a major "gotcha" with expression-template solutions.

6 Vc also encodes the SIMD instruction set in the type name, but it is an internal type name. The user-visible types are target-agnostic.

7 Boost.SIMD is not in Boost yet. This is the intention of the developers, though.

3.5.2 other libraries

The VCL Library [25] and the Generic SIMD Library [83] are two more implementations of C++ wrapper libraries around SIMD intrinsics. They are competing implementations of the ideas presented here and in earlier publications such as Kretz [55] and Kretz et al. [59].

Part II

Vc: A C++ Library for Explicit Vectorization

4

A DATA-PARALLEL TYPE

Programs must be written for people to read, and only incidentally for machines to execute.

— Harold Abelson et al. (1996)

The SIMD vector class shall be an abstraction for the expression of data-parallel operations (cf. Section 3.1). If the target architecture of a compilation unit does not support SIMD instructions, but similar data-parallel execution, the expressed data-parallelism shall be translated accordingly. The following list states the desired properties for such a type:

• The value of an object of Vector<T> consists of 𝒲T scalar values of type T.

• The sizeof and alignof of Vector<T> objects are target-dependent.

• Scalar entries of a SIMD vector can be accessed via lvalue reference.

• The number of scalar entries (𝒲T) is accessible as a constant expression.

• Operators that can be applied to T can be applied to Vector<T> with the same semantics per entry of the vector. (With exceptions, if type conversions are involved. See below.)

• The result of each scalar value of an operation on Vector<T> does not depend on 𝒲T.1

• The syntax and semantics of the fundamental arithmetic types translate directly to the Vector<T> types. There is an additional constraint for implicit type conversions, though: Vector<T> does not implicitly convert to Vector<U> if 𝒲T ≠ 𝒲U for any conceivable target system.

1 Obviously the number of scalar operations executed depends on 𝒲T. However, the resulting value of each scalar operation that is part of the operation on Vector<T> is independent.


 1 namespace Vc {
 2 namespace target_dependent {
 3 template <typename T> class Vector {
 4   implementation_defined data;
 5
 6  public:
 7   typedef implementation_defined VectorType;
 8   typedef T EntryType;
 9   typedef implementation_defined EntryReference;
10   typedef Mask<T> MaskType;
11
12   static constexpr size_t MemoryAlignment = implementation_defined;
13   static constexpr size_t size() { return implementation_defined; }
14   static Vector IndexesFromZero();
15
16   // ... (see the following Listings)
17 };
18 template <typename T> constexpr size_t Vector<T>::MemoryAlignment;
19
20 typedef Vector<         float> float_v;
21 typedef Vector<        double> double_v;
22 typedef Vector<  signed   int> int_v;
23 typedef Vector<unsigned   int> uint_v;
24 typedef Vector<  signed short> short_v;
25 typedef Vector<unsigned short> ushort_v;
26 } // namespace target_dependent
27 } // namespace Vc

Listing 4.1: Template class definition for Vector<T>.

A concrete implementation for SSE2 could call the inner namespace SSE, use the intrinsic types __m128, __m128d, and __m128i for VectorType, a union for the data member, T & for EntryReference, 𝒮VectorType ∕ 𝒮T for size(), and 𝒮VectorType for MemoryAlignment.
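Such a specialization could look like the following sketch. A 16-byte aligned struct stands in for the SSE2 intrinsic type __m128 so the example stays target-independent; the class and member names are otherwise taken from Listing 4.1, but this is an illustration, not the Vc implementation.

```cpp
#include <cstddef>

// A 16-byte aligned struct stands in for the SSE2 intrinsic type __m128 so
// the sketch compiles on any target.
struct alignas(16) sse_register_stand_in { float entries[4]; };

template <typename T> struct Vector;

// Hypothetical SSE2 specialization following the rules above: size() is
// sizeof(VectorType) / sizeof(T), and MemoryAlignment is sizeof(VectorType).
template <> struct Vector<float> {
    using VectorType = sse_register_stand_in;
    using EntryType = float;
    VectorType data;
    static constexpr std::size_t MemoryAlignment = sizeof(VectorType);
    static constexpr std::size_t size() {
        return sizeof(VectorType) / sizeof(EntryType);
    }
};
```

For a 16-byte register holding floats this yields size() == 4 and MemoryAlignment == 16, and the single data member makes sizeof and alignof of Vector<float> equal to those of the register type.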

• The compiler is able to identify optimization opportunities and may apply constant propagation, dead code elimination, common subexpression elimination, and all other optimization passes that equally apply to scalar operations.2

4.1 THE VECTOR<T> CLASS TEMPLATE

The boilerplate of the SIMD vector class interface is shown in Listing 4.1. There are several places in this listing where the declaration says "target-dependent" or "implementation-defined". All of the following listings, which declare functions of the Vector<T> class (to insert at line 16), do not require any further implementation-specific differences. All these differences in Vector<T> are fully captured by the code shown in Listing 4.1.

2 In practice, there are still a few opportunities for compilers to improve optimization of SIMD operations.

The only data member of the vector class is of an implementation-defined type (line 4). This member therefore determines the size and alignment of Vector<T>. Consequently, the SIMD classes may not contain virtual functions: otherwise a virtual table would be required and objects of this type would be considerably larger (larger by the minimum of the pointer size and the alignment of VectorType).

4.1.1 member types

The member types of Vector<T> abstract possible differences between implementations and ease writing generic code for the SIMD vector types.

VectorType

(line 7) is the internal type for implementing the vector class. This type could be an intrinsic or builtin type. The exact type that will be used here depends on the compiler and compiler flags, which determine the target instruction set. Additionally, if an intrinsic type is used it might not be used directly (on line 4) but indirectly via a wrapper class that implements compiler-specific methods to access scalar entries of the vector.

The VectorType type allows users to build target- and implementation-specific extensions on top of the predefined functionality. This requires a function that returns an lvalue reference to the internal data (line 4). See Section 4.10 for such functions.

EntryType

(line 8) is always an alias for the template parameter T. It is the logical type of the scalar entries in the SIMD vector. The actual bit-representation in the SIMD vector register may differ from EntryType, as long as the observable behavior of the scalar entries in the object follows the same semantics.

EntryReference

(line 9) is the type returned from the non-const subscript operator. This type should be an lvalue reference to one scalar entry of the SIMD vector. It is not required for EntryReference to be the same as EntryType &. Consider an implementation that uses 32-bit integer SIMD registers for Vector<short>, even though a short uses only 16 bits on the same target. Then EntryReference has to be an lvalue reference to int. If EntryReference were declared as short & then sign extension to the upper 16 bits would not work correctly on assignment.
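The sign-extension argument can be illustrated with a toy stand-in for such a Vector<short> (the names here are hypothetical): each logical short occupies a full 32-bit lane, and the subscript operator hands out an int32_t reference so that assignments rewrite all 32 bits.

```cpp
#include <cstdint>

// Hypothetical Vector<short> stand-in that keeps each entry in a 32-bit lane.
// EntryReference is int32_t&, not short&: writing through the reference
// updates the whole lane, so negative values are stored sign-extended.
struct short_vector_stand_in {
    using EntryType = short;
    using EntryReference = std::int32_t&;
    std::int32_t lanes[8] = {};  // 8 logical shorts, each widened to 32 bits

    EntryReference operator[](int i) { return lanes[i]; }
    EntryType value(int i) const { return static_cast<EntryType>(lanes[i]); }
};
```

Had operator[] returned a short & aliasing the low half of the lane instead, assigning -1 would leave the upper 16 bits stale and the lane would no longer hold the sign-extended value.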

MaskType

(line 10) is the mask type that is analogous to bool for scalar types. The type is used in functions that have masked overloads and as the return type of compare operators. A detailed discussion of the class for this type is presented in Chapter 5.

4.1.2 constants

size() The vector class provides a static member function (size()) which identifies the number of scalar entries in the SIMD vector (line 13). This value is determined by the target architecture and therefore known at compile time. By declaring the size() function constexpr, the value is usable in contexts where constant expressions are required. This enables template specialization on the number of SIMD vector entries in user code. It also enables the compiler to optimize generic code that depends on the SIMD vector size more effectively. The size() function additionally makes Vector<T> implement the standard container interface, and thus increases its reusability in generic code.3
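What a constexpr size() enables in user code can be sketched as follows. The float_v here is a toy stand-in with an arbitrary width of 4, and the simd_buffer and width_tag helpers are illustrative, not part of Vc.

```cpp
#include <array>
#include <cstddef>
#include <type_traits>

// Toy vector type whose width is a constant expression via size(), mirroring
// line 13 of Listing 4.1 (the width 4 is an arbitrary stand-in).
struct float_v {
    using EntryType = float;
    static constexpr std::size_t size() { return 4; }
    EntryType data[4] = {};
};

// Because size() is a constant expression it can size a std::array in
// generic code ...
template <typename V>
using simd_buffer = std::array<typename V::EntryType, V::size()>;

// ... and drive template specialization on the number of vector entries.
template <std::size_t W> struct width_tag {};
template <typename V> using width_of = width_tag<V::size()>;
```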

MemoryAlignment The MemoryAlignment static data member defines the alignment requirement for a pointer passed to an aligned load or store function call of Vector<T>. The need for a MemoryAlignment static data member might be surprising at first. In most cases the alignment of Vector<T> will be equal to MemoryAlignment. However, as discussed in Section 4.1.1, implementations are free to use a SIMD register with a different representation of the scalar entries than EntryType. In such a case, the alignment requirements for Vector<T> will be higher than 𝒲T × 𝒮T for an aligned load or store. Note that the load and store functions allow converting loads (Section 4.3.1). These functions need a pointer to memory of a type different from EntryType. Subsequently, the alignment requirements for these pointers can be different. Starting with C++14 it may therefore be a good idea to declare MemoryAlignment as:

template <typename U>
static constexpr size_t MemoryAlignment = implementation_defined;

IndexesFromZero() The IndexesFromZero() function (line 14) returns a Vector<T> object where the entries are initialized to the successive values {0, 1, 2, 3, 4, …}. This constant is useful in many situations where the different SIMD lanes need to access different offsets in memory, or to generate an arbitrary uniform offset vector with just a single multiplication.
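The offset-vector idiom can be sketched with a toy 4-wide integer vector (an illustration, not the Vc implementation): one multiplication by a stride turns the index constant into the memory offsets of one element per row.

```cpp
#include <array>
#include <cstddef>

// Toy 4-wide integer vector with IndexesFromZero(), plus the idiom from the
// text: scaling {0, 1, 2, 3} by a row stride yields per-lane memory offsets.
struct int_v {
    std::array<int, 4> d{};
    static int_v IndexesFromZero() {
        int_v r;
        for (std::size_t i = 0; i < 4; ++i) r.d[i] = static_cast<int>(i);
        return r;
    }
    friend int_v operator*(const int_v& a, int s) {
        int_v r;
        for (std::size_t i = 0; i < 4; ++i) r.d[i] = a.d[i] * s;
        return r;
    }
    int operator[](std::size_t i) const { return d[i]; }
};
```

With a row stride of 16 elements, int_v::IndexesFromZero() * 16 produces the offsets {0, 16, 32, 48}, i.e., one element from each of four consecutive rows.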

3 It would suffice to define only the size() function and drop the Size constant. Personally, I prefer to not use function calls in constant expressions. Additionally, a 50% difference in the number of characters makes the shorter name preferable because it is such a basic part of using SIMD types.