
float_v f(float_v x) {
  float_v r;
  for (size_t i = 0; i < float_v::size(); i += double_v::size()) {
    r[{i, double_v::size()}] = static_cast<float_v>(
        g(static_cast<double_v>(x[{i, double_v::size()}])));
  }
  return r;
}

Listing 4.6: Syntax idea for implementing Listing 4.5 via the subscript operator. The argument to the subscript operator consists of a start offset and a length, thus allowing arbitrary partial assignment/extraction. Since the subscript operator only supports a single parameter, the argument must be converted from an initializer list.

multiple Vector<T> to one Vector<U> or from one Vector<U> to multiple Vector<T>.6 An even more convenient solution builds upon these casts and the Vector<T> type to create a SimdArray<T, N> abstraction for composing multiple Vector<T> into one type (cf. Chapter 7).
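As a brief preview (the full interface is the subject of Chapter 7), such a type could be used as a fixed-width counterpart to Vector<T>; the alias name and the width 8 below are arbitrary choices for illustration, and the sketch assumes SimdArray supports the same element-wise arithmetic as Vector<T>:

#include <Vc/Vc>

// Sketch only: composes as many float_v objects as needed to hold 8 floats.
using float8 = Vc::SimdArray<float, 8>;

float8 twice(const float8 &x) {
  return x + x;  // element-wise addition, as for Vector<T>
}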


// load member functions
void load(const EntryType *mem);
template <typename Flags> void load(const EntryType *mem, Flags);
template <typename U, typename Flags = UnalignedT>
void load(const U *mem, Flags = Flags());

// load constructors
explicit Vector(const EntryType *mem);
template <typename Flags>
explicit Vector(const EntryType *mem, Flags flags);
template <typename U, typename Flags = UnalignedT>
explicit Vector(const U *mem, Flags flags = Flags());

// store functions
void store(EntryType *mem) const;
void store(EntryType *mem, MaskType mask) const;
template <typename Flags>
void store(EntryType *mem, Flags flags) const;
template <typename Flags>
void store(EntryType *mem, MaskType mask, Flags flags) const;
template <typename U, typename Flags = UnalignedT>
void store(U *mem, Flags = Flags()) const;
template <typename U, typename Flags = UnalignedT>
void store(U *mem, MaskType mask, Flags = Flags()) const;

Listing 4.7: Declaration of the load and store functions.
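As a usage sketch of these declarations, copying one SIMD vector's worth of elements could be written as follows; the helper function name is only illustrative:

#include <Vc/Vc>
using Vc::float_v;

// Copies float_v::size() floats from 'in' to 'out'.
void copy_one_vector(const float *in, float *out) {
  float_v v(in);  // load constructor; unaligned access by default
  v.store(out);   // store member function; unaligned access by default
}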

void Vector<T>::load(const U *mem) {
  for (size_t i = 0; i < size(); ++i) {
    (*this)[i] = static_cast<T>(mem[i]);
  }
}
void Vector<T>::store(U *mem) const {
  for (size_t i = 0; i < size(); ++i) {
    mem[i] = static_cast<U>((*this)[i]);
  }
}

Listing 4.8: The semantics of a converting load. A concrete implementation will use vector loads and conversions, of course.

the required load or store operation. (The case of distributed scalars is handled by gather and scatter functions, which are described in Section 4.8.)

4.3.1 converting loads and stores

Some SIMD hardware can convert between different data types without extra runtime overhead when executing a load or store instruction [46]. Therefore, and because it is very convenient for writing portable conversion code, the load & store functions provide a generic variant that can access arrays of different scalar types.

Semantically, these functions behave as described in Listing 4.8. Thus, 𝒲_T values of type U are converted with the load/store functions in Vector<T>, independent of 𝒲_U, in contrast to the converting constructor in Section 4.2.2.

Not all conversions are equally efficient in terms of hardware support. However, for reasons of portability, the full set of conversions between fundamental arithmetic types is made available through these functions.
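For illustration, a converting load and store might be used as follows; the function name and the scaling factor are arbitrary, and the pointers are assumed to reference at least float_v::size() elements:

#include <Vc/Vc>
using Vc::float_v;

// Reads float_v::size() unsigned shorts, converts them to float, scales them,
// and writes them back with a converting store (float -> unsigned short).
void scale_one_vector(const unsigned short *in, unsigned short *out) {
  float_v v(in);  // converting load via the generic load constructor
  v *= 0.5f;
  v.store(out);   // converting store via the generic store function
}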

4.3.2 load/store flags

SIMD hardware distinguishes between aligned and unaligned vector loads and stores (cf. Section 1.3.1). Additionally, most algorithms can be optimized if the developer can hint at the temporal usage of the data.7 The alignment can, in theory, be determined from the start address and thus would not require additional specification in the function call. However, since the alignment can only be determined from the pointer value at runtime, such a check would incur a penalty.

Using unaligned load/store instructions unconditionally would be more efficient than checking the alignment of the pointer in software, since the hardware can perform the alignment test much more efficiently as part of an unaligned load/store instruction. Therefore, by default, the load/store functions translate to unaligned load/store instructions.

4.3.2.1 alignment

If the user can guarantee alignment, a tag type can be passed as the last argument to select the optimized load/store instructions at compile time, without any runtime overhead. It is important that the API is built via a template and a tag type, rather than a boolean (or enum) function argument. A boolean function argument cannot guarantee compile-time selection; in particular, such an API would allow passing a non-constant expression as the flag, which cannot be optimized at all.

Via the tag type, the user of the API is required to provide a constant expression and thus to decide between aligned and unaligned memory access at the time the code is written.
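A minimal sketch of the resulting code (the helper name is illustrative; the caller must ensure that q satisfies the alignment requirement of float_v):

#include <Vc/Vc>
using Vc::float_v;

float_v add_two(const float *p, const float *q) {
  float_v a(p);               // default: unaligned load instruction
  float_v b(q, Vc::Aligned);  // aligned load selected at compile time via the tag
  return a + b;
}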

4.3.2.2 non-temporal access

Loads and stores can be further optimized for non-temporal accesses. Many data-parallel algorithms use a streaming pattern, where the input and/or output memory locations are used only once. Therefore, this data should not evict other data, which might be used repeatedly in the algorithm, from the CPU caches. The load/store functions in Vc can therefore be called with the Vc::Streaming tag type.

This tag hints to the Vector<T> implementation that the data can be moved past the caches. (Most CPUs can use specialized load and store buffers for streaming loads and stores.) If the requested load/store operation cannot be executed as a streaming variant, the implementation will silently fall back to a non-streaming variant.
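A streaming copy loop could thus look as follows (a sketch; it assumes both arrays are aligned, their length is a multiple of float_v::size(), and that the Aligned and Streaming tags may be combined with operator|):

#include <Vc/Vc>
#include <cstddef>
using Vc::float_v;

// Both input and output are touched exactly once, so loads and stores hint
// that the data may bypass the caches.
void stream_copy(const float *in, float *out, std::size_t n) {
  for (std::size_t i = 0; i < n; i += float_v::size()) {
    float_v v(in + i, Vc::Aligned | Vc::Streaming);
    v.store(out + i, Vc::Aligned | Vc::Streaming);
  }
}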

7 Non-temporal load/store hints tell the CPU that the given memory addresses are referenced only once and memory transfers may bypass the cache hierarchy.

Streaming stores executed with the Vc::Streaming tag may use non-globally ordered stores if the target CPU supports this. Thus, two stores to the same memory location, where at least one is a streaming store, have undefined behavior unless a memory fencing operation is executed between the stores. This allows reaching the highest store throughput, but requires a good understanding of the implications when used by a developer.

4.3.2.3 prefetching

The last flag that I implemented for the load/store functions makes prefetching in loops significantly simpler. By adding the Vc::PrefetchDefault tag type, the Vector implementation is asked to emit software prefetch instructions with a target-dependent, predefined stride. Thus, a call to

float_v(memory, Vc::Aligned | Vc::PrefetchDefault)

may result in up to three instructions being emitted, one of which is the load instruction. In addition, prefetch instructions for the lowest-level and the second-lowest-level cache may be emitted. These prefetches use a predefined offset relative to the memory address that is passed to the load function.

The prefetch flag is therefore a shorthand for explicit prefetching in many loops. However, not all loops require the same prefetch stride lengths, which is why the user may also set the strides explicitly instead of relying on the predefined ones. In almost all cases, a developer adds prefetches only after the program or component is already working and is being modified purely for speed optimization. The developer then determines the prefetch strides through intuition and/or trial and error.
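With the default stride, a reduction loop with prefetch hints could look as follows (a sketch; it assumes an aligned input whose length is a multiple of float_v::size()):

#include <Vc/Vc>
#include <cstddef>
using Vc::float_v;

// Each load may additionally emit prefetch instructions for addresses a
// predefined, target-dependent stride ahead of in + i.
float sum(const float *in, std::size_t n) {
  float_v acc = float_v::Zero();
  for (std::size_t i = 0; i < n; i += float_v::size()) {
    acc += float_v(in + i, Vc::Aligned | Vc::PrefetchDefault);
  }
  return acc.sum();
}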

Note that prefetches only need to be issued once for any address inside one cache line. Thus, two subsequent loads/stores to neighboring SIMD vectors may result in more software prefetch instructions than necessary. This depends on the ratio of the cache line size to the vector register size. Since this ratio is target-dependent, the API appears to introduce a portability issue in this case. There is no easy solution from the load/store interface side. However, the compiler is, in theory, able to drop the superfluous prefetch instructions.8

8 This is possible if the compiler considers the relative difference between prefetch addresses: it could keep the first prefetch call and drop every subsequent prefetch call that would reference the same cache line as a previous one.

Vector &operator++();
Vector operator++(int);
Vector &operator--();
Vector operator--(int);

MaskType operator!() const;
Vector operator~() const;
Vector operator+() const;
Vector operator-() const;

Listing 4.9: Declaration of unary operators.
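A short usage sketch of these operators (the function name is illustrative; operator~ is shown on an integer vector, since a bitwise complement of floating-point values is not meaningful):

#include <Vc/Vc>
using Vc::float_v;
using Vc::int_v;

void unary_examples(float_v x, int_v i) {
  float_v neg = -x;          // element-wise negation
  float_v pos = +x;          // identity
  float_v::MaskType m = !x;  // mask that is true where the element is zero
  int_v inv = ~i;            // element-wise bitwise complement
  ++i;                       // pre-increment: adds 1 to every element
  (void)neg; (void)pos; (void)m; (void)inv;
}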