
each thread grabs a chunk of iterations from a queue, as soon as it has finished the previous work, until all iterations have been handled.
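This queue-based work sharing can be sketched in isolation. The following minimal example (the function name, chunk size, and loop body are our own stand-ins, not taken from the track finder) shows the schedule(dynamic) clause handing out chunks of iterations to idle threads:

```cpp
#include <cmath>

#ifdef _OPENMP
#include <omp.h>
#endif

/* Each thread repeatedly grabs a chunk of 64 iterations from the
   runtime's internal queue as soon as it finishes its previous chunk,
   so threads that complete cheap iterations early move on to new work. */
double sumOfSquareRoots(int n) {
  double sum = 0.0;
  #pragma omp parallel for schedule(dynamic, 64) reduction(+ : sum)
  for (int i = 0; i < n; ++i) {
    sum += std::sqrt((double)i); /* iteration cost may vary in general */
  }
  return sum;
}
```

Without OpenMP the pragma is ignored and the loop runs sequentially with the same result, which makes the scheme easy to validate.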

int j = 2;

/* Up-Sweep Phase */
for (int i = 1; i < N_values; i = j) {
  j = i << 1;
  #pragma omp parallel for schedule(dynamic)
  for (int m = j - 1; m < N_values; m += j) {
    array[m] = array[m - i] + array[m];
  }
}

/* Down-Sweep Phase */
for (int i = j >> 1; i > 1; i = j) {
  j = i >> 1;
  #pragma omp parallel for schedule(dynamic)
  for (int m = i - 1; m < N_values - j; m += i) {
    array[m + j] = array[m] + array[m + j];
  }
}

Listing 6.2: The parallel implementation of prefix scan algorithm with OpenMP.
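The two phases of Listing 6.2 can be checked against a straightforward sequential scan. The following self-contained sketch (identifiers are ours) applies the same up-sweep/down-sweep loops to a small array; the OpenMP pragmas are omitted, since they do not change the result:

```cpp
#include <vector>

/* In-place inclusive prefix scan using the up-sweep/down-sweep
   scheme of Listing 6.2 (sequential form for verification). */
void prefixScan(std::vector<int> &array) {
  const int N_values = (int)array.size();
  int j = 2;
  /* Up-Sweep Phase: accumulate partial sums over power-of-two strides */
  for (int i = 1; i < N_values; i = j) {
    j = i << 1;
    for (int m = j - 1; m < N_values; m += j)
      array[m] = array[m - i] + array[m];
  }
  /* Down-Sweep Phase: propagate the partial sums back down */
  for (int i = j >> 1; i > 1; i = j) {
    j = i >> 1;
    for (int m = i - 1; m < N_values - j; m += i)
      array[m + j] = array[m] + array[m + j];
  }
}

/* Reference: trivial sequential inclusive scan */
std::vector<int> sequentialScan(std::vector<int> a) {
  for (std::size_t k = 1; k < a.size(); ++k) a[k] += a[k - 1];
  return a;
}
```

For the input {1,...,8} both versions produce {1, 3, 6, 10, 15, 21, 28, 36}.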

Similar to the initialization stage, in the final stage the algorithm processes hits in parallel and removes used hits from consideration for the subsequent stages. In addition, the grid structure needs to be updated accordingly.

groups of starting hits, formed during the initialization stage.

This stage of the CA track finder algorithm is intrinsically parallel and works locally with respect to the data. For this reason, there was no need to drastically redesign this part. However, several crucial optimizations had to be made in order to expose the parallelism that was hidden by technical issues.

One particular example of memory usage that is inappropriate for a parallel run is the way triplets were stored in memory in the original algorithm version. Each constructed triplet was saved by allocating memory for the new object and afterwards storing this object into the array of triplets by calling the push_back() function of the standard C++ vector:

L1Triplet triplet;
triplet.Set( ihitl, ihitm, ihitr,
             istal, istam, istar,
             0, qp, chi2 ); /* Set triplet parameters */
vTriplets.push_back(triplet); /* Store the triplet to array */

Listing 6.3: The sequential implementation of storing the constructed triplets.

While this is a correct and efficient procedure in the case of a sequential run, in a parallel implementation this piece of code works neither correctly nor efficiently.

The first reason for wrong results is that several threads may try to store triplets into the same array at once. A parallel implementation requires so-called thread-safe execution; here this means providing each thread with a separate array to store its data, so that the threads do not conflict with each other.

L1Triplet & tr = TripletsLocal[omp_get_thread_num()][triplet_num++];
tr.Set( ihitl, ihitm, ihitr,
        istal, istam, istar,
        0, qp, chi2 ); /* Set triplet parameters */

Listing 6.4: The parallel implementation of storing the constructed triplets.

The second reason, which makes the parallel execution of the example inefficient, is that the array of triplets will grow gradually, and at some point there will not be enough space to store all the array elements in the initial location. In this case the CPU will have to move the whole array to a new memory location. This time-consuming procedure will happen many times throughout the program execution and will ruin the algorithm's potential for parallelism.

In order to solve this issue, the memory for the constructed triplets must be allocated in advance, similar to the way it was done for the input hit information. The proposed solution is shown in List. 6.4. This way the CPU does not have to repeatedly relocate the data, since from the beginning it has enough memory to save all the triplets to be built during the algorithm execution. This issue was solved for several arrays used to store the output information during the triplet building stage.
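The combination of per-thread buffers and up-front allocation can be sketched as follows. This is a simplified stand-in, not the actual track finder code: the triplet type, candidate count, and names are our own, and the upper bound on the number of triplets is assumed to be known in advance:

```cpp
#include <cstddef>
#include <vector>

#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_thread_num()  { return 0; }
static int omp_get_max_threads() { return 1; }
#endif

struct Triplet { int l, m, r; }; /* simplified stand-in for L1Triplet */

/* One preallocated buffer per thread: no push_back, no reallocation,
   and no synchronization between threads. */
std::vector<std::vector<Triplet>> buildTriplets(int nCandidates) {
  std::vector<std::vector<Triplet>> tripletsLocal(omp_get_max_threads());
  for (auto &buf : tripletsLocal)
    buf.resize(nCandidates);       /* upper bound allocated in advance */
  std::vector<std::size_t> count(tripletsLocal.size(), 0);

  #pragma omp parallel for
  for (int c = 0; c < nCandidates; ++c) {
    const int t = omp_get_thread_num();
    Triplet &tr = tripletsLocal[t][count[t]++]; /* thread-private slot */
    tr.l = c; tr.m = c + 1; tr.r = c + 2;       /* dummy "construction" */
  }
  for (std::size_t t = 0; t < tripletsLocal.size(); ++t)
    tripletsLocal[t].resize(count[t]); /* keep only the filled slots */
  return tripletsLocal;
}
```

The final resize only shrinks each buffer, so no element is ever copied to a new memory location during the parallel loop.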

One more optimization was introduced to the triplet building stage in order to reduce the number of memory accesses. In the initial event-based track finder version, a special function performed the task of finding and storing the neighboring relations between the constructed triplets. This was done in a loop over the array of triplets, after all of them had been built. Since the triplets are built in a loop over the stations of the starting hits, one can notice that at the point when the algorithm builds a triplet on a certain station N, its potential neighboring triplets, which start on the station (N − 1), have already been built. Thus, all the information needed to define the neighboring relations between triplets is available at the point where the algorithm accepts a certain triplet. The idea was to shift this task into the triplet building stage, so that it is done on the fly in the triplet building loop. This way one gets rid of an additional loop over all constructed triplets.

By definition, neighboring triplets are those that share two hits in common and whose momenta coincide within certain errors. The following scheme was chosen in order to implement the search for neighbors on the fly while building triplets. A special array with one entry per hit was introduced (List. 6.5). A link to each approved triplet is stored in this array according to its starting hit. With this structure, the only thing one needs to do in order to obtain all the neighbors of a certain triplet T is to take the triplets starting with the middle hit of T and check whether their middle hit also coincides with the right hit of T. In addition, the momenta of both triplets should coincide with each other within the estimated errors.

L1Triplet & T = TripletsLocal[omp_get_thread_num()][triplet_num++];
TripletsStartWithHit[T.GetLHit()].push_back(&T); /* Store the link to the
                              triplet for the corresponding starting hit */

for ( int i = 0; i < TripletsStartWithHit[T.GetMHit()].size(); ++i ) {

  L1Triplet* &Neighbour = TripletsStartWithHit[T.GetMHit()][i];
  const fscal &qp2 = Neighbour->GetQp();
  fscal &Cqp2 = Neighbour->Cqp;

  if ( Neighbour->GetMHit() != T.GetRHit() ) continue; /* Check for the 2nd common hit */
  if ( fabs(T.qp - qp2) > (T.Cqp + Cqp2) ) continue;   /* Check for momenta to coincide within errors */
  T.neighbours.push_back(Neighbour);
}

Listing 6.5: The implementation of defining neighboring triplets on the fly.
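The hit-indexed lookup of Listing 6.5 can be exercised in isolation. The following sketch uses simplified types and our own names: it registers each triplet under its left hit and immediately searches for neighbors among the triplets starting at its middle hit, applying the same two cuts as the listing:

```cpp
#include <cmath>
#include <vector>

struct Triplet {
  int lHit, mHit, rHit;            /* hit indices on three stations */
  float qp, Cqp;                   /* momentum estimate and its error */
  std::vector<const Triplet*> neighbours;
};

/* Register a triplet under its left (starting) hit and look up its
   neighbours among the triplets that start at its middle hit. */
void addAndLink(Triplet &T,
                std::vector<std::vector<Triplet*>> &startWithHit) {
  startWithHit[T.lHit].push_back(&T);
  for (Triplet *cand : startWithHit[T.mHit]) {
    if (cand->mHit != T.rHit) continue;               /* 2nd common hit */
    if (std::fabs(T.qp - cand->qp) > T.Cqp + cand->Cqp) continue;
    T.neighbours.push_back(cand);                     /* both cuts passed */
  }
}
```

With five hits 0..4, a triplet (1,2,3) registered first is found as the unique neighbor of a later triplet (0,1,2) whose momentum agrees within errors, while no extra loop over all triplets is needed.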

With the above-mentioned modifications the triplet building stage can be run in parallel by a number of independent threads with no synchronization needed. As output, each thread provides an array of constructed triplets together with their neighboring relations.
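If a single contiguous array of triplets is needed by the subsequent stages, the per-thread outputs can be concatenated in one pass. A minimal sketch, with our own names and assuming the per-thread buffers are already filled:

```cpp
#include <cstddef>
#include <vector>

/* Concatenate per-thread buffers into one array. Computing the total
   size first allows the destination to be reserved in advance, so each
   element is copied exactly once and no reallocation occurs. */
template <typename T>
std::vector<T> mergeThreadLocal(const std::vector<std::vector<T>> &local) {
  std::size_t total = 0;
  for (const auto &buf : local) total += buf.size();
  std::vector<T> merged;
  merged.reserve(total);
  for (const auto &buf : local)
    merged.insert(merged.end(), buf.begin(), buf.end());
  return merged;
}
```

Since the per-thread buffers are processed in thread-index order, the merged array is deterministic regardless of how the threads were scheduled.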