
6. Application: Gravity Field Determination from Observations of the GOCE Mission

6.3 Gradiometry NEQ Assembly in a HPC Environment

Algorithm 6.3: Setup of the regularization matrix for the higher degree coefficients.

Data:
    θ0                  opening angle of the polar gap in radians
    ppp                 symbolic numbering scheme for the result matrix
    lstart_highdegree   start degree of the regularization

// Loop over all parameters
for i = 0 to ppp.size() do
    Nhighdeg(i, i) = 0
    nhighdeg(i) = 0
    // if coefficient i is a spherical harmonic coefficient (sine or cosine)
    if (ppp(i).type() == SH_C) || (ppp(i).type() == SH_S) then
        l = ppp(i).l()
        m = ppp(i).m()
        // if the degree is above the start degree and the order is not affected by the polar gap
        if (l >= lstart_highdegree) && (m >= θ0 · l) then
            Nhighdeg(i, i) = 10^10 · l^4
        end
    end
end
return Nhighdeg, nhighdeg   // regularization matrix and zero RHS

(i) the observations have to be distributed along the compute core grid, (ii) the design matrix for the second derivative of the potential in the GRF has to be set up as a block-cyclic distributed matrix, (iii) the design matrix and the observation vector have to be filtered, i.e. decorrelated, and (iv) finally the update of the NEQs, i.e. the computation of \bar{A}^T_{g,s} \bar{A}_{g,s}, has to be performed.

For simplicity, only the implementation for a single segment s and a single tensor component g is described. However, the procedure is the same for every segment s and every tensor component g. Thus, within this chapter, the subscripts s and g are omitted in the equations. The objective of this section is to describe the massively parallel implementation of the computation of N_{g,s}, including the four steps mentioned above. As every segment can consist of millions of observations, denoted as M_s, the segments are subdivided into blocks b (b ∈ {0, 1, ..., M_s/b_obs}) of an arbitrary block size b_obs and are processed in a block-wise approach. In every pass, N_{g,s} is updated for b_obs observations.
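To make the block-wise structure explicit, the following C++ sketch shows only the outer loop over the blocks; the four steps are passed in as hypothetical callables (the names distributeObs, setupDesign, filterOeqs and updateNeqs are illustrative and not taken from the text).

#include <cstddef>
#include <functional>

// Structural sketch of the block-wise processing of one segment s and one tensor
// component g: blocks of b_obs observations are processed one after another, and
// the NEQs are updated once per block.
void assembleSegmentBlockwise(std::size_t Ms, std::size_t bobs,
                              const std::function<void(std::size_t)>& distributeObs,  // step (i)
                              const std::function<void(std::size_t)>& setupDesign,    // step (ii)
                              const std::function<void(std::size_t)>& filterOeqs,     // step (iii)
                              const std::function<void(std::size_t)>& updateNeqs) {   // step (iv)
    for (std::size_t b = 0; b * bobs < Ms; ++b) {   // blocks b = 0, ..., Ms/b_obs
        distributeObs(b);  // distribute the observations of block b over the grid rows
        setupDesign(b);    // set up the block-cyclic distributed design matrix
        filterOeqs(b);     // decorrelate design matrix and observation vector
        updateNeqs(b);     // update the NEQs by the filtered block
    }
}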

6.3.1 Distribution of the Observations Along the Compute Core Grid

Within a loop, the design matrix for b_obs observations is set up for all U parameters. Thus, an empty distributed matrix A of dimension b_obs × U is set up in a first step. After that, depending on the dimension of the compute core grid and the parameters of the block-cyclic distribution, on each core the local matrix A^l_{r,c} of dimension R^l_{r,c} × C^l_{r,c} exists. Its dimension, depending on the coordinates of the processor, can be obtained via (3.5). Whereas the parameters are distributed over the columns of the compute core grid, the observations are distributed over the rows of the compute core grid.
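As a small illustration, the local dimensions can be queried with ScaLAPACK's numroc tool routine; whether this coincides exactly with (3.5) is an assumption, and the row and column block sizes b_r and b_c of the block-cyclic distribution are taken as given.

// Sketch: number of locally stored rows and columns of the b_obs x U block-cyclic
// distributed matrix A on process (r,c) of an R x C compute core grid.
extern "C" int numroc_(const int* n, const int* nb, const int* iproc,
                       const int* isrcproc, const int* nprocs);

int localRowCount(int bobs, int br, int r, int R) {   // R^l_{r,c}
    const int firstRow = 0;                           // distribution starts at process row 0
    return numroc_(&bobs, &br, &r, &firstRow, &R);
}

int localColCount(int U, int bc, int c, int C) {      // C^l_{r,c}
    const int firstCol = 0;                           // distribution starts at process column 0
    return numroc_(&U, &bc, &c, &firstCol, &C);
}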

Hence, all local matrices A^l_{r,*} of a certain row r of the compute core grid contain entries for the same observations, but for different subspaces of the parameters. All local matrices A^l_{*,c} along a column c of the compute core grid contain as columns the same subspace of parameters, but for a different subset of the observations. Thus, the observations, i.e. the gravity gradients of component g plus the rotation matrix and the satellite's position, are distributed over the rows of the compute core grid. Every row of the compute core grid processes the same observations but only for a subspace of the parameters. For the chosen decorrelation concept, i.e. the decorrelation by digital filters, it is


important to guarantee a chronological ordering of the observations. As the sparse filtering matrices (cf. Sect. 6.2.2.2) contradict the concept of the block-cyclic distribution, an alternative filtering is implemented (cf. Sect. 6.3.3). This local filtering requires the local part of the observation vector ℓ^l_{r,c} (and the rows of the local part of the design matrix A^l_{r,c}) to be sorted chronologically and to contain equidistant consecutive observations. As the order of the rows of the global matrix A does not matter for the final computation of A^T A, an ordering of the observations in the global matrix views of A and ℓ tailored for the filtering is chosen. The observations are distributed such that chronologically ordered, equidistant observations are produced not by the global view on the matrix A, but by the virtually composed matrices

\hat{A} =
\begin{bmatrix}
A^l_{0,0} & A^l_{0,1} & A^l_{0,2} & \cdots & A^l_{0,C-1} \\
A^l_{1,0} & A^l_{1,1} & A^l_{1,2} & \cdots & A^l_{1,C-1} \\
A^l_{2,0} & A^l_{2,1} & A^l_{2,2} & \cdots & A^l_{2,C-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
A^l_{R-1,0} & A^l_{R-1,1} & A^l_{R-1,2} & \cdots & A^l_{R-1,C-1}
\end{bmatrix}
\quad\text{and}\quad
\hat{\ell} =
\begin{bmatrix}
\ell^l_{0,0} \\ \ell^l_{1,0} \\ \ell^l_{2,0} \\ \vdots \\ \ell^l_{R-1,0}
\end{bmatrix} .    (6.18)

Thus, the filter operations can be performed sequentially row by row of the compute core grid, but with limited communication, as a) the observations in each individual local matrix are equidistant and chronologically sorted and b) all local matrices along a column of the compute core grid are chronologically ordered and equidistant. To achieve this distribution for the observation vector and the meta information (satellite position and orientation in space), b_obs observations are distributed along a column of the compute core grid. Assuming the number of rows of the local matrix to be R^l_{r,c} (computed with (3.5)), the first R^l_{0,*} of the b_obs observations are sent by the master process to the processes of the first processor grid row, the next R^l_{1,*} observations are sent to the processes of the second processor grid row and so on. The last processor row gets the last R^l_{R-1,*} observations of the total b_obs observations. The data are distributed via a sequence of MPI send operations. Afterwards, each process has the observations to be processed, and the local observations on each process are chronologically sorted and equidistant in time. Fig. 6.4 illustrates the distribution with a simple example: a chronologically sorted vector t of dimension 9 × 1 is distributed over a 3 × 1 compute core grid with b_r = 2. The chosen distribution is compared to the standard distribution, i.e. distributing the chronologically sorted OEQs following the block-cyclic distribution.
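A minimal MPI sketch of this distribution step, reduced to a single column of the compute core grid (one MPI rank per processor row), is given below; the function and variable names are illustrative, and the meta information is omitted.

#include <mpi.h>
#include <vector>

// Sketch: the master (rank 0 = processor row 0) keeps the first R^l_{0,*} observations,
// sends the next R^l_{1,*} to processor row 1, and so on, so that every local chunk
// stays chronologically sorted and equidistant in time.
void distributeObservations(const std::vector<double>& obs,      // only valid on rank 0
                            const std::vector<int>& localRows,   // R^l_{r,*} for every row r
                            std::vector<double>& myObs,          // output: local chunk
                            MPI_Comm columnComm) {
    int rank = 0, size = 0;
    MPI_Comm_rank(columnComm, &rank);
    MPI_Comm_size(columnComm, &size);
    if (rank == 0) {
        int offset = 0;
        for (int r = 0; r < size; ++r) {
            if (r == 0)
                myObs.assign(obs.begin(), obs.begin() + localRows[0]);  // keep own share
            else
                MPI_Send(obs.data() + offset, localRows[r], MPI_DOUBLE, r, 0, columnComm);
            offset += localRows[r];
        }
    } else {
        myObs.resize(localRows[rank]);
        MPI_Recv(myObs.data(), localRows[rank], MPI_DOUBLE, 0, 0, columnComm,
                 MPI_STATUS_IGNORE);
    }
}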

6.3.2 Assembly of the Design Matrices

Within the next step, the design matrix has to be set up in the GRF according to the procedure summarized in Sect. 6.2.2.1. The result should directly be the distributed matrix A; each processor has to fill its local part A^l_{r,c} for its part of the observations and its subspace of the parameters. The following steps are performed sequentially for every observation the process is responsible for. All processes of the compute core grid perform the setup in parallel. Processes of the same column of the compute core grid perform the steps for different observations (in a global view, local rows of A, i.e. rows of A^l_{r,c}) but for the same parameter subset. The processes of the same row of the compute core grid perform the steps for different subsets of the overall parameters (in the global view, local columns of A, i.e. columns of A^l_{r,c}) but for the same observations. The only computations which might be redundant on some cores are the Legendre functions, which are computed recursively for every spherical harmonic order. The amount of redundant computations depends on the target numbering scheme of A.

Within the first step, a serial 6 × C^l_{r,c} matrix is set up containing the design matrix of the gravity gradient for all six tensor components in the LNOF for the C^l_{r,c} local parameters, i.e. the parameters corresponding to the locally stored columns of the local part of the design matrix A^l_{r,c}. This local matrix is then rotated into the GRF applying tensor rotations (cf. Sect. 6.2.2.1).

Figure 6.4: Distribution of the GOCE SGG observation equations as a chronologically sorted block-cyclic distributed matrix and the tailored distribution for the filtering. (a) Standard block-cyclic distribution of a chronologically sorted vector; the vector is distributed as a block-cyclic distributed matrix (not used for the SGG observation equations). (b) The local matrices are filled chronologically; the global view of the (collected) block-cyclic distributed matrix is not chronologically sorted (used for the SGG observation equations).

The rows of this 6 × C^l_{r,c} matrix correspond to the tensor components g ∈ G. The row associated with the gradient component which A should be computed for is copied to the i-th local row of A^l_{r,c}, such that the distributed matrix is filled observation per observation. In addition, the processes of the first compute core grid column c = 0 copy the observations to their local part of the observation vector ℓ^l_{r,c}. A^l_{r,c} and ℓ^l_{r,c} are now available in a distributed form, having a row-wise ordering which is suited for the next step, the filtering. It is important to realize that the observations are not chronologically sorted in the view on the global matrix A.
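As an illustration of this per-observation fill, the following sketch forms the GRF row of one tensor component as the linear combination V^GRF_{ij} = Σ_{k,l} R_{ik} R_{jl} V^LNOF_{kl} of the six LNOF rows; the component ordering XX, XY, XZ, YY, YZ, ZZ and the function name are assumptions for illustration only.

#include <array>
#include <cstddef>
#include <vector>

// Sketch: compute the GRF design-matrix row of tensor component (i,j) from the
// 6 x C^l LNOF design matrix (rows assumed ordered XX, XY, XZ, YY, YZ, ZZ) and the
// rotation matrix R from the LNOF to the GRF, exploiting the tensor symmetry.
std::vector<double> grfDesignRow(const std::array<std::vector<double>, 6>& lnofRows,
                                 const std::array<std::array<double, 3>, 3>& R,
                                 int i, int j) {
    auto idx = [](int k, int l) {            // map the symmetric pair (k,l) to a row index
        static const int map[3][3] = {{0, 1, 2}, {1, 3, 4}, {2, 4, 5}};
        return map[k][l];
    };
    const std::size_t C = lnofRows[0].size();
    std::vector<double> row(C, 0.0);
    for (int k = 0; k < 3; ++k)
        for (int l = 0; l < 3; ++l)
            for (std::size_t p = 0; p < C; ++p)
                row[p] += R[i][k] * R[j][l] * lnofRows[idx(k, l)][p];
    return row;                              // to be copied into the local row of A^l_{r,c}
}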

6.3.3 Applying the Decorrelation by Recursive and Non-Recursive Digital Filters

The implementation of the decorrelation is the most technical step within the procedure. From the implementational point of view, a compromise between a general and flexible implementation and an efficient approach has to be found. The challenges are the correlations between all observations of a segment s and the recursive part of the used decorrelation filter model. The used sequence of non-recursive and recursive filters is able to model correlations along thousands to hundreds of thousands of observations. As, due to memory limitations, A_{s,g} can never be set up at once, even distributed over the compute core grid, only parts of the OEQs of size b_obs can be set up at the same time. Thus, the filtering needs to be implemented as a block-wise filtering, accounting for correlations with previous blocks, i.e. the past. In addition, the rows of the design matrix with only b_obs rows are already distributed in blocks over the compute core grid's rows (and columns), as the concept of block-cyclic distributed matrices is used for the OEQs. Thus, an additional block filtering is required. In summary, there is a doubly nested block filtering. This is further complicated by the fact that recursive filters are used. As an input, they require the already filtered foregoing parts


of the time series. Last but not least, the filters used are cascaded filters, and thus these steps have to be repeated K times, once for every cascade k of the filter.

6.3.3.1 Implementation of the Non-Recursive Filter

As the filtering of different columns of a matrix can be performed independently, the given procedure is the same for all columns of the compute core grid. The filtering is performed fully in parallel by the different columns of the compute core grid. Hence, the focus is only on a single column of the compute core grid, e.g. the column c = 0. First of all, it is assumed that the filter consists of a single cascade only. For the descriptive development of the algorithm, a compute core grid with only R = 3 rows is assumed. The first step of the filtering is the application of the non-recursive part, i.e., written mathematically for the chosen distribution of the OEQs (cf. (6.12c)), the computation of

\begin{bmatrix} \bar{A}^l_{0,0} \\ \bar{A}^l_{1,0} \\ \bar{A}^l_{2,0} \end{bmatrix}
= F_{\beta^k}
\begin{bmatrix} A^l_{0,0} \\ A^l_{1,0} \\ A^l_{2,0} \end{bmatrix}
=
\begin{bmatrix}
F_{\beta^k,0,0} & 0 & 0 \\
F_{\beta^k,1,0} & F_{\beta^k,1,1} & 0 \\
0 & F_{\beta^k,2,1} & F_{\beta^k,2,2}
\end{bmatrix}
\begin{bmatrix} A^l_{0,0} \\ A^l_{1,0} \\ A^l_{2,0} \end{bmatrix}    (6.19a)

=
\begin{bmatrix}
F_{\beta^k,0,0} A^l_{0,0} \\
F_{\beta^k,1,0} A^l_{0,0} + F_{\beta^k,1,1} A^l_{1,0} \\
F_{\beta^k,2,1} A^l_{1,0} + F_{\beta^k,2,2} A^l_{2,0}
\end{bmatrix} ,    (6.19b)

illustrating the band matrix F_{β^k} as a block-diagonal matrix plus an additional second block diagonal.

Note that the matrices F_{β^k,r,c} are no local matrices in terms of the block-cyclic distributed matrices, but serially stored matrices whose subscripts just identify the block in the overall matrix F_{β^k}. If the serial filter matrices F_{β^k,r,c} are locally stored on the cores as full general matrices, time-invariant filter matrices with a Toeplitz structure as well as more general time-variant filters can be handled. The filtering can then be performed via matrix-matrix operations (e.g. BLAS Level 3). In the following, the implementation is discussed for time-invariant filters with constant coefficients, cf. (6.14). This representation is possible because, due to the used filter orders (typically 50 to 300, 5000 for special cases), it can always be achieved that R^l_{r,c} > P_k, e.g. via the choice of b_obs.
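A serial sketch of setting up such a time-invariant filter matrix is given below; it assumes the reading of (6.14) as a lower triangular Toeplitz band matrix with the P coefficients β_0, ..., β_{P-1} on the main and the P−1 lower diagonals, stored column-major as used by BLAS.

#include <cstddef>
#include <vector>

// Sketch: build the serial filter matrix F_{beta^k,r,r} of size Rl x Rl as a lower
// triangular Toeplitz band matrix, F(row, col) = beta[row - col] for 0 <= row - col < P.
std::vector<double> setUpFilterMatrix(const std::vector<double>& beta, int Rl) {
    const int P = static_cast<int>(beta.size());
    std::vector<double> F(static_cast<std::size_t>(Rl) * Rl, 0.0);   // column-major
    for (int col = 0; col < Rl; ++col)
        for (int p = 0; p < P && col + p < Rl; ++p)
            F[static_cast<std::size_t>(col) * Rl + (col + p)] = beta[p];
    return F;
}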

More generally, each process (r,c) can locally compute the filtered part of its local design matrix (and of its local part of the observation vector) via

\bar{A}^l_{r,c} = F_{\beta^k,r,r-1}\, A^l_{r-1,c} + F_{\beta^k,r,r}\, A^l_{r,c}    (6.20)

if it shares the data A^l_{r-1,c} or F_{β^k,r,r-1} A^l_{r-1,c} with the process (r−1,c) (the row above in the compute core grid). The filter matrices are set up serially on each process. For the special case r = 0 and the first block b = 0 of dimension b_obs of a segment, there is no data A^l_{r-1,c}, i.e. no data from the past. For this case, A^l_{r-1,c} = 0 has to be assumed. For the case r = 0 and b > 0, A^l_{R-1,c} from the last block b−1 is used as data from the past. This matrix has to be kept in memory. These operations, i.e. two matrix-matrix multiplications, can be performed serially with fast BLAS-L3 routines. But F_{β^k,r,r} and F_{β^k,r,r-1} are high dimensional (R^l_{r,c} × R^l_{r,c} and R^l_{r,c} × R^l_{r-1,c}, respectively) and thus the serial computation is time-consuming. Using standard filters, both matrices are sparse, as they are banded matrices with P_k < R^l_{r,c} (for time-variant and time-invariant filters). Eq. (6.19a) is illustrated in Fig. 6.5 to show the properties of the computations. The first obvious idea is to represent the involved matrices as triangular matrices; thus, instead of the BLAS-L3 matrix-matrix multiplication (dgemm), the triangular matrix-matrix multiplication (dtrmm) can be used. The second-diagonal matrices F_{β^k,r,r-1} are even sparser, as only an upper triangle of dimension (P_k−1) × (P_k−1) is non-zero. Thus, process (r,c) only needs the last P_k−1 rows of A^l_{r-1,c} or of F_{β^k,r,r-1} A^l_{r-1,c}. Eq. (6.20) is rewritten as

\bar{A}^l_{r,c} =
\begin{bmatrix} \tilde{F}_{\beta^k}\, \tilde{A}^l_{r-1,c} \\ 0 \end{bmatrix}
+ F_{\beta^k,r,r}\, A^l_{r,c} ,    (6.21)

Figure 6.5: Graphical depiction of the non-recursive filtering of the block-cyclic distributed matrices. The shape of the involved matrices is shown. A time-invariant filter with constant coefficients is assumed (Toeplitz band matrix).

defining the sub-matrices as

\tilde{A}^l_{r-1,c} := A^l_{r-1,c}(\mathrm{end}-P_k+1 : \mathrm{end},\, :)    (6.22a)
\tilde{F}_{\beta^k} := F_{\beta^k,r,r-1}(0 : P_k-1,\ \mathrm{end}-P_k+1 : \mathrm{end}) .    (6.22b)

Each process sets up its local matrix A^l_{r,c}, sends its sub-matrix Ã^l_{r,c} to process (r+1,c), performs the filtering F_{β^k,r,r} A^l_{r,c} and receives its Ã^l_{r-1,c} to update the filtered matrix by the remaining part F̃_{β^k} Ã^l_{r-1,c}. The last process of the compute core grid column, i.e. (R−1,c), sends its matrix Ã^l_{r,c} to the process (0,c), as it is needed for the next block loop pass b+1, where the next b_obs observations of the same segment are processed.

This first, not very efficient but quite simple and massively parallel implementation of the non-recursive filtering of distributed design matrices is summarized in Alg. 6.4. The implementation has the advantage that it is very flexible and can, for example, handle time-variant filters whose coefficients vary in time.

The local computation of the product F_{β^k,r,r} A^l_{r,c} on each process can be accelerated using the special properties of the time-invariant filter matrix. For the case of real data, the number of local rows R^l_{r,c} on every process is in the order of several thousand observations. Thus, compared to the order P_k of the used filters (which is typically 2 to 500 for real data processing, depending on the cascade k), the number of rows is large. The larger the imbalance between P_k and R^l_{r,c}, the sparser the filter matrix: F_{β^k,r,r} contains many zeros on each core. The product to be computed serially on each process is visualized in Fig. 6.6. The filter matrix F_{β^k,r,r} can be partitioned into P × P triangular matrices, F^o_β (the light grey ones) and F^u_β (the dark grey ones). Decomposing the matrix to be filtered into blocks of P rows as well, e.g. A^{l,j}_{r,c} (cf. Fig. 6.6), the product of these matrices can be rewritten as

\bar{A}^{l,j}_{r,c} = F^u_\beta\, A^{l,j}_{r,c} + F^o_\beta\, A^{l,j-1}_{r,c} .    (6.23)

Two main advantages are: i) now only small "full" triangular matrices are involved in the computations, and ii) the filter matrices to be stored are only triangular of dimension P × P (storable in a single P × P matrix) and are now constant for the whole loop over all blocks b. This alternative filtering can be efficiently implemented in place, again using the BLAS-L3 routine for triangular matrix-matrix products (dtrmm), which is capable of operating on sub-matrices as well. This is important


Algorithm 6.4: Application of the non-recursive filter to a distributed matrix.

Data:
    vector<double> β     vector of P filter coefficients
    DistributedMatrix A  distributed matrix to be filtered
    int b                number of the block of size b_obs to be processed

// determine my process coordinates and the compute core grid dimension
size_t r, c, R, C
blacs_gridinfo(A.context(), r, c, R, C)
// determine the size of my local part of A
size_t Rl = A.Rl()
size_t Cl = A.Cl()
// send the last P−1 rows of Al to the next processor row (cf. (6.22a))
Matrix Asub = A.localMat(end−P+1 : end, :)
Asub.Send((r+1)%R, c)
if r == 0 then
    if b == 0 then
        // initialize the past with zeros: first processor row and no past for b == 0
        Apast = 0
    else
        // if b > 0, take the memory from the last round b−1, i.e. Amem
        Apast = Amem
    end
    // save the very last P−1 rows for the possible next block of b_obs observations, b+1
    Matrix Amem.Recv(R−1, c)
else
    Apast.Recv(r−1, c)
end
// set up the serial filter matrix of size Rl × Rl according to (6.14)
Matrix Fβ = setUpFilterMatrix(β)
// apply the filter in place using BLAS-L3 dtrmm
A.localMat() = Fβ A.localMat()
// update the first P−1 rows with the missing part from the past
// set up the filter matrix of size (P−1) × (P−1) according to (6.22b)
Matrix Fβsub = setUpFilterMatrix(β)
A.localMat()(1 : P−1, :) += Fβsub Apast
return A   // as in-place filtered matrix

for the last block j, as there might be sub-matrices of F^o_β and F^u_β involved, and for accessing the sub-matrices A^{l,j}. The implementation can only be performed in place if the loop starts at the bottom of the matrix: the unfiltered values from the past are needed, and they are then still stored at the beginning of the matrix and not yet overwritten with filtered values. The extended, more efficient but more technical implementation of the filtering is summarized in Alg. 6.5.
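The serial core of this blocked, in-place filtering, i.e. (6.23) evaluated bottom-up with dtrmm, can be sketched as follows; the sketch assumes column-major storage, R^l divisible by P, a lower triangular F^u_β and a strictly upper triangular F^o_β, and omits the remainder block and the memory from block b−1.

#include <cblas.h>
#include <cstddef>
#include <vector>

// Sketch: filter the local matrix A^l_{r,c} (column-major, leading dimension Rl)
// block by block of P rows, starting at the bottom so that the still unfiltered
// block above can serve as the past, cf. (6.23).
void filterLocalBlocks(double* A, int Rl, int C,
                       const double* Fu,   // P x P lower triangular diagonal block
                       const double* Fo,   // P x P strictly upper triangular off-diagonal block
                       int P) {
    std::vector<double> past(static_cast<std::size_t>(P) * C);
    for (int j = Rl / P - 1; j > 0; --j) {
        double* blockJ = A + j * P;                      // rows j*P ... j*P+P-1
        const double* blockAbove = A + (j - 1) * P;      // still unfiltered block j-1
        // copy the unfiltered block above and form Fo * A^{l,j-1}
        for (int col = 0; col < C; ++col)
            for (int row = 0; row < P; ++row)
                past[static_cast<std::size_t>(col) * P + row] =
                    blockAbove[static_cast<std::size_t>(col) * Rl + row];
        cblas_dtrmm(CblasColMajor, CblasLeft, CblasUpper, CblasNoTrans, CblasNonUnit,
                    P, C, 1.0, Fo, P, past.data(), P);
        // in place: A^{l,j} := Fu * A^{l,j}
        cblas_dtrmm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans, CblasNonUnit,
                    P, C, 1.0, Fu, P, blockJ, Rl);
        // add the contribution from the past: A^{l,j} += Fo * A^{l,j-1}
        for (int col = 0; col < C; ++col)
            cblas_daxpy(P, 1.0, past.data() + static_cast<std::size_t>(col) * P, 1,
                        blockJ + static_cast<std::size_t>(col) * Rl, 1);
    }
    // first block: only Fu is applied; its past comes from block b-1 (not shown here)
    cblas_dtrmm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans, CblasNonUnit,
                P, C, 1.0, Fu, P, A, Rl);
}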

6.3.3.2 Implementation of the Recursive Part of the Filter

Having applied the non-recursive part of the filter of cascade k, the recursive part of the cascade k has to be applied. A recursion generally contradicts a massively parallel implementation if the time series is distributed and not locally stored in the memory of a single core. Within the recursion, the filtering step needs the already filtered observations from the past (cf. (6.11)). Instead of a truly massively parallel implementation, a simple but flexible parallelization was chosen. As for the non-recursive part, the recursive filtering is written out for a simple example (R = 3, shown only for one column of the compute core grid). As for the non-recursive filtering, the recursive filtering

Algorithm 6.5: Application of the non-recursive filter to a distributed matrix in extended form.

Data:
    vector<double> β     vector of P filter coefficients
    DistributedMatrix A  distributed matrix to be filtered
    int b                number of the block of size b_obs to be processed

// determine my process coordinates and the compute core grid dimension
size_t r, c, R, C
blacs_gridinfo(A.context(), r, c, R, C)
// determine the size of my local part of A
size_t Rl = A.Rl()
size_t Cl = A.Cl()
// send the last P−1 rows of Al to the next processor row (cf. (6.22a))
Matrix Asub = A.localMat(end−P+1 : end, :)
Asub.Send((r+1)%R, c)
if r == 0 then
    if b == 0 then
        // initialize the past with zeros: first processor row and no past for b == 0
        Apast = 0
    else
        // if b > 0, take the memory from the last round b−1, i.e. Amem
        Apast = Amem
    end
    // save the very last P−1 rows for the possible next block of b_obs observations, b+1
    Matrix Amem.Recv(R−1, c)
else
    Apast.Recv(r−1, c)
end
// set up the triangular filter matrices of size P × P according to Fig. 6.6
Matrix Fuβ, Foβ = setUpFilterMatrix(β)
// apply the filter in place, starting from the back, via triangular matrix-matrix products (dtrmm)
// remainder block smaller than P
int k = Rl % P
A.localMat(end−k : end, :) = Fuβ(0 : k−1, 0 : k−1) A.localMat(end−k : end, :)
A.localMat(end−k : end, :) += Foβ(0 : k−1, :) A.localMat(end−k−P : end−P, :)
// loop over the full blocks except the first, from the back
for j = Rl−k−P to P step −P do
    A.localMat(j : j+P−1, :) = Fuβ A.localMat(j : j+P−1, :)
    A.localMat(j : j+P−1, :) += Foβ A.localMat(j−P : j−1, :)
end
// processing of the first block (only Fuβ needed)
A.localMat(0 : P−1, :) = Fuβ A.localMat(0 : P−1, :)
// update the first P−1 rows with the missing part from the past
// set up the filter matrix of size (P−1) × (P−1) according to (6.22b)
Matrix Fβsub = setUpFilterMatrix(β)
A.localMat(1 : P−1, :) += Fβsub Apast
return A   // as in-place filtered matrix


Figure 6.6: Graphical depiction of the serial part of the non-recursive filtering on a single process. Special focus is on the shape and the partitioning of the involved matrices. Shown is the case of a non-recursive time-invariant filter.

of different columns of the compute core grid remains independent. Thus, different columns of the compute core grid perform the filtering in parallel. For the first block b = 0 with b_obs observations, the filtering is done via the solution of the system of equations (cf. (6.12d)), again representing the banded matrix F_{α^k} as a block-diagonal matrix including a second block diagonal,

\begin{bmatrix} A^l_{0,0} \\ A^l_{1,0} \\ A^l_{2,0} \end{bmatrix}
= F_{\alpha^k}
\begin{bmatrix} \bar{A}^l_{0,0} \\ \bar{A}^l_{1,0} \\ \bar{A}^l_{2,0} \end{bmatrix}
=
\begin{bmatrix}
F_{\alpha^k,0,0} & 0 & 0 \\
F_{\alpha^k,1,0} & F_{\alpha^k,1,1} & 0 \\
0 & F_{\alpha^k,2,1} & F_{\alpha^k,2,2}
\end{bmatrix}
\begin{bmatrix} \bar{A}^l_{0,0} \\ \bar{A}^l_{1,0} \\ \bar{A}^l_{2,0} \end{bmatrix}    (6.24a)

=
\begin{bmatrix}
F_{\alpha^k,0,0} \bar{A}^l_{0,0} \\
F_{\alpha^k,1,0} \bar{A}^l_{0,0} + F_{\alpha^k,1,1} \bar{A}^l_{1,0} \\
F_{\alpha^k,2,1} \bar{A}^l_{1,0} + F_{\alpha^k,2,2} \bar{A}^l_{2,0}
\end{bmatrix} .    (6.24b)

A^l_{r,c} are the local matrices to be filtered and \bar{A}^l_{r,c} are the filtered output matrices.

More generally, on a single core, the operations

A^l_{r,c} = F_{\alpha^k,r,r-1}\, \bar{A}^l_{r-1,c} + F_{\alpha^k,r,r}\, \bar{A}^l_{r,c}    (6.25a)
A^l_{r,c} - F_{\alpha^k,r,r-1}\, \bar{A}^l_{r-1,c} = F_{\alpha^k,r,r}\, \bar{A}^l_{r,c}    (6.25b)
\bar{A}^l_{r,c} = \operatorname{solve}\!\left( F_{\alpha^k,r,r},\ A^l_{r,c} - F_{\alpha^k,r,r-1}\, \bar{A}^l_{r-1,c} \right)    (6.25c)
\bar{A}^l_{r,c} = F_{\alpha^k,r,r}^{-1} \left( A^l_{r,c} - F_{\alpha^k,r,r-1}\, \bar{A}^l_{r-1,c} \right)    (6.25d)

have to be performed. As visible in (6.25c), process (r,c) needs the already filtered output of process (r−1,c) to solve the local filtering. The filtering is implemented only partially in parallel to obtain a flexible implementation. Only the processes of different columns c of the compute core grid perform the filtering in parallel.

After the local filter operation, the filtered results are sent to the next row r+1 of the compute core grid, which itself had to wait until (r−1,c) finished the filtering. The matrix to be filtered is reduced by the product F_{α^k,r,r-1} \bar{A}^l_{r-1,c} and the system of equations has to be solved. As for the non-recursive part of the filter, the product can be rewritten, as F_{α^k,r,r-1} is again sparse: only the upper triangle is filled, with Q−1 diagonals (comparable to Fig. 6.5, demonstrated in Fig. 6.7).

Figure 6.7: Graphical depiction of the serial part of the recursive filtering on a single process and the partitioning of the involved matrices.

Instead of setting up F_{α^k,r,r-1}, we can define the (on all processes constant) triangular matrix (for the case of time-invariant filters)

\tilde{F}_{\alpha} :=
\begin{bmatrix}
\alpha_Q & \alpha_{Q-1} & \cdots & \alpha_1 \\
0 & \alpha_Q & \cdots & \alpha_2 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & \alpha_Q
\end{bmatrix}    (6.26)

of dimension Q × Q. Thus, only the first Q rows of A^l_{r,c} have to be reduced by

\Delta\bar{A}^l_{r-1,c} := \tilde{F}_{\alpha}\, \bar{A}^l_{r-1,c}(\mathrm{end}-Q : \mathrm{end},\, :)    (6.27a)
= \tilde{F}_{\alpha}\, \tilde{A} ,    (6.27b)

using the defined sub-matrix (the already filtered values from the past)

\tilde{A} := \bar{A}^l_{r-1,c}(\mathrm{end}-Q : \mathrm{end},\, :) .    (6.28)

After the reduction, the system of equations is solved (cf. Fig. 6.7). This can be performed via the solution of a triangular system (forward substitution, e.g. the BLAS routine dtrsm) or via the solution of a banded system, for which a LAPACK routine for serial matrices also exists (dgbsv). Alg. 6.6 summarizes the (partially) parallel distributed filtering for the recursive part of the filter. Again, the implementation is flexible and can easily be extended to filter coefficients changing in time.
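A serial sketch of the forward substitution on the local part is given below; it assumes the normalized form with α_0 = 1 (one possible reading of (6.13)) and that the last Q−1 already filtered rows of process (r−1,c) have been received into a small dense matrix.

#include <cstddef>
#include <vector>

// Sketch: solve F_alpha * Abar = A column by column via forward substitution.
// A: local matrix, column-major, Rl x C (unfiltered on input, filtered on output).
// past: last Q-1 filtered rows of process (r-1,c), column-major, (Q-1) x C (zeros if none).
// alpha: Q recursive filter coefficients, alpha[0] assumed to be 1.
void recursiveFilterLocal(std::vector<double>& A, int Rl, int C,
                          const std::vector<double>& past,
                          const std::vector<double>& alpha) {
    const int Q = static_cast<int>(alpha.size());
    for (int col = 0; col < C; ++col) {
        double* a = A.data() + static_cast<std::size_t>(col) * Rl;
        const double* p = past.data() + static_cast<std::size_t>(col) * (Q - 1);
        for (int i = 0; i < Rl; ++i) {
            // subtract the contributions of already filtered values ...
            for (int q = 1; q < Q; ++q) {
                const int j = i - q;
                if (j >= 0)
                    a[i] -= alpha[q] * a[j];              // ... from this local block
                else if (j >= -(Q - 1))
                    a[i] -= alpha[q] * p[Q - 1 + j];      // ... from process (r-1,c)
            }
            // alpha[0] == 1, so no division is required
        }
    }
}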

6.3.3.3 Application of Cascaded Filters

For general filters consisting of multiple cascades, the non-recursive and recursive parts are applied consecutively for every cascade. For all cascades k ∈ {0, ..., K−1}, first the non-recursive part and afterwards the recursive part is applied to the matrices to be filtered. The procedure is briefly summarized in Alg. 6.7 and in more detail later on in Alg. 6.8, which summarizes the whole procedure of the GOCE gradiometry NEQ assembly.
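As a structural sketch of this cascade loop, the two filter stages can be passed as callables standing in for Alg. 6.5 and Alg. 6.6; the struct and function names are illustrative only.

#include <cstddef>
#include <functional>
#include <vector>

// Sketch: for every cascade k, apply the non-recursive stage first and the
// recursive stage afterwards, each with the coefficients of that cascade.
struct FilterCascade {
    std::vector<double> beta;    // non-recursive coefficients of cascade k
    std::vector<double> alpha;   // recursive coefficients of cascade k
};

using FilterStage = std::function<void(const std::vector<double>& coeffs, int cascade)>;

void applyCascadedFilter(const std::vector<FilterCascade>& cascades,
                         const FilterStage& nonRecursiveStage,   // cf. Alg. 6.5
                         const FilterStage& recursiveStage) {    // cf. Alg. 6.6
    for (std::size_t k = 0; k < cascades.size(); ++k) {
        nonRecursiveStage(cascades[k].beta, static_cast<int>(k));
        recursiveStage(cascades[k].alpha, static_cast<int>(k));
    }
}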

Two final remarks on the filtering have to be made. As can be seen from the algorithms above, the filtering of the very first observations of every segment can only be performed approximately, as there are no observations from the past. Instead, zeros are assumed for the matrices A^l_{r-1,c} in (6.20) and \tilde{A} (cf. (6.28)). As this results in unreasonable filtered values, the first observations of a


Algorithm 6.6: Application of the recursive filter to a distributed matrix.

Data:
    vector<double> α     vector of Q filter coefficients
    DistributedMatrix A  distributed matrix to be filtered
    int b                number of the block of size b_obs to be processed

// determine my process coordinates and the compute core grid dimension
size_t r, c, R, C
blacs_gridinfo(A.context(), r, c, R, C)
// determine the size of my local part of A
size_t Rl = A.Rl()
size_t Cl = A.Cl()
if r == 0 then
    if b == 0 then
        // initialize the past with zeros: first processor row and no past for b == 0
        Apast = 0
    else
        // if b > 0, take the memory from the last round b−1, i.e. Amem
        Apast = Amem
    end
    // save the very last Q−1 rows for the possible next block of b_obs observations, b+1
    Matrix Amem.Recv(R−1, c)
else
    // blocking receive, wait until the matrix is completely received
    Apast.Recv(r−1, c)
end
// set up the filter matrix of size Q × Q according to (6.26)
Matrix Fαsub = setUpFilterMatrix(α)
// reduce the first Q−1 rows by the preceding filtered rows (cf. (6.27))
A.localMat(0 : Q−1, :) −= Fαsub Apast
// set up the filter matrix of size Rl × Rl according to (6.13)
Matrix Fα = setUpFilterMatrix(α)
// solve the system of equations in place, either as a banded or as a triangular system (dtrsm, dgbsv)
A.localMat() = solve(Fα, A.localMat())
// send the last Q−1 rows of Al to the next processor row
Matrix B = A.localMat(end−Q+1 : end, :)
B.Send((r+1)%R, c)
return A   // as in-place filtered matrix

segment cannot be used within the adjustment. The number of observations which are not properly decorrelated depends on the properties of the filter. This is called the warmup phase of the filter (e.g. Schuh, 2003); the length of the warmup phase, and thus the number of observations affected, can be determined numerically (e.g. Siemes, 2008, p. 74). These observations are always removed at the start of a new segment after the filtering and are not assembled into the final NEQs.

Processing a long time series in blocks b of size b_obs requires the so-called filter memory, especially when using cascaded filters. After each pass b, in which all cascades are applied to the observations of that block, the very last P−1 unfiltered observations for the non-recursive filter and the last Q−1 already filtered observations for the recursive filter have to be memorized. They serve as information from the past within the next loop iteration, where the next b_obs observations are processed. Within Alg. 6.5 and 6.6, this is nothing else than the matrices of the last compute core grid row, i.e. of the processes (R−1,c). They are sent to the processes (0,c) again, to be used as the past Apast within the next loop pass b+1. These matrices have to be stored individually for every cascade, to have the information of the past observations at the correct filtering status. Within Alg. 6.5 and 6.6, these matrices are stored as Amem on the processes (0,c).
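A possible data structure for this per-cascade memory is sketched below; the names are illustrative and not taken from the text.

#include <vector>

// Sketch: filter memory kept for the processes (0,c). For every cascade k, the last
// P-1 unfiltered rows (non-recursive stage) and the last Q-1 already filtered rows
// (recursive stage) received from processor row R-1 after block b are stored and
// reused as the past when block b+1 of the same segment is processed.
struct CascadeMemory {
    std::vector<double> unfilteredPast;   // last P-1 rows before the non-recursive stage
    std::vector<double> filteredPast;     // last Q-1 rows after the recursive stage
};

struct FilterMemory {
    std::vector<CascadeMemory> perCascade;  // one entry per cascade k = 0, ..., K-1
};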