**4. Mathematical and Statistical Description of the Adjustment Problem**

**4.3 Numbering Schemes and Reordering**

4.3. Numbering Schemes and Reordering 39

Index Vector for Reordering A first quantity to derive is the so called index vectori_{p}pp_{f}7→ppp_{t}, the
entry at position i contains the index, the coefficient pppt(i) can be found inpppf, and thus pppt(i) =
p

p

p_{f}(ippp_{f}7→ppp_{t}(i)). Given a matrix Ainppp_{f}, its rows are reordered topppt via

A_{p}ppt =A_{p}ppf(i_{p}pp_{f}7→ppp_{t},:), (4.23)

its columns via

Apppt =Appp_{f}(:,ippp_{f}7→ppp_{t}), (4.24)

and for quadratic matrices rows and columns via

Apppt =Apppf(ippp_{f}7→ppp_{t},ippp_{f}7→ppp_{t}), (4.25)
using the well known MatLab/Octave like notation. The index vectori_{p}pp_{f}7→ppp_{t} can be computed with
the descriptive Alg. 4.1. which works for both cases, i.e. pppf⊆pppt orpppt⊆pppf.

Note that if ppp_{f} ⊆ ppp_{t}, the matrix to be reordered will be extended by 0 rows and/or columns to
obtain the correct size ofpppt.size(), the index vector for this parameters not originally contained
in pppf are set to values corresponding to these 0 rows and/or columns at the end of the matrices
(variable ein Alg. 4.1).

For the case pppt ⊆ pppf, the index vector is extended to the dimension of pppf. The last entries i.e.

p p

p_{f}.size()toppp_{t}.size()−1are set to the indices of the entry itself to guarantee, that the coefficients
not contained in pppt are reordered to the end of the matrix. Applying the reordering to a matrix,
these last rows/columns are deleted after the reordering.

Although Alg. 4.1 can be easily understood, it is not very efficient due to the “find” operation, i.e.

a search operation in a vector which might have millions of entries for huge dimensional parameter spaces. An alternative, i.e. a more efficient way can be performed via sorting the numbering scheme.

This much faster algorithm is summarized in Alg. 4.2. It reduces the complexity from O(n^{2}/2)to
O(nlog(n)). For a test case of two numbering schemes of520 000parameters, the runtime is reduced
from 706 sto 0.2 s.

Permutation Vector for Reordering An alternative notation/operation, which is better suited
for the block-cyclic distributed matrices, is a so called permutation or pivoting vector. In contrast
to an index vector, where the column and or row interchanges are assumed to be performed
si-multaneously, a permutation vector (at least as used here) contains a sequence of serial row and
column permutations i.e. a sequential swapping of two rows/columns starting at begin (index 0)
of the vector. In contrast to the index vector, already performed swapping operations are taken
into account in the representation. An entry in the permutation vector at position i means that
the row i is swapped with row ψψψ(i). To be more precise, the current content of row/column i is
swapped with the current content of row/column ψψψ(i). Note that the content might change with
every swapping operation. Now, the old entry of position i is in row ψψψ(i), thus the index vector
needs to be updated. A remaining entry i in the subsequent elements ofippp_{f}7→ppp_{t} has to be replaced
by the entryψψψ(i). The procedure to convert an index vector to a permutation vector is summarized
in Alg. 4.3.

A more efficient but less obvious algorithm is shown in Alg. 4.4. The basic idea is to avoid the search
operation via introducing a second vector which stores the position of an entrykin the vector. This
reduces the complexity fromO(n^{2}/2)to O(2n). The runtime is reduced from 59 s to 0.01 s, for an
example index vector with 520 000entries.

To apply a permutation vector to rows/and or columns, the operator ΨΨΨppp_{f}7→ppp_{t}(·) is defined. This
operator performs the serial permutations of rows (ΨΨΨ^{r}_{p}_{p}_{p}

f7→ppp_{t}) as given by the vector ψψψ_{p}pp_{f}7→ppp_{t}, the
operator ΨΨΨ^{c}_{p}_{p}_{p}

f7→ppp_{t} performs the column interchanges and ΨΨΨ^{r,c}_{p}pp_{f}7→ppp_{t} performs the interchanges for rows
and columns.

Algorithm 4.1: Simple version to compute an index vector from two symbolic numbering schemes.

Data:

NumberingSchme ppp_{from} Symbolic numbering scheme source matrix is ordered in
NumberingSchme ppp_{into} Symbolic numbering scheme matrix should be reordered to
// initialization of index vector

1

vector<size_t> i_{p}_{p}_{p}

from7→ppp_{into}(ppp_{into}.size(),0 )

2

// start value for fill in indices for parameters inppp_{into} but not inppp_{from}, inserted at the end

3

size_te=ppp_{from}.size()

4

// loop over all parameters inppp_{into}

5

fork= 0 toppp_{into}.size() do

6

// find index of parameterppp_{into}(i) inppp_{from}

7

i=find(ppp_{from}.begin(), ppp_{from}.end(), ppp_{into}(i) )

8

// if parameter found insert index i, otherwise fill in value outside ofppp_{from}.size()

9

if i<ppp_{from}.size() then

10

i_{p}_{p}_{p}_{from}_{7→p}_{p}_{p}_{into}(k) =i

11

else

12

i_{p}_{p}_{p}_{from}_{7→p}_{p}_{p}_{into}(k) =e

13

e+ +

14

end

15

end

16

// special case ifpppinto ⊆pppfrom: extend index vector to size ofpppfrom 17

if pppinto.size()<pppfrom.size() then

18

i_{p}_{p}_{p}

from7→ppp_{into}.resize(pppfrom.size())

19

// the remaining parameters are sorted to the end as they are not contained inpppinto,

20

// applied to a matrix, these last rows/columns will be removed

21

fork=ppp_{into}.size() toppp_{from}.size() do

22

i_{p}_{p}_{p}_{from}_{7→p}_{p}_{p}_{into}(k) =k

23

end

24

end

25

returni_{p}_{p}_{p}

from7→ppp_{into}// index vector performing reordering from ppp_{from}toppp_{into}

26

Example Tab. 4.1 gives a small scale example of reordering and permutation for two symbolic numbering schemes. Index and permutation vector are provided for both directions. In addition, the first steps of a sequential permutation are characterized.

4.3.3 Reordering of Block-cyclic Distributed Matrices

If the permutation vector between two numbering schemes is known, these permutations have to be applied to block-cyclic distributed matrices to reorder the matrices to the target numbering scheme.

The reason why the permutation vector was introduced is, that there exist a SCALAPACK helper
routine, which is internally used for pivoting during the solution of a system of equations. The
routinepdlapiv can be directly used to perform the reordering of a distributed matrix, where the
input is a permutation vector (The usage of pdlaswp is also possible). The routine applies the
permutation as given by ψψψppp_{f}7→ppp_{t} to either rows or columns. Applying the function twice, first to
permute rows and secondly to permute columns, both are reordered. Beside the standard input
of the block-cyclic distribution of the matrix, the function requires the input of the permutation
vector in a special form. The permutation vector (consisting of integers only) needs to be passed as
a block-cyclic distributed vector of integers, such that the local entryψψψ_{p}^{l}_{p}_{p}

f7→ppp_{t}(i) contains the global
row/column, the local row/columnihas to be swapped with. The serial integer vector is brought to
a block-cyclic distributed vector as introduced for the block-cyclic distributed matrices in Sect. 3.3.

4.3. Numbering Schemes and Reordering 41

Algorithm 4.2: Fast version to compute an index vector from two symbolic numbering schemes.

Data:

NumberingSchme ppp_{from} Symbolic numbering scheme source matrix is ordered in
NumberingSchme ppp_{into} Symbolic numbering scheme matrix should be reordered to
// initialization of index vector

1

vector<size_t> i_{p}_{p}_{p}

from7→ppp_{into}(ppp_{into}.size(),0 )

2

// start value for fill in indices for parameters inppp_{into} but not inppp_{from}, inserted at the end

3

size_te=ppp_{from}.size()

4

// store current index of parameter in helping atribut_iof each individual parameter

5

p p

p_{from}.setIndex()

6

// sort numbering schme with implemented sort function (operator<)

7

p p

p_{from}.sort()

8

// loop over all parameters inppp_{into}

9

fork= 0toppp_{into}.size() do

10

// find index of parameterppp_{into}(i) inppp_{from} in sorted numbering scheme

11

i=find(ppp_{from}.begin(), ppp_{from}.end(), ppp_{into}(i) )

12

// if parameter found, insert indexi, otherwise fill in value outside ofppp_{from}.size()

13

if i<ppp_{from}.size() then

14

i_{p}_{p}_{p}_{from}_{7→p}_{p}_{p}_{into}(k) =ppp_{from}.p(i).i()

15

else

16

i_{p}pp_{from}7→ppp_{into}(k) =e

17

e+ +

18

end

19

end

20

// special case ifppp_{into}⊆ppp_{from}: extend index vector to size ofppp_{from}

21

if ppp_{into}.size()<ppp_{from}.size() then

22

i_{p}_{p}_{p}

from7→ppp_{into}.resize(ppp_{from}.size())

23

// the reamaining parameters are sorted to the end as they are not contained inpppinto,

24

// applied to a matrix, these last rows/columns will be removed

25

fork=pppinto.size() toppp_{from}.size() do

26

i_{p}pp_{from}7→ppp_{into}(k) =k

27

end

28

end

29

returni_{p}_{p}_{p}_{from}_{7→p}_{p}_{p}_{into}// index vector performing reordering from ppp_{from} toppp_{into}

30

The reordering operations are implemented in the member functions reorder, reorderCols and reorderRows of theDistributedMatrixclass in Listing 3.2.

Fig. 4.1 gives an overview about the required runtime for the reordering of rows and columns of
distributed matrices of different dimension on different quadratic processor grids (for the distribution
parameters default values ofb_{r} =b_{c}= 64were used) to get an idea of the order of magnitude. The
index vector was randomly generated (random shuffle of an index vector). The main conclusions are:

i) the reordering of columns is much faster than the reordering of rows (factor of three to ten). This could be expected as the column access in memory is much faster using the column major order for matrices (cf. Sect. 2.2.1) for the locally stored matrices. ii) for matrices of dimension lower than 20 000×20 000 the reordering is performed in less than 1 s on all grids. For the reordering of columns, this even holds for matrices smaller than 80 000×80 000. Although there is no real scaling behavior of the reordering operations with the number of cores (cf. Fig. 4.1(b)), the most important thing is that the performance increases on larger compute core grids and does not drop to additional organizational requirements (at least for matrices above dimension10 000×10 000the scaling is above 1.0 for all cases analyzed).

Algorithm 4.3: Simple version to convert an index vector to a permutation vector.

Data:

NumberingSchme i_{p}pp_{from}7→ppp_{into} index vector to be converted to permutation vector
// initialization of permutation vector

1

vector<size_t> ψψψppp_{from}7→ppp_{into} =i_{p}_{p}_{p}_{from}_{7→p}_{p}_{p}_{into}

2

size_tp= 0;// loop over entries of index vector

3

fork= 0 to i_{p}_{p}_{p}

from7→ppp_{into}.size() do

4

// find current index in the subsequent part of index vector

5

p=find(ppp_{from}.begin()+k, ppp_{from}.end(), k)

6

// if index found, replace it by its new positionψψψppp_{from}7→ppp_{into}(k)(after swapping)

7

if found then

8

ψψψppp_{from}7→ppp_{into}(p) =ψψψppp_{from}7→ppp_{into}(k)

9

end

10

end

11

returnψψψppp_{from}7→ppp_{into}// sequential permutation vector corresponding to i_{p}_{p}_{p}_{from}_{7→p}_{p}_{p}_{into}

12

10^{+2} 10^{+3} 10^{+4} 10^{+5}

10^{−2}
10^{−1}
10^{+0}
10^{+1}
10^{+2}

# rows and columns of matrix to be reordered (R=C)

measuredruntimeforreordering(s)

reordering of rows (grid: 16×16 = 256) reordering of columns (grid: 16×16 = 256) reordering of rows (grid: 24×24 = 576) reordering of columns (grid: 24×24 = 576) reordering of rows (grid: 32×32 = 1024) reordering of columns (grid: 32×32 = 1024) reordering of rows (grid: 48×48 = 2304) reordering of columns (grid: 48×48 = 2304)

(a) Mean absolute runtime measured for the reordering.

100^{+2} 10^{+3} 10^{+4} 10^{+5}

2 4 6 8 10

# rows and columns of matrix to be reordered (R=C)

scalingbehaviornormalizedto256cores[]

reordering of rows (grid: 24×24 = 576) reordering of columns (grid: 24×24 = 576) reordering of rows (grid: 32×32 = 1024) reordering of columns (grid: 32×32 = 1024) reordering of rows (grid: 48×48 = 2304) reordering of columns (grid: 48×48 = 2304)

(b) Scaling behavior of reordering operations.

Figure 4.1: Runtime analysis of the row (?) and column (◦) reordering operations (index vector randomly generated). The colors represent different dimensions of the processor grid.

The scaling is normalized to 256 cores and given for every matrix dimension.