measure seems difficult since even the notion of local structural neighborhood patterns is hard to grasp. In this chapter, we argue that the spread of probability mass over the node's most relevant local neighbors is a good characteristic of the node's role. Similarly to [91], we leverage Approximate Personalized PageRank (APPR) to effectively describe multiple locality structures around the vertices and use the probability distribution vectors as a basis to quantify the structural roles of the nodes. An important feature of our novel node representation is that it is very efficient to compute and thus suitable even for large data sets. Furthermore, an important difference to previously published related works, e.g., [82], is that our method operates directly in the vertex domain, though the heat kernel diffusion process resembles that implied by PPR [71]. Additionally, our method is not restricted to k-hop neighborhoods.

Our empirical evaluation demonstrates that our simple approach outperforms more sophisticated state-of-the-art role-based node representations. With respect to previously published work on the topic of structural node embeddings (see Section 9.1), we summarize the key contributions of the work presented in this chapter as follows:

• A novel structure-based approach to determine role representations for single nodes directly in the vertex domain as opposed to existing diffusion-based approaches which operate in the spectral domain.

• A fast-to-compute approach that retrieves continuous role representations rather than being composed of multiple, computationally costly structural features.

• An extensive evaluation of our proposed role representations that shows promising results when comparing them to state-of-the-art node embeddings in their respective experimental setups.

### 11.2 Structural Node Representations using Approximate Personalized PageRank

Algorithm 7 APPR

Input: Source node v_{i}, teleportation probability α, approximation threshold ε
Output: APPR vector p_{i}

1: p_{i} = 0, r_{i} = e_{i}
2: while r_{ij} ≥ ε·d_{j} for some vertex v_{j} do
3:   pick any v_{j} where r_{ij} ≥ ε·d_{j}
4:   push(v_{j})
5: end while
6: return p_{i}

Algorithm 8 push(v_{j})

1: p_{ij} = p_{ij} + (2α/(1 + α))·r_{ij}
2: for v_{k} with (v_{j}, v_{k}) ∈ E do
3:   r_{ik} = r_{ik} + ((1 − α)/(1 + α))·r_{ij}/d_{j}
4: end for
5: r_{ij} = 0
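The push procedure of Algorithms 7 and 8 can be sketched in Python as follows. The dict-based sparse vectors, the adjacency-list input format, and the work queue are our own implementation choices, not prescribed by the pseudocode; this is a sketch for an unweighted graph, not a reference implementation.

```python
from collections import defaultdict

def appr(adj, source, alpha, eps):
    """Push-based APPR (cf. Algorithms 7 and 8) for an unweighted graph
    given as an adjacency list {node: [neighbors]}.

    Returns the sparse APPR vector p as a dict {node: probability}."""
    p = defaultdict(float)      # approximate PPR vector p_i (starts at 0)
    r = defaultdict(float)      # residual vector r_i
    r[source] = 1.0             # r_i = e_i
    queue = [source]            # candidates with r_{ij} possibly >= eps * d_j
    while queue:
        v = queue.pop()
        d_v = len(adj[v])
        if r[v] < eps * d_v:    # stale entry: threshold no longer exceeded
            continue
        # push(v): move a 2*alpha/(1+alpha) fraction of the residual to p ...
        p[v] += (2 * alpha / (1 + alpha)) * r[v]
        # ... spread the remaining mass evenly over v's neighbors ...
        share = ((1 - alpha) / (1 + alpha)) * r[v] / d_v
        r[v] = 0.0              # ... and clear v's residual
        for u in adj[v]:
            r[u] += share
            if r[u] >= eps * len(adj[u]):
                queue.append(u)
    return dict(p)
```

The smaller the threshold eps, the closer the total mass of p gets to 1, at the cost of more push operations.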

### Approximate Personalized PageRank

Personalized PageRank (PPR) can be viewed as a special case of the PageRank [201] algorithm, where each random walk starts at the same node v_{i} and at each step there is a probability of α to jump back to v_{i}. The effect of this modification is that the PageRank scores are personalized to the node v_{i}, i.e., they represent the importance of each node from the perspective of the source node v_{i}. Formally, the PPR-vector π_{i} of node v_{i} is given as the solution of the linear system

π_{i} = α e_{i} + (1 − α) π_{i} W, (11.1)

where W = D^{−1}A is the random walk transition matrix obtained from the n×n adjacency matrix A by normalizing the outgoing links of each node by its out-degree, e_{i} ∈ R^{1×n} denotes the i-th unit vector, and α is the teleportation parameter.

The probability of transitioning from a node v_{i} to a neighbor v_{j} is given by w_{ij}. The entry π_{ij} can then be interpreted as the probability that a random walk starting at v_{i} stops at v_{j}. The expected length of a random walk is determined by the teleportation probability α: with a smaller value, a larger portion of the graph is explored, while a larger value leads to stronger localization.
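As an illustration, Eq. (11.1) can be solved by simple fixed-point iteration. The dense NumPy representation and the iteration count below are our own illustrative choices; real PPR solvers use sparse or push-based methods such as the APPR algorithm above.

```python
import numpy as np

def ppr_vector(A, i, alpha, iters=500):
    """Solve pi = alpha * e_i + (1 - alpha) * pi @ W by fixed-point
    iteration, where W = D^{-1} A (assumes no zero-degree nodes)."""
    n = A.shape[0]
    W = A / A.sum(axis=1, keepdims=True)  # row-normalize: W = D^{-1} A
    e_i = np.zeros(n)
    e_i[i] = 1.0
    pi = e_i.copy()                       # any probability vector works
    for _ in range(iters):
        pi = alpha * e_i + (1 - alpha) * pi @ W
    return pi
```

Since W is row-stochastic, the iterate stays a probability vector, and the update is a contraction with factor (1 − α), so convergence is geometric.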

Intuitively, π_{ij} measures the importance of node v_{j} for node v_{i}, and the PPR vector π_{i} as a whole yields a distribution of the node importance in the neighborhood of v_{i}, where the extent of the neighborhood is controlled by the parameter α. In particular, the neighborhood is not restricted to nodes with a maximum hop-distance, such as the k-neighborhood, which may contain irrelevant nodes or miss important ones. Compared to the shortest path distance, nodes with a larger shortest path distance from v_{i} could still be more important, e.g., if they can be reached via many different short paths. For similar reasons, nodes with a small shortest path distance might not be equally important. Such effects are captured by PPR.

Algorithm 9 APPR-Roles

Input: Graph G, labels L, teleportation probabilities αs, approximation threshold ε
Output: Classification model m

1: role_descriptors = list()
2: for idx in range(G) do
3:   v = G.getNode(idx)
4:   emb_{v} = list()
5:   for α in αs do
6:     p^{α}_{v} = APPR(v, α, ε)
7:     emb_{v}.append(entropy(p^{α}_{v}))
8:   end for
9:   role_descriptors.append(emb_{v})
10: end for
11: m = LogisticRegression().fit(role_descriptors, L)
12: return m

Local push-based algorithms [133, 33] compute Approximate Personalized PageRank (APPR) very efficiently and lead to sparse solutions, where only the most relevant neighbors are taken into consideration [91]. In addition to the teleportation parameter α, the approximation threshold ε controls the approximation quality and runtime. The main idea is to start with all probability mass concentrated in the source node and then repeatedly push probability mass to neighboring nodes as long as the amount of mass pushed is large enough. In this work, we consider the algorithm proposed in [22].

In particular, we use an adapted version proposed in [234] which converges faster. The procedure is formalized in Algorithm 7 and Algorithm 8.

For a given graph G_{i} and teleportation probability α, we compute the
APPR-vector p^{(α)}_{j} of each node v_{j} ∈ V_{i} and store it as the j-th row of the
sparse n×n APPR-matrix P_{i}^{(α)}.

### Entropy-based Node Descriptors

The APPR-vector p_{i} of a node v_{i} effectively models the connectivity of that
node with respect to all other nodes in the graph as a probability distribution,
where the probability mass is concentrated only on v_{i}’s relevant neighbors.

This way, the size of the neighborhood is determined by the parameter α.

Figure 11.2: Workflow for calculating the role-based node descriptors.

The APPR-vector p_{i} additionally focuses on the most relevant neighbors by ignoring nodes with small probabilities, thus providing a sparse neighborhood representation. In principle, we could use the APPR-vectors directly as node representations. This would lead to the following feature space:

∆^{n} = { p ∈ R^{n}_{≥0} | Σ_{i=1}^{n} p_{i} = 1 }, (11.2)

which is known as the n-dimensional standard simplex. However, the resulting representations model homophily rather than structural properties, since they encode the information to which individual nodes a particular source node is connected. In order to make the representations location-invariant, we need to factor out this information. Since location invariance in this case translates to permutation invariance, we consider the quotient space

∆^{n}/∼ = {[p] | p ∈ ∆^{n}}, (11.3)

which corresponds to the set of equivalence classes [p] = {q ∈ ∆^{n} | p ∼ q} of the equivalence relation ∼ with p ∼ q ⇔ ∃P ∈ P : p = qP, where P = {P ∈ {0,1}^{n×n} | P1 = 1, P^{T}1 = 1} is the set of permutation matrices.

As a corresponding quotient map, we can define f : ∆^{n} → ∆^{n}/∼ with f(p) = pP_{p}, which maps p to its equivalence class by sorting it using the permutation matrix P_{p} such that 1 ≥ f(p)_{1} ≥ · · · ≥ f(p)_{n} ≥ 0.
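In code, the quotient map f amounts to sorting the entries in descending order; a minimal NumPy illustration (the example vectors are our own):

```python
import numpy as np

def f(p):
    """Canonical representative of the equivalence class [p]:
    the entries of p sorted in descending order."""
    return np.sort(p)[::-1]

# Two vectors that differ only by a permutation of the node ids map to
# the same representative, so f factors out node identity.
p = np.array([0.1, 0.6, 0.0, 0.3])
q = np.array([0.3, 0.0, 0.6, 0.1])
```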

Though the resulting sorted APPR-vectors qualify as structural node descriptors, they are not well suited for further downstream tasks since they are high-dimensional and sparse. Furthermore, node descriptors would not be comparable among graphs with different numbers of nodes. To this end, we need to perform some form of aggregation. Our approach is based on the observation that, in terms of APPR, the structural properties of two nodes differ mostly based on the extent to which they spread their probability mass throughout the graph. For instance, a community node will spread its probability mass evenly to nodes within the same community, whereas a peripheral node will strongly concentrate its probability mass on one or very few nodes to which it is connected. The above behavior can be accurately described by the Shannon entropy H : ∆^{n}/∼ → R with H(p) = −Σ_{i=1}^{n} p_{i} log p_{i}, where we use the binary logarithm. In particular, it fulfills the following properties.

Theorem 1. For all p ∈ ∆^{n}/∼ it holds that

1. H(p) ∈ [0, log n].
2. H(p) = 0 if and only if p = e_{1}.
3. H(p) = log n if and only if p = (1/n)1.
4. H(p) = log n − D_{KL}(p ‖ (1/n)1).

Proof. The proofs are straightforward and can be found in [74].
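Property (4) is easy to check numerically. The helper functions and the example distribution below are our own illustration, not part of the method:

```python
import math

def H(p):
    """Shannon entropy with binary logarithm."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def kl_to_uniform(p):
    """D_KL(p || (1/n) 1) in bits."""
    n = len(p)
    return sum(x * math.log2(x * n) for x in p if x > 0)

p = [0.5, 0.25, 0.125, 0.125]
# Property (4): H(p) = log2(n) - D_KL(p || uniform) holds for this p,
# and properties (2) and (3) hold for the peaked and uniform cases.
```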

Intuitively stated, properties (1) to (3) say that the entropy is minimized for a distribution with a single peak and maximized for the uniform distribution. Property (4) states that the entropy can be interpreted as the similarity to the uniform distribution in terms of the Kullback-Leibler divergence. Our empirical results support the usefulness of this intuition. A further advantage over other applicable dimension reduction techniques is that we can describe each node by a single scalar value which can be visualized directly on a color map (as was done in Figure 11.1) and has a simple and intuitive meaning. Note that the entropy function is symmetric, i.e., H(f(p)) = H(p) for all p ∈ ∆^{n}. As a result, the APPR-vectors need not be sorted, and the entropy of a single node v_{i} can be computed in linear time with respect to the number of non-zero entries of p_{i}.
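Since only non-zero entries contribute, the entropy of a sparse APPR vector can be computed directly on its support, with no sorting needed. A sketch, assuming the sparse vector is stored as a dict (our choice of data structure):

```python
import math

def entropy(p_sparse):
    """Binary-log Shannon entropy of a sparse probability vector stored
    as {node: probability}. The cost is linear in the number of
    non-zero entries, and the result is invariant to any relabeling
    of the node ids (the symmetry H(f(p)) = H(p))."""
    return -sum(q * math.log2(q) for q in p_sparse.values() if q > 0)
```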

Recalling that the teleportation parameter α in APPR controls the effective neighborhood size, we detect roles on multiple scales by computing APPR for multiple parameter values α ∈ {α_{1}, . . . , α_{l}} and concatenating for each node its corresponding l entropy values. The final descriptor of node v_{i} is then given as

f_{i} = [H(p_{i}^{(α_{1})}), . . . , H(p_{i}^{(α_{l})})]^{T} ∈ R^{l}. (11.4)

The entire procedure for calculating the role descriptors is sketched in Figure 11.2 and can be summarized as outlined in Algorithm 9. The input for the method is the entire graph dataset G, a list of node labels L, a list of teleportation parameters αs, and the approximation threshold ε. For each node v in G, the algorithm stacks the entropy-based representations of the corresponding APPR vectors, denoted as p^{α}_{v}, to generate the role descriptor of v, i.e., emb_{v}. The algorithm finally fits a classification model on the collection of role descriptors and returns the resulting model for node classification.