Algorithm 11 Ada-LLD

Input: Graph G = (V, E), label matrix Y_train, approximation threshold ε, teleportation parameters {α_{1}, . . . , α_{k}}

Output: Trained classifier f

1: // Compute label distributions at each scale
2: declare X ∈ R^{k×n×l}
3: for α_{j} ∈ {α_{1}, . . . , α_{k}} do
4:     declare X_{α_{j}} ∈ R^{n×l}
5:     for v_{i} ∈ V do
6:         ld_{i} ← Compute_LD(v_{i}, α_{j}, ε)   (see Algorithm 10)
7:         X_{α_{j}}[i, :] = ld_{i}
8:     end for
9:     X[j, :, :] = X_{α_{j}}
10: end for
11: // Train classifier with SGD
12: f = H_{2}(H_{1}(X))
13: f ← SGD(f, Y_train, ℓ)
14: return f
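A minimal sketch of the label-distribution stage of Algorithm 11 is given below, assuming a plain power-iteration personalized PageRank in place of the ε-approximate APPR routine of Algorithm 10; the function names, the toy graph, and the one-hot label matrix are illustrative, not the authors' implementation:

```python
import numpy as np

def ppr_vector(A, seed, alpha, n_iter=100):
    """Personalized PageRank vector for `seed` via power iteration.
    Stands in for the epsilon-approximate APPR of Algorithm 10."""
    n = A.shape[0]
    deg = A.sum(axis=1)
    deg[deg == 0] = 1.0
    P = A / deg[:, None]                 # row-stochastic transition matrix
    e = np.zeros(n); e[seed] = 1.0
    pi = e.copy()
    for _ in range(n_iter):
        pi = alpha * e + (1 - alpha) * pi @ P
    return pi

def label_distributions(A, Y, alphas):
    """Stack one n x l label-distribution matrix per teleportation alpha,
    yielding the tensor X of shape (k, n, l) from Algorithm 11."""
    n, l = Y.shape
    X = np.zeros((len(alphas), n, l))
    for j, alpha in enumerate(alphas):
        for i in range(n):
            pi = ppr_vector(A, i, alpha)
            X[j, i, :] = pi @ Y          # PPR-weighted label distribution
    return X

# toy graph: two triangles joined by one edge, one-hot class labels
A = np.array([[0,1,1,0,0,0],
              [1,0,1,0,0,0],
              [1,1,0,1,0,0],
              [0,0,1,0,1,1],
              [0,0,0,1,0,1],
              [0,0,0,1,1,0]], dtype=float)
Y = np.array([[1,0],[1,0],[1,0],[0,1],[0,1],[0,1]], dtype=float)
X = label_distributions(A, Y, alphas=[0.01, 0.5, 0.9])
print(X.shape)  # (3, 6, 2)
```

Since each PPR vector sums to one and the label rows are one-hot, every ld_{i} is itself a probability distribution over the l classes.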

by simply multiplying an embedding matrix E ∈ R^{n×d} (where d is the dimensionality of the embedding) with the pre-processed adjacency matrix Â as in [142]. Note that the learned topological embeddings consider only direct neighbors. Those embeddings can be fused into our model by simply concatenating them to the hidden layer H_{1}. In particular, we choose the simplest of our model variants, LD_AVG:

H_{1} = q([ÂE, (γ_{1}X_{α_{1}} + · · · + γ_{k}X_{α_{k}})W_{avg} + b_{avg}]). (13.10)

The embedding matrix E is initialized randomly and learned together with the rest of the model.
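Equation (13.10) can be sketched as a forward pass as follows; the nonlinearity q is taken here to be ReLU, and all shapes (n nodes, l labels, d embedding dimensions, h hidden units, k scales) as well as the random placeholder matrices are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, l, d, h, k = 6, 2, 4, 3, 3           # nodes, labels, emb. dim, hidden, scales

A_hat = rng.random((n, n))              # pre-processed adjacency (placeholder)
E = rng.normal(size=(n, d))             # embedding matrix, learned jointly
X = rng.random((k, n, l))               # label distributions per scale
gammas = np.array([0.5, 0.3, 0.2])      # scale weights gamma_1..gamma_k
W_avg = rng.normal(size=(l, h))
b_avg = np.zeros(h)

relu = lambda z: np.maximum(z, 0)

# LD_AVG term: weighted average of the per-scale label-distribution matrices
avg = np.tensordot(gammas, X, axes=1)   # shape (n, l)
# Eq. (13.10): concatenate topological and label-based features, then apply q
H1 = relu(np.concatenate([A_hat @ E, avg @ W_avg + b_avg], axis=1))
print(H1.shape)  # (6, 7) -> d + h columns per node
```

The concatenation makes the fusion trivially differentiable, so E, the γ weights, W_avg and b_avg can all be trained jointly by SGD as in Algorithm 11.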

Despite being simple, the above model is already sufficiently expressive to demonstrate how combinations of Ada-LLD with other types of node features can improve classification accuracy. More sophisticated combinations as well as incorporating additional node attributes will be subject to future work.

Figure 13.2: Accuracy scores for the three benchmark data sets: (a) Cora, (b) CiteSeer, (c) Pubmed.

multilabel prediction tasks, against state-of-the-art methods. For both tasks, we compare our models against the following approaches:

• Adj: a baseline approach which learns node embeddings only based on the information contained in the adjacency matrix

• GCN1_only_L: a GCN model which applies convolution on the label matrix. We use one convolution layer with the adjacency matrix without self-links, followed by a dense output layer ^{1}

• GCN2: the standard 2-layer GCN as in [142] without using the node attributes

• DeepWalk: the DeepWalk model as proposed in [214]

• node2vec: the node2vec model as proposed in [103]

• Planetoid-G: the Planetoid variant which does not use information from
node attributes [271]^{2}

1We use only a single convolution layer for the reason stated in Section 13.2.

2Unless stated differently, we use for all competitors the parameter settings as suggested by the corresponding authors. Except for minor adaptations, e.g., to include label information in the one-layer GCN model or to make the Planetoid models applicable to multilabel prediction tasks, we use the original implementations as published by the corresponding authors.

For the multiclass problems, we additionally compare against two label propagation approaches, i.e., 2-step LP, the two-step label propagation approach proposed in [208], and Dynamic LP.

Furthermore, we study the benefits of combining label distributions from multiple scales and the effect of combining our embeddings with homophily-based embeddings (13.4). We also analyze whether our label-based approach is competitive with methods which additionally take node attributes into consideration. Finally, we show that the superiority of our methods is due to their high adaptivity to the local label neighborhood.

### Multiclass Prediction

Experimental Setup

We use the following three text classification benchmark graph datasets [229, 191]:

• Cora. The Cora dataset contains 2’708 publications from seven categories in the area of ML. The citation graph consists of 2’708 nodes, 5’278 edges, 1’433 attributes and 7 classes.

• CiteSeer. The CiteSeer dataset contains 3’264 publications from six categories in the area of CS. The citation graph consists of 3’264 nodes, 4’536 edges, 3’703 attributes and 6 classes.

• Pubmed. The Pubmed dataset contains 19’717 publications which are related to diabetes and categorized into 3 classes. The citation graph consists of 19’717 nodes, 44’324 edges, 500 attributes and 3 classes^{3}.

For each graph, documents are denoted as nodes and undirected links between documents represent citation relationships. If node attributes are applied, bag-of-words representations are used as feature vectors for each document.

We split the data as suggested in [271], i.e., for labeled data our training sets contain 20 randomly selected instances per class, the test sets consist of 1’000 instances and the validation sets contain 500 instances, for each method. The remaining instances are used as unlabeled data. For comparison


3This turned out to be too large for the Dynamic LP approach and therefore we cannot show the results of Dynamic LP on the Pubmed network.

we use the prediction accuracy scores which we collected over 10 different data splits.
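The split described above (a class-balanced labeled training set, fixed-size test and validation sets, the rest unlabeled) can be sketched as follows; the function name, seed handling, and the toy label vector are ours, not from [271]:

```python
import numpy as np

def planetoid_split(labels, n_per_class=20, n_test=1000, n_val=500, seed=0):
    """Return index sets: n_per_class labeled nodes per class for training,
    then disjoint test and validation sets drawn from the remainder."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        train.extend(rng.choice(idx, size=n_per_class, replace=False))
    rest = rng.permutation(np.setdiff1d(np.arange(len(labels)), train))
    test, val = rest[:n_test], rest[n_test:n_test + n_val]
    unlabeled = rest[n_test + n_val:]        # remaining nodes stay unlabeled
    return np.array(train), test, val, unlabeled

labels = np.repeat(np.arange(3), 400)        # toy: 3 classes, 1200 nodes
tr, te, va, un = planetoid_split(labels, n_test=200, n_val=100)
print(len(tr), len(te), len(va))  # 60 200 100
```

Re-running this with 10 different seeds yields the 10 data splits over which the reported accuracy scores are collected.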

Since the number of iterations for sampling the graph contexts and the label contexts for Planetoid is suggested only for the CiteSeer data set, we adapted these values relative to the number of nodes for each graph. For node2vec, we perform grid searches over the hyperparameters p and q with p, q ∈ {0.25, 0.5, 1.0, 2.0, 4.0} and window size 10 as proposed by the authors.
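The node2vec grid search can be sketched generically; `embed_and_score` is a hypothetical placeholder for training node2vec with the given p and q and returning validation accuracy:

```python
from itertools import product

def grid_search_pq(embed_and_score, grid=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """Exhaustive search over node2vec's return (p) and in-out (q) parameters."""
    best_pq, best_score = None, float("-inf")
    for p, q in product(grid, grid):
        score = embed_and_score(p=p, q=q)    # e.g. validation accuracy
        if score > best_score:
            best_pq, best_score = (p, q), score
    return best_pq, best_score

# toy stand-in scorer: pretend accuracy peaks at p=0.5, q=2.0
best_pq, best_acc = grid_search_pq(lambda p, q: 1.0 - abs(p - 0.5) - abs(q - 2.0))
print(best_pq)  # (0.5, 2.0)
```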

For all models except Planetoid, unless otherwise noted, we use one hidden layer with 16 neurons; the learning rate and training procedure are used as proposed in [142]. Regarding our models, we use α ∈ {0.01, 0.5, 0.9} as values for the teleportation parameter and ε = 1e^{−5} as approximation threshold to compute the APPR vectors for each node. For LD_CONCAT, LD_INDP and LD_SHARED we use 16 hidden neurons per APPR matrix in the hidden layer.

Results

Figure 13.2 shows boxplots depicting the micro F1 scores we achieved for the multiclass prediction task for each considered model on the three benchmark datasets Cora, CiteSeer and Pubmed.

Our models improve upon the best results produced by node2vec, which demonstrates that the label distributions are indeed a useful source of information, although the evaluation for GCN1_only_L shows, especially for Pubmed, rather poor results. This is due to this model considering only the label distribution of a very local neighborhood (in fact, one-hop neighbors). However, collecting the label distribution from more spacious neighborhoods gives a significant boost in terms of prediction accuracy. For the Cora network, the gain in accuracy of LD_INDP, when compared to the result of node2vec, is more than 13%. Moreover, for all datasets, our models perform similarly, which shows that even simple models with shared weights are able to match the performance of more complex models.

### Multilabel Classification

Experimental Setup

We also perform multilabel node classification on the following two multilabel networks:

• BlogCatalog [245]. This is a social network graph where each of the 10,312 nodes corresponds to a user and the 333,983 edges represent the