Algorithm 11 Ada-LLD
Input: Graph G = (V, E), label matrix Ytrain, approximation threshold ε, teleportation parameters {α1, . . . , αk}
Output: Trained classifier f
1: // Compute label distributions at each scale
2: declare X ∈ R^{k×n×l}
3: for αj ∈ {α1, . . . , αk} do
4:   declare Xαj ∈ R^{n×l}
5:   for vi ∈ V do
6:     ldi ← Compute_LD(vi, αj, ε)   (see Algorithm 10)
7:     Xαj[i, :] = ldi
8:   end for
9:   X[j, :, :] = Xαj
10: end for
11: // Train classifier with SGD
12: f = H2(H1(X))
13: f ← SGD(f, Ytrain, ℓ)
14: return f
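To make the flow of Algorithm 11 concrete, the following is a minimal sketch in Python. The helper compute_ld stands in for Algorithm 10 (not reproduced here) and is assumed to return the APPR-based label distribution of a node as a length-l vector; the class name AdaLLD, the LD_AVG-style hidden layer with learnable scale weights γ, and the SGD settings are illustrative assumptions rather than the exact implementation.

# Minimal sketch of Algorithm 11 (Ada-LLD). compute_ld stands in for
# Algorithm 10 and is assumed to return the APPR-weighted label
# distribution ld_i of a node as a length-l vector.
import numpy as np
import torch
import torch.nn as nn

def build_label_distributions(graph, nodes, alphas, eps, num_classes):
    """Stack the per-scale matrices X_alpha_j into X in R^{k x n x l}."""
    k, n, l = len(alphas), len(nodes), num_classes
    X = np.zeros((k, n, l), dtype=np.float32)
    for j, alpha in enumerate(alphas):
        for i, v in enumerate(nodes):
            X[j, i, :] = compute_ld(graph, v, alpha, eps)  # Algorithm 10 (assumed)
    return torch.from_numpy(X)

class AdaLLD(nn.Module):
    """Classifier f = H2(H1(X)); H1 here follows the LD_AVG variant with
    learnable scale weights gamma_j (one possible instantiation)."""
    def __init__(self, k, num_classes, hidden=16):
        super().__init__()
        self.gamma = nn.Parameter(torch.full((k,), 1.0 / k))  # scale weights
        self.H1 = nn.Linear(num_classes, hidden)               # W_avg, b_avg
        self.H2 = nn.Linear(hidden, num_classes)

    def forward(self, X):                                  # X: (k, n, l)
        avg = torch.einsum('j,jnl->nl', self.gamma, X)     # weighted sum over scales
        return self.H2(torch.relu(self.H1(avg)))

# Training with SGD on the labeled nodes (cross-entropy as the loss):
# model = AdaLLD(k=len(alphas), num_classes=l)
# optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# loss = nn.functional.cross_entropy(model(X)[train_idx], y_train)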
by simply multiplying an embedding matrix E ∈ R^{n×d} (where d is the dimensionality of the embedding) to the pre-processed adjacency matrix Â as in [142]. Note that the learned topological embeddings consider only direct neighbors. Those embeddings can be fused into our model by simply concatenating them to the hidden layer H1. In particular, we choose the simplest of our model variants, LD_AVG:
H1 = q[ ÂE, (γ1Xα1 + · · · + γkXαk)Wavg + bavg ] .    (13.10)
The embedding matrix E is initialized randomly and learned together with the rest of the model.
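As a sketch of how the fused hidden layer in Eq. (13.10) could be implemented, the snippet below concatenates the one-hop topological embeddings ÂE with the LD_AVG term. The nonlinearity q is assumed to be ReLU, and the class name, embedding dimension and output layer are illustrative choices, not the exact architecture.

import torch
import torch.nn as nn

class FusedLDAvg(nn.Module):
    """Hidden layer H1 = q[ A_hat E, (sum_j gamma_j X_alpha_j) W_avg + b_avg ]
    as in Eq. (13.10), with q assumed to be ReLU."""
    def __init__(self, n_nodes, num_classes, k, emb_dim=16, hidden=16):
        super().__init__()
        self.E = nn.Parameter(0.01 * torch.randn(n_nodes, emb_dim))  # learned with the model
        self.gamma = nn.Parameter(torch.full((k,), 1.0 / k))          # scale weights
        self.avg = nn.Linear(num_classes, hidden)                     # W_avg, b_avg
        self.out = nn.Linear(emb_dim + hidden, num_classes)           # output layer (assumed)

    def forward(self, A_hat, X):                    # A_hat: (n, n), X: (k, n, l)
        topo = A_hat @ self.E                       # \hat{A} E: one-hop topological part
        ld = torch.einsum('j,jnl->nl', self.gamma, X)
        H1 = torch.relu(torch.cat([topo, self.avg(ld)], dim=1))       # Eq. (13.10)
        return self.out(H1)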
Despite being simple, the above model is already sufficiently expressive to demonstrate how combinations of Ada-LLD with other types of node features can improve classification accuracy. More sophisticated combinations, as well as incorporating additional node attributes, are left to future work.
Figure 13.2: Accuracy scores for the three benchmark data sets: (a) Cora, (b) CiteSeer, (c) Pubmed.
We evaluate our models on multiclass and multilabel prediction tasks against state-of-the-art methods. For both tasks, we compare our models against the following approaches:
• Adj: a baseline approach which learns node embeddings only based on the information contained in the adjacency matrix
• GCN1_only_L: a GCN model which applies convolution on the label matrix. We use one convolution layer with the adjacency matrix without self-links, followed by a dense output layer¹
• GCN2: the standard 2-layer GCN as in [142] without using the node attributes
• DeepWalk: the DeepWalk model as proposed in [214]
• node2vec: the node2vec model as proposed in [103]
• Planetoid-G: the Planetoid variant which does not use information from node attributes [271]²
¹ We use only a single convolution layer due to the reason stated in Section 13.2.
² Unless stated differently, we use for all competitors the parameter settings as suggested by the corresponding authors. Except for minor adaptations, e.g., to include label information in the one-layer GCN model or to make the Planetoid models applicable to multilabel prediction tasks, we use the original implementations as published by the corresponding authors.

For the multiclass problems, we additionally compare against two label propagation approaches: 2-step LP, the two-step label propagation approach proposed in [208], and Dynamic LP.
Furthermore, we study the benefits of combining label distributions from multiple scales and the effect of combining our embeddings with homophily-based embeddings (Section 13.4). We also analyze whether our label-based approach is competitive with methods which additionally take node attributes into consideration. Finally, we show that the superiority of our methods is due to their high adaptivity to the local label neighborhood.
Multiclass Prediction
Experimental Setup
We use the following three text classification benchmark graph datasets [229, 191]:
• Cora. The Cora dataset contains 2'708 publications from seven categories in the area of ML. The citation graph consists of 2'708 nodes, 5'278 edges, 1'433 attributes and 7 classes.
• CiteSeer. The CiteSeer dataset contains 3’264 publications from six categories in the area of CS. The citation graph consists of 3’264 nodes, 4’536 edges, 3’703 attributes and 6 classes.
• Pubmed. The Pubmed dataset contains 19'717 publications which are related to diabetes and categorized into 3 classes. The citation graph consists of 19'717 nodes, 44'324 edges, 500 attributes and 3 classes³.

For each graph, documents are denoted as nodes and undirected links between documents represent citation relationships. If node attributes are applied, bag-of-words representations are used as feature vectors for each document.
We split the data as suggested in [271], i.e., for labeled data our training sets contain 20 randomly selected instances per class, the test sets consist of 1'000 instances, and the validation sets contain 500 instances, for each method. The remaining instances are used as unlabeled data. For comparison, we use the prediction accuracy scores which we collected over 10 different data splits.
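A minimal sketch of such a split (the function name, random seed handling and exact sampling details are assumptions; the fixed sizes mirror the description above):

import numpy as np

def make_split(y, num_per_class=20, num_test=1000, num_val=500, seed=0):
    """20 labeled nodes per class for training, 1,000 test and 500 validation
    nodes; all remaining nodes are treated as unlabeled."""
    rng = np.random.default_rng(seed)
    train = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=num_per_class, replace=False)
        for c in np.unique(y)
    ])
    rest = rng.permutation(np.setdiff1d(np.arange(len(y)), train))
    test, val = rest[:num_test], rest[num_test:num_test + num_val]
    unlabeled = rest[num_test + num_val:]
    return train, val, test, unlabeled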
³ This turned out to be too large for the Dynamic LP approach and therefore we cannot show the results of Dynamic LP on the Pubmed network.
Since the numbers of iterations for sampling the graph contexts and the label contexts for Planetoid are suggested only for the CiteSeer data set, we adapted these values relative to the number of nodes for each graph. For node2vec, we perform grid searches over the hyperparameters p and q with p, q ∈ {0.25, 0.5, 1.0, 2.0, 4.0} and window size 10, as proposed by the authors.
For all models except Planetoid, unless otherwise noted, we use one hidden layer with 16 neurons; the learning rate and training procedure are as proposed in [142]. Regarding our models, we use α ∈ {0.01, 0.5, 0.9} as values for the teleportation parameter and ε = 1e−5 as the approximation threshold to compute the APPR vectors for each node. For LD_CONCAT, LD_INDP and LD_SHARED we use 16 hidden neurons per APPR matrix in the hidden layer.
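For reference, an ε-approximate personalized PageRank vector for a given teleportation parameter α can be computed with a push-style procedure in the spirit of the well-known push algorithm for approximate PPR; the sketch below is one common (non-lazy) variant and is not the exact Algorithm 10 used to compute the label distributions in this chapter.

from collections import defaultdict

def appr_push(adj, seed, alpha, eps):
    """One common push-style approximation of the personalized PageRank
    vector of `seed`; adj maps each node to its list of neighbours."""
    p = defaultdict(float)   # approximate PPR mass
    r = defaultdict(float)   # residual mass
    r[seed] = 1.0
    queue = [seed]
    while queue:
        u = queue.pop()
        deg_u = len(adj[u])
        if deg_u == 0 or r[u] < eps * deg_u:
            continue                      # nothing (left) to push for u
        mass, r[u] = r[u], 0.0
        p[u] += alpha * mass              # keep an alpha-fraction at u
        spread = (1.0 - alpha) * mass / deg_u
        for v in adj[u]:
            r[v] += spread                # distribute the rest to the neighbours
            if r[v] >= eps * len(adj[v]):
                queue.append(v)
    return p

In this formulation, a small α (e.g., 0.01) lets probability mass diffuse far from the seed node, while a large α (e.g., 0.9) keeps most of the mass in the immediate neighborhood, which is what allows the different teleportation parameters to capture label distributions at different scales.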
Results
Figure 13.2 shows boxplots depicting the micro F1 scores achieved in the multiclass prediction task by each considered model on the three benchmark datasets Cora, CiteSeer and Pubmed.
Our models improve on the best results produced by node2vec, which demonstrates that the label distributions are indeed a useful source of information, even though the evaluation of GCN1_only_L shows rather poor results, especially for Pubmed. This is because that model considers only the label distribution of a very local neighborhood (in fact, the one-hop neighbors). However, collecting the label distribution from larger neighborhoods gives a significant boost in prediction accuracy. For the Cora network, the accuracy gain of LD_INDP over the result of node2vec is more than 13%. Moreover, our models perform similarly to each other on all datasets, which shows that even simple models with shared weights are able to match the performance of more complex models.
Multilabel Classification
Experimental Setup
We also perform multilabel node classification on the following two multilabel networks:
• BlogCatalog [245]. This is a social network graph where each of the 10,312 nodes corresponds to a user and the 333,983 edges represent the