Methods for embedding entire graphs with the purpose of graph classification can be sorted into kernel-based and feature-based methods. Kernel-based methods focus on deriving similarity models, which typically perform an implicit feature transformation and rely on kernelized classification models. Feature-based methods compute continuous graph representations which are fed to a generic classification model.

### Kernel-based Methods

Established graph kernels include the Random-Walk (RW) [45], Shortest-Path (SP) [44], Weisfeiler-Lehman (WL) [231] and Graphlet Kernel (GK) [232].

The first two methods rely on additional node labels and count visits to nodes with a certain label in random walks or via shortest paths, respectively. The WL kernel is based on the Weisfeiler-Lehman graph isomorphism test and performs a relabeling of initial node labels. The graphlet kernel counts occurrences of small induced non-isomorphic subgraphs and does not consider additional node labels. A problem with all of the above kernels is that correlations between feature dimensions are not taken into account. This problem is addressed, e.g., in [269, 192, 11] by learning hidden representations of the substructures counted by the respective graph kernels. However, it should be noted that all of the above methods suffer from rather high complexity.
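The WL relabeling step can be sketched as follows; the graph representation, the function name, and the compression of signatures into fresh integer labels are illustrative choices, not the reference implementation:

```python
from collections import Counter

def wl_label_histogram(adj, labels, iterations=2):
    """Weisfeiler-Lehman relabeling: repeatedly compress each node's
    (own label, sorted neighbor labels) signature into a fresh integer
    label, then return the histogram of final labels.

    adj: dict mapping each node to a list of neighbors.
    labels: dict mapping each node to an initial integer label.
    """
    for _ in range(iterations):
        # signature = own label plus the sorted multiset of neighbor labels
        sigs = {v: (labels[v], tuple(sorted(labels[u] for u in adj[v])))
                for v in adj}
        # compress distinct signatures into fresh compact integer labels
        compress = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        labels = {v: compress[s] for v, s in sigs.items()}
    return Counter(labels.values())
```

The WL kernel then compares two graphs via the inner product of such label histograms accumulated over iterations.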

Other works focus on directly specifying graph metric spaces. A classical example is the Graph Edit Distance [223], which minimizes the number of edit operations needed to transform one graph into another. Such approaches are usually NP-hard and rely on heuristics [31, 196]. For instance, [31] proposes a family of tractable graph metrics that approximate two common intractable metrics and provides extensions to incorporate additional node attributes. In [196], graphs are represented as bags of node embedding vectors and compared using the Earth Mover's Distance (EMD), where additional node labels may be incorporated by matching only nodes with the same label.
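As an illustration of this bag-of-embeddings comparison, the following sketch computes the EMD between two small, equal-size bags with uniform weights, in which case the EMD reduces to a minimum-cost one-to-one matching (solved here by brute force; real systems use dedicated EMD/optimal-transport solvers):

```python
from itertools import permutations
from math import dist

def emd_equal_bags(X, Y):
    """Earth Mover's Distance between two equal-size bags of embedding
    vectors with uniform weights. With equal sizes and uniform weights
    the EMD equals the cost of the best one-to-one matching, found here
    by trying every permutation (only viable for tiny bags)."""
    assert len(X) == len(Y)
    n = len(X)
    return min(
        sum(dist(X[i], Y[p[i]]) for i in range(n)) / n
        for p in permutations(range(n))
    )
```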

### Feature-based Methods

Earlier works on feature-based graph classification rely on handcrafted feature embeddings. Notably, NetSimile [34] extracts structural features (such as node degree and clustering coefficient) for each node and aggregates them per graph using different aggregation functions, including mean and standard deviation. Similar to graphlet kernels, another line of research focuses on describing graphs by decomposing them into subgraphs. Subgraph2vec [192] computes subgraph embeddings using a SkipGram model, where subgraphs are rooted at nodes and the context of a subgraph consists of the subgraphs rooted at the neighbors of its root.

The more recent method GE-FSG [194] represents graphs as bags of frequent subgraphs and learns graph embeddings using a document embedding technique. GAM [150] addresses the problems of scalability and noise by using an attention model to focus on small and informative parts of a graph. However, the method relies on additional node attributes. Sub2vec [11] also focuses on embedding subgraphs but does not attempt to represent graphs by their subgraphs.

While GraphWave [82] considers only node embeddings, other diffusion-based methods focus on embedding whole graphs. DeepGraph [158] computes graph embeddings based on heat kernel signatures. However, the final embeddings are trained end-to-end for predicting network growth. In addition to the heat kernel signature, NetLSD [252] further considers the wave kernel signature. As in GraphWave, signatures of different scales are concatenated in order to obtain a multi-scale representation, and a k-th order approximation is performed to make the eigendecomposition of the Laplacian scalable. Message Passing Networks [98] are a class of neural network models for graphs. The primary focus of these methods lies on learning embeddings from node attributes, and they cannot be applied out-of-the-box to classify general non-attributed graphs. Patchy-San [195] addresses this problem by considering auxiliary node labels such as degree or PageRank centrality.
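To make the heat-kernel idea concrete, the following sketch evaluates the heat trace h(t) = Σ_i exp(−t λ_i) over the eigenvalues λ_i of the graph Laplacian at several scales t; it uses the unnormalized Laplacian and a full eigendecomposition for simplicity, whereas NetLSD itself works with the normalized Laplacian and low-order approximations for scalability:

```python
import numpy as np

def heat_trace_signature(A, ts):
    """Multi-scale heat trace of a graph given its adjacency matrix A:
    h(t) = sum_i exp(-t * lambda_i) over the Laplacian eigenvalues,
    evaluated at each scale t in ts. A simplified sketch, not the
    NetLSD reference implementation."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A      # unnormalized graph Laplacian
    eigvals = np.linalg.eigvalsh(L)     # eigenvalues of the symmetric L
    return np.array([np.exp(-t * eigvals).sum() for t in ts])
```

At t = 0 the trace equals the number of nodes, and as t grows it approaches the number of connected components, so small and large scales capture local and global structure, respectively.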

### Homophily-Based Node Embedding

The work presented in this chapter has been published as the article LASAGNE: Locality and Structure Aware Graph Node Embeddings in the Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI), 2018 [92]. The work has been awarded the Best Student Paper Award.

A preliminary version, which serves as a technical report, has been published on arXiv as preprint arXiv:1710.06520, 2017 [91].

### 10.1 Introduction

Graphs are a common way to describe interactions between entities. The entities are modeled as nodes, and the interactions between pairs of entities are represented by edges between nodes. Describing nodes of a graph as low-dimensional vectors has the advantage that many popular machine learning algorithms can be applied directly, and it is useful in many areas like visualization, link prediction, classification, etc. [176, 166, 13, 38]. Motivated by this, so-called representation learning methods for graph vertices, e.g., [214, 244, 103], focus on learning vectors to represent information in neighborhoods around a node, e.g., nodes within a short geodesic distance or nodes encountered in random walks starting at a given node.

Somewhat more formally, let G = (V, E) be a graph, with V = {v_{1}, . . . , v_{N}} being the set of nodes and E = {e | e ∈ V × V} being the set of (undirected) edges. The general goal is to find a vector embedding or latent representation for each node v_{i} such that the resulting set of embedded nodes \mathcal{E} = {f(v_{i}) | v_{i} ∈ V} in the d-dimensional vector space R^{d} still reflects structural properties of G. For instance, such structural properties could be the similarity of the neighborhoods of two nodes v_{i} and v_{j}. The neighborhood N(v) of a node v is defined as the set of nodes having the highest probabilities to be visited by a random walk starting from node v, a geodesic walk starting from v, or some other related process. This means that if N(v_{i}) ≈ N(v_{j}) holds in the original graph space, it should also hold that f(v_{i}) ≈ f(v_{j}) in R^{d}.
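This neighborhood definition can be made concrete with a small sketch that ranks nodes by their accumulated visit probabilities under a short random walk; the function name and the choices of k and the number of steps are illustrative:

```python
import numpy as np

def top_k_neighborhood(A, v, k, steps):
    """N(v) approximated as the k nodes (other than v) with the highest
    accumulated visit probability over the first `steps` steps of a
    random walk started at v, given the adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    P = A / A.sum(axis=1, keepdims=True)    # row-stochastic transition matrix
    p = np.zeros(len(A))
    p[v] = 1.0
    visits = np.zeros(len(A))
    for _ in range(steps):
        p = p @ P                           # one step of the walk
        visits += p
    visits[v] = -np.inf                     # exclude the start node itself
    return {int(u) for u in np.argsort(visits)[-k:]}
```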

The intuition behind these representation learning methods is that nodes having similar neighborhoods are similar to each other, and thus one can use information in the neighbors of a node to make predictions for a given node.

Defining the right neighborhood for each node, however, is a challenging task. For example, in unsupervised multi-label classification, the labels of the nodes define the underlying local structure for a particular class, but often this does not necessarily overlap significantly with the local structure defined by the edge connectivity of the graph. Moreover, realistic graphs typically have large-scale properties that are very poorly structured with respect to the behavior of random walks [152, 155, 153, 134, 10, 9].

The basic assumption of random walk based methods (as well as other related methods) is that nodes visited more often than others by random walks starting from a particular node are also more useful to describe that node in terms of downstream prediction tasks. However, the problem with random walks is that typically most of the graph can be reached within a few steps, and thus information about where the random walk began (which is the node for which these methods are computing the embedding) is quickly lost.
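This mixing behavior is easy to observe numerically. The following sketch measures the total-variation distance between the t-step random-walk distribution from a start node and the stationary (degree-proportional) distribution; on well-connected graphs this distance shrinks after only a few steps, which is exactly the loss of locality described above:

```python
import numpy as np

def tv_to_stationary(A, start, steps):
    """Total-variation distance between the `steps`-step random-walk
    distribution started at `start` and the stationary distribution of
    the walk on the graph with adjacency matrix A; a small diagnostic
    for how fast locality is lost."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    P = A / deg[:, None]        # row-stochastic transition matrix
    pi = deg / deg.sum()        # stationary distribution of the walk
    p = np.zeros(len(A))
    p[start] = 1.0
    for _ in range(steps):
        p = p @ P
    return 0.5 * np.abs(p - pi).sum()
```

On a complete graph on four nodes, for instance, the walk is already close to stationary after two steps.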

This issue is particularly problematic for extremely sparse graphs with upward-sloping Network Community Profiles (NCPs) [152, 155, 153] and for flat NCPs [134] (expander-like graphs) or deep k-cores [10, 9]. These properties are ubiquitous among realistic social and information networks.

This suggests that, unless carefully engineered, embedding methods based on random walks will perform sub-optimally, since the random walks will mix rapidly, thereby degrading the local information that one hopes they identify.

In this chapter, we explore these issues, and we present a method which takes the local neighborhood structure of each node in the graph individually into account. This leads to improved embedding vectors and improved results in downstream applications for graphs.

Our method, Lasagne, is an unsupervised algorithm for learning locality and structure aware graph node embeddings. It uses an Approximate Personalized PageRank vector [22] to adapt and improve state-of-the-art methods for determining the importance of the nodes in a graph from a specific node's point of view. The proposed methodology is easily parallelizable, even in distributed environments, and the methods we adapt have even been shown to scale to graphs with billions of nodes on a single machine [234].
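For illustration, the push procedure of Andersen, Chung, and Lang [22] for computing an Approximate Personalized PageRank vector can be sketched as follows; this is a simplified scan-based variant with illustrative parameter defaults, whereas efficient implementations maintain a work queue of nodes with large residuals:

```python
def approximate_ppr(adj, seed, alpha=0.15, eps=1e-4):
    """Approximate Personalized PageRank via repeated 'push' operations:
    any node whose residual mass is large relative to its degree keeps
    an alpha fraction in the PPR estimate, retains half of the rest
    (lazy walk), and spreads the other half to its neighbors.

    adj: dict mapping each node to a list of neighbors.
    """
    p = {u: 0.0 for u in adj}   # approximate PPR mass
    r = {u: 0.0 for u in adj}   # residual mass still to be pushed
    r[seed] = 1.0
    active = True
    while active:
        active = False
        for u in adj:
            du = len(adj[u])
            if du and r[u] >= eps * du:
                active = True
                p[u] += alpha * r[u]
                share = (1 - alpha) * r[u] / (2 * du)
                r[u] = (1 - alpha) * r[u] / 2
                for v in adj[u]:
                    r[v] += share
    return p
```

Because the residual threshold is scaled by node degree, the resulting vector is supported on a small region around the seed, which is the source of the locality that Lasagne exploits.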

We evaluate our algorithm with multi-label classification and link prediction on several real-world datasets from different domains under real-life conditions. Our evaluations show that, in terms of prediction accuracy, our algorithm achieves better results than the state-of-the-art methods, especially for downstream machine learning tasks whose objectives are sensitive to local information, and it achieves similar results for link prediction. As has been described previously [155, 134, 10, 9], and as we review in Section 10.4, graphs with flat NCPs and many deep k-core nodes have local structure that is particularly difficult to identify and exploit. Importantly, our empirical results for this class of graphs are substantially improved relative to previous methods. This illustrates that, by carefully engineering locally-biased information into node embeddings, one can obtain improved results even for this class of graphs, without sacrificing quality on other, less poorly-structured graphs.

We also illustrate several reasons why random walk based methods do not perform as expected in practice, justifying our interpretation that our method leads to improved results due to the manner in which we engineer in locality.

The remainder of this chapter is organized as follows: in Section 10.2, we survey related work, including the word2vec framework and the approximate computation of Personalized PageRank; in Section 10.3, we describe our main Lasagne algorithm; in Section 10.4, we present the evaluation of our method and a discussion of disadvantages of previous random walk based methods; and in Section 10.5, we present a brief conclusion.