the adhesion of $i$ with $s$ is larger than with $r$, the total adhesion of $s$ with $r$ decreases. Equivalent expressions can be found for removing a node $i$ from the community $s$ and rejoining it with $r$. For $\gamma = 1$ and $p_{ij} = k_i k_j / 2M$, one has $a_{is} + a_{ir} + 2c_{ii} = 0$, where $c_{ii} < 0$ by definition and is close to zero in all practical cases. Then, $a_{is}$ and $a_{ir}$ are either both positive and very small or have opposite signs. Choosing the node that gives the smallest $\Delta a_{rs}$ will then result in adding a node with a positive coefficient of adhesion to $s$. It is easy to see that this ensures a positive coefficient of cohesion in the set of nodes around $j$.

4.11 Benchmarking the Algorithm

In order to benchmark the performance of the Potts model approach to community detection, it is applied to computer-generated test networks. Networks with communities of equal and of different sizes were constructed. Those with equal sizes had 128 nodes, grouped into four communities of size 32. Those with differently sized communities had 320 nodes, grouped into four communities of size 32, 64, 96 and 128. In both types of networks, each node has an average degree of $\langle k \rangle = 16$. The average number of links to members of the same community, $\langle k_{in} \rangle$, and to members of different communities, $\langle k_{out} \rangle$, is then varied, always ensuring $\langle k_{in} \rangle + \langle k_{out} \rangle = \langle k \rangle$. Hence, decreasing $k_{in}$ renders the problem of community detection more difficult.
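The equal-size test networks described above can be generated along the following lines. This is a minimal sketch, not the construction used for the reported benchmarks: the function name `equal_size_benchmark` and its parameters are hypothetical, and it assumes that links are drawn independently with one probability inside communities and another between them, chosen so that the expected internal and external degrees are $\langle k_{in} \rangle$ and $\langle k \rangle - \langle k_{in} \rangle$. The networks with unequal community sizes are not covered.

```python
import random
from itertools import combinations


def equal_size_benchmark(n_communities=4, size=32, k_in=10.0, k_avg=16.0, seed=None):
    """Return (labels, edges) for a random graph with planted communities.

    Assumption: every pair of nodes is linked independently, with
    probability p_in inside a community and p_out across communities.
    """
    rng = random.Random(seed)
    n = n_communities * size
    labels = [node // size for node in range(n)]

    k_out = k_avg - k_in
    p_in = k_in / (size - 1)   # size - 1 possible partners inside
    p_out = k_out / (n - size)  # n - size possible partners outside
    if not (0.0 <= p_in <= 1.0 and 0.0 <= p_out <= 1.0):
        raise ValueError("k_in and k_avg do not give valid probabilities")

    edges = []
    for i, j in combinations(range(n), 2):
        p = p_in if labels[i] == labels[j] else p_out
        if rng.random() < p:
            edges.append((i, j))
    return labels, edges


if __name__ == "__main__":
    labels, edges = equal_size_benchmark(k_in=10.0, seed=1)
    print("nodes:", len(labels), "mean degree:", 2 * len(edges) / len(labels))
```

With the default arguments this gives 128 nodes in four communities of 32 and a mean degree that fluctuates around 16 from one realization to the next.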

To recover a known community structure, any algorithm has to fulfill two criteria: it has to group together nodes which belong to the same community by design, and it has to group apart nodes which belong to different communities by design. The first criterion is called “sensitivity” and measures the percentage of pairs of nodes which are correctly grouped together. The second criterion is called “specificity” and measures the percentage of pairs of nodes which are correctly grouped apart.
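The two pair-based measures can be computed directly from the designed and the recovered community assignments. The following is a minimal sketch; the function name `pair_sensitivity_specificity` is hypothetical, and both assignments are assumed to be given as lists of community labels indexed by node. Only the equality of labels within each list matters, so the two lists may use different label names.

```python
from itertools import combinations


def pair_sensitivity_specificity(true_labels, found_labels):
    """Return (sensitivity, specificity) as fractions of node pairs."""
    same_total = same_hit = 0  # pairs together by design
    diff_total = diff_hit = 0  # pairs apart by design
    for i, j in combinations(range(len(true_labels)), 2):
        together_true = true_labels[i] == true_labels[j]
        together_found = found_labels[i] == found_labels[j]
        if together_true:
            same_total += 1
            same_hit += together_found
        else:
            diff_total += 1
            diff_hit += not together_found
    sensitivity = same_hit / same_total if same_total else float("nan")
    specificity = diff_hit / diff_total if diff_total else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    designed = [0, 0, 1, 1]
    recovered = [0, 0, 0, 1]  # node 2 is wrongly merged into community 0
    print(pair_sensitivity_specificity(designed, recovered))
```

In the usage example, one of the two designed pairs is kept together and two of the four pairs that should be apart are separated, so both measures evaluate to 0.5.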

Because of the Poisson nature of the degree distribution, a connection model of $p_{ij} = p$ was used. Figure 4.6 shows the result of this experiment in comparison with the results obtained from the algorithm of Girvan and Newman [45].

Clearly, both algorithms show high sensitivity and high specificity. However, the Potts model outperforms the GN algorithm on both types of networks in both sensitivity and specificity. When relaxing the Potts model Hamiltonian from random initial conditions at zero temperature, performance decreases, but is still as good as that of the GN algorithm.
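The two connection models referred to in this section, $p_{ij} = p$ and $p_{ij} = k_i k_j / 2M$, can be evaluated from an edge list as sketched below. The function name `connection_models` is hypothetical, and estimating the uniform probability $p$ as the link density $2M / (N(N-1))$ is an assumption of this sketch rather than a statement taken from the text.

```python
def connection_models(n_nodes, edges):
    """Return (p_uniform, p_config) for an undirected edge list.

    p_uniform is the link density 2M / (N (N - 1)) (an assumption about
    how p is estimated); p_config(i, j) evaluates k_i k_j / 2M.
    """
    degree = [0] * n_nodes
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    m = len(edges)
    p_uniform = 2 * m / (n_nodes * (n_nodes - 1))

    def p_config(i, j):
        # Degree-dependent connection probability k_i k_j / 2M.
        return degree[i] * degree[j] / (2 * m)

    return p_uniform, p_config


if __name__ == "__main__":
    # A triangle (0, 1, 2) with one pendant node 3 attached to node 2.
    p, p_ij = connection_models(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
    print(p, p_ij(2, 3))
```

For networks with a narrow degree distribution, such as the Poissonian test networks, the two models differ little, which is consistent with the choice of $p_{ij} = p$ for these benchmarks.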

[Figure 4.6 (four panels): sensitivity and specificity, on a scale from 0 to 1, plotted against $k_{in}$ from 6 to 12 for the Potts model with $q = 25$ and for the Girvan-Newman algorithm. Top row: 4x32 networks. Bottom row: 32+64+96+128 networks.]

Figure 4.6: Benchmarking the Potts model approach to community detection with networks of known community structure. Sensitivity measures the percentage of pairs of nodes correctly identified as belonging to the same community, and specificity measures the percentage of pairs of nodes correctly grouped into different communities. Top: 4 communities of 32 nodes each. Bottom: 4 communities of 32, 64, 96 and 128 nodes.

An important aspect is the dependence of the sensitivity (specificity) of the algorithm on the number of allowed spin states $q$. Figure 4.7 shows that as long as $q \geq 4$, i.e. at least the actual number of communities in the network, the value of $q$ is irrelevant. This result is also independent of the strength of the community structure under investigation, i.e. independent of $k_{in}$. Furthermore, it is necessary to study the stability of the results with respect to a change in $\gamma$. As Figure 4.7 shows, the better the community structure is defined, i.e. the greater $k_{in}$ is with respect to $\langle k \rangle$, the more stable the results are. The maxima of the curves for all values of $k_{in}$, however, coincide at $\gamma = 1$, i.e. at the point where the contribution of missing and existing links is equal. The same statements also apply to the specificity.

When exploring the community structure starting from a single node, the definitions of sensitivity and specificity have to be changed. Sensitivity is then measured as the percentage of nodes that are correctly identified as belonging to the community around the start node, and specificity as the percentage of nodes that are correctly identified as not belonging to the community around the start node.
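The node-based variants of the two measures can be sketched as follows. The function name `node_sensitivity_specificity` is hypothetical, and counting the start node itself among the nodes that belong to its community is an assumption of this sketch.

```python
def node_sensitivity_specificity(true_labels, start, found_members):
    """Return (sensitivity, specificity) for a community grown around start.

    true_labels: designed community label of every node.
    found_members: nodes the algorithm assigned to the community of start.
    Assumption: the start node is counted like any other member.
    """
    found = set(found_members)
    target = true_labels[start]
    inside = [v for v, c in enumerate(true_labels) if c == target]
    outside = [v for v, c in enumerate(true_labels) if c != target]

    # Fraction of designed members that were found.
    sensitivity = sum(v in found for v in inside) / len(inside)
    # Fraction of designed non-members that were kept out.
    specificity = (
        sum(v not in found for v in outside) / len(outside)
        if outside
        else float("nan")
    )
    return sensitivity, specificity


if __name__ == "__main__":
    designed = [0, 0, 0, 1, 1, 1]
    # Node 2 is missed and node 3 is wrongly included.
    print(node_sensitivity_specificity(designed, start=0, found_members=[0, 1, 3]))
```

In the usage example, two of the three designed members are found and two of the three outsiders are kept out, so both measures equal 2/3.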

[Figure 4.7 (two panels): Left: sensitivity, on a scale from 0 to 1, plotted against $q$ (axis ticks at 10 and 100) for $k_{in} = 6, 7, 8$ and $10$. Right: sensitivity plotted against $\gamma$ (axis ticks at 0.1, 1 and 10) for $k_{in} = 8, 10$ and $12$.]

Figure 4.7: Sensitivity of the Potts model approach to community detection as a function of the parameters of the algorithm for networks with four equal-sized communities of 32 nodes each. Left: Sensitivity as a function of the number of allowed spin states (communities) $q$ for different $k_{in}$. Right: Sensitivity as a function of $\gamma$ for different values of $k_{in}$ and with $q = 25$.

Figure 4.8 shows the results obtained for different values of $\langle k_{in} \rangle$ at $\gamma = 1$, using $p_{ij} = k_i k_j / 2M$ as the model of the connection probability. Note that this approach performs rather well over a large range of $\langle k_{in} \rangle$, with good sensitivity and specificity. In contrast to the benchmarks for running the simulated annealing on the entire network, as shown in Figure 4.6, the sensitivity is generally larger than the specificity. This shows that running the simulated annealing on the entire network tends to mistakenly group apart nodes that do not belong apart by design, while constructing the community around a given node tends to group together nodes that do not belong together by design. This behavior is understandable, since working on the entire network effectively amounts to implementing a divisive method, while starting from a single node means implementing an agglomerative method.

One real-world example with known community structure is the College Football network from Ref. [45]. It represents the game schedule of the 2000 season of Division I of the US college football league. The nodes in the network represent the 115 teams, while the links represent 613 different games played in the course of the year. The community structure of this network arises from the grouping into conferences of 8-12 teams each. On average, each team has 7 matches with members of its own conference and another 4 matches with members of different conferences. A parameter variation in $\gamma$ at ten values in the range $0.1 \leq \gamma \leq 1$ is performed. This allows for the estimation of the robustness of the result with respect to $\gamma$ and for the detection of possible hierarchies in the community structure, as low values of $\gamma$ will generally lead to a less diverse community assignment and larger communities. At each value of $\gamma$, the system is relaxed 50 times from a randomly assigned initial configuration at $T = 0$ using $q = 50$. The connection model chosen was again $p_{ij} = p$.

[Figure 4.8 (two panels): sensitivity and specificity, on a scale from 0.4 to 1, plotted against $k_{in}$ from 6 to 14.]

Figure 4.8: Benchmark of the algorithm for discovering the community around a given node in networks with known community structure. Networks of 128 nodes and four communities were used. The average degree of the nodes was fixed to 16, while the average number of intra-community links $\langle k_{in} \rangle$ was varied. Sensitivity measures the fraction of nodes correctly assigned to the community around the start node, while specificity measures the fraction of nodes correctly kept out of the community around the start node.

Figure 4.9 shows the resulting $115 \times 115$ co-appearance matrix, normalized and color coded. The ordering of the matrix corresponds to the assignment of the teams into conferences according to the game schedule, as indicated by the dashed lines. Apart from recovering the known community structure almost exactly, the Potts model is also able to detect inhomogeneities in the distribution of intra- and inter-conference games. For instance, one observes a large overlap of the Pacific Ten and Mountain West conferences and also a possible subdivision of the Mid American conference into two sub-conferences. This is due to the fact that geographically close teams are more likely to play against each other, as already pointed out in Ref. [45].

Figure 4.9: Co-appearance matrix for the football network. A parameter variation of $\gamma$ was performed with 10 values in the range $0.1 \leq \gamma \leq 1$. At each value, the system was relaxed 50 times from a random initial condition. The matrix ordering is taken from the assignment of teams into conferences according to the game schedule.
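A co-appearance matrix of this kind can be accumulated from the community assignments of all runs, here 10 values of $\gamma$ times 50 relaxations. The following is a minimal sketch; the function name `co_appearance` is hypothetical, and normalizing by the total number of runs, so that entries lie between 0 and 1, is an assumption about what "normalized" means here.

```python
def co_appearance(partitions):
    """Return the matrix of co-appearance frequencies.

    partitions: one list of community labels per run, all indexed by the
    same node ordering. Entry [i][j] is the fraction of runs in which
    nodes i and j were assigned to the same community (assumed
    normalization: division by the number of runs).
    """
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    runs = len(partitions)
    return [[c / runs for c in row] for row in counts]


if __name__ == "__main__":
    # Two runs on three nodes: node 1 switches community between runs.
    for row in co_appearance([[0, 0, 1], [0, 1, 1]]):
        print(row)
```

Ordering the rows and columns by a reference assignment, such as the conferences, then makes blocks of frequently co-appearing nodes visible along the diagonal.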