
2.4 Topical Opinion Leader Identification Framework

2.4.1 Topic-sensitive Influence Measure

We first conduct topic preprocessing to represent the topic interest of each user, and then confirm the existence of homophily in our dataset. Building on the preprocessing and this finding, this section proposes a novel approach to measuring users’ topic-sensitive influence.

Table 2.2 lists the descriptions of notations.


Table 2.2: Notation descriptions.

| Notation | Description |
|---|---|
| $n$ | the total number of users |
| $s$ | the total number of unique topics |
| $A$, $Q$ | $n \times s$ matrices, where $A_{i,t}$ / $Q_{i,t}$ contains the count of topic $t$ in user $u_i$’s answers / questions |
| $V$ | $n \times s$ matrix, where $V_{i,t}$ contains the number of votes received by user $u_i$ in topic $t$ |
| $AM$, $QM$ | $n \times 7$ matrices, where $AM_{i,t}$ / $QM_{i,t}$ contains the count of major topic $t$ in user $u_i$’s answers / questions |
| $CM$ | $n \times 7$ matrix, where $CM_{i,t}$ contains the count of major topic $t$ in user $u_i$’s posts (questions and answers), i.e., $CM_{i,t} = AM_{i,t} + QM_{i,t}$ |

Topic Preprocessing. The purpose of topic preprocessing is to identify each user’s topic interest. In Zhihu, each post is typically related to several unique topics, so the posts published by a single user span a much larger number of unique topics. Representing a user’s topic interest directly with these unique topics is therefore unwieldy because of their number and diversity.

To this end, we exploit the tree-like topic structure of Zhihu and aggregate these topics into seven major topics, which cover all the topic fields in Zhihu. Besides the six child topics of the root topic, we select one more representative topic, “Science & Technology”, which had not been placed in the topic structure because of errors in Zhihu’s topic organization.

With this aggregation method, the topics of each post by each user are mapped to the corresponding major topics according to the topic relationships in the topic structure.
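The aggregation step can be sketched as follows. This is a minimal illustration, assuming the topic structure is available as child-to-parent links; the topic names other than “Science & Technology”, the helper names `to_major` and `major_topic_counts`, and the choice to count a post once per distinct major topic are all assumptions made for the example, not details given in the text.

```python
from collections import Counter

# Hypothetical fragment of a topic tree: child -> parent. Names are illustrative only.
PARENT = {
    "Python": "Programming",
    "Programming": "Science & Technology",
    "Basketball": "Sports",
    "Sports": "Life",
}
# Two stand-ins for the seven major topics, for brevity ("Life" is hypothetical).
MAJOR_TOPICS = {"Science & Technology", "Life"}

def to_major(topic):
    """Walk up the parent links until a major topic is reached (None if there is none)."""
    seen = set()
    while topic is not None and topic not in MAJOR_TOPICS:
        if topic in seen:  # guard against malformed, cyclic data
            return None
        seen.add(topic)
        topic = PARENT.get(topic)
    return topic

def major_topic_counts(posts):
    """Count major topics over a user's posts; each post is a list of topic names."""
    counts = Counter()
    for post_topics in posts:
        # Assumption: a post counts once per distinct major topic it maps to.
        counts.update({m for m in map(to_major, post_topics) if m is not None})
    return counts

posts = [["Python", "Basketball"], ["Programming"]]
cm_row = major_topic_counts(posts)  # one user's row of CM, keyed by major topic
```

The number of distinct major topics a single post maps to is also what a $k$-topic classification of that post would be based on.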

To identify each user’s topic interest, we first compute the topic interest of each user’s questions and answers over the major topics separately. We row-normalize $AM$ and $QM$ into $AM'$ and $QM'$ such that $\|AM'_{i,\cdot}\|_1 = 1$ for each row $AM'_{i,\cdot}$ and $\|QM'_{i,\cdot}\|_1 = 1$ for each row $QM'_{i,\cdot}$. Each row of these two matrices denotes the probability distribution of a user’s interest in answers/questions. Using a distance metric for probability distributions [28], the topic interest difference $TD$ between the questions and answers of user $u_i$ can be calculated as:

$$TD_{QA}(i) = TD(AM'_{i,\cdot}, QM'_{i,\cdot}) = \sqrt{D_{KL}(AM'_{i,\cdot} \,\|\, M) + D_{KL}(QM'_{i,\cdot} \,\|\, M)} \qquad (2.2)$$

where $M = \frac{1}{2}(AM'_{i,\cdot} + QM'_{i,\cdot})$. $D_{KL}$ is the Kullback-Leibler divergence, which defines the divergence from distribution $H$ to $I$ as $D_{KL}(H \,\|\, I) = \sum_i H(i) \log \frac{H(i)}{I(i)}$.
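As a concrete reading of Equation 2.2, the sketch below row-normalizes two count vectors and computes their topic interest difference. It assumes the natural logarithm (the base is not stated in the text) and treats terms with $H(i) = 0$ as contributing zero; the function names and the example counts are hypothetical.

```python
import math

def row_normalize(counts):
    """Scale a row of non-negative counts so that it sums to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("row has no counts to normalize")
    return [c / total for c in counts]

def kl_divergence(h, m):
    """D_KL(H || M) = sum_i H(i) * log(H(i) / M(i)); terms with H(i) = 0 contribute 0."""
    return sum(p * math.log(p / q) for p, q in zip(h, m) if p > 0)

def topic_difference(p, q):
    """TD(P, Q) = sqrt(D_KL(P || M) + D_KL(Q || M)) with M = (P + Q) / 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    # max() guards against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(0.0, kl_divergence(p, m) + kl_divergence(q, m)))

# Hypothetical counts over the seven major topics for one user.
am_row = row_normalize([4, 0, 1, 0, 0, 3, 2])  # answers
qm_row = row_normalize([3, 1, 1, 0, 0, 3, 2])  # questions
td_qa = topic_difference(am_row, qm_row)
```

Because $M$ is positive wherever either input is positive, the logarithm is always well defined. The measure is 0 for identical distributions and, under the natural-log assumption, reaches $\sqrt{2 \ln 2}$ for distributions with disjoint support.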

Figure 2.6 shows the cumulative distribution function (CDF) of the topic interest difference between the questions and answers of each user. The analysis covers a set of 181K+ users who posted at least one question and one answer. For most users, the topic interests of their questions and answers are similar. Hence, in this work, the major-topic probability distribution of all posts published by a user is used to represent that user’s topic interest. Namely, after row normalization, $CM'_{i,t}$ indicates the probability that user $u_i$ is interested in topic $t$. Note that the topics are transformed to the corresponding major topics only in the user topic interest calculation process.

[Figure 2.6: Topic interest difference between Q&A. CDF plot; x-axis: topic interest difference between questions and answers; y-axis: CDF; median = 0.055, mean = 0.12.]

[Figure 2.7: Distribution of question topic type. 1-topic 13.6%, 2-topic 15.7%, 3-topic 10.2%, 4-topic 10.4%, 5-topic 7.6%, 6-topic 3.4%, 7-topic 39.0%.]

Besides, to examine the topic diversity in SCQA sites, Figure 2.7 illustrates the distribution of question topic types in Zhihu, where $k$-topic denotes questions that are relevant to $k$ major topic(s). Multi-topic questions account for 86.4% of all questions, which shows that they are pervasive in Zhihu. Our proposed algorithm must therefore support identifying multi-topic opinion leaders.

Homophily. To help measure the true topical influence of each user, we examine whether homophily, which has been observed in many social networks [62, 100], exists in the social network of our dataset. Homophily means that users follow each other because of similar topic interests, which implies that a user’s influence on each follower depends on their topic interests. The following question helps verify whether homophily exists in Zhihu:

Do users with “following” relationships have more similar topic interest than those without?

The question can be formalized as a two-sample t-test. The null hypothesis is $H_0: \mu_{follow} = \mu_{unfollow}$ and the alternative hypothesis is $H_1: \mu_{follow} < \mu_{unfollow}$, where $\mu_{follow}$ is the mean topic interest difference between two users with a “following” relationship and $\mu_{unfollow}$ is the mean topic interest difference between two users without one. We design the homophily testing and evaluation experiments on a set of active Zhihu users who published at least 10 posts in total, denoted as $U$ ($|U| = 124{,}445$). We conduct the two-sample t-test on the user set as a whole because around 92% of the users in our dataset have fewer than 30 followees. Sample 0 contains the topic interest differences of all user pairs with “following” relationships, while Sample 1 contains the topic interest differences between each user and randomly chosen users whom that user does not follow. The number of non-followees chosen for each user equals that user’s number of followees. The topic interest difference between two users is calculated as $TD_u(i, j) = TD(CM'_{i,\cdot}, CM'_{j,\cdot})$. The t-test rejects $H_0$ at significance level $\alpha = 0.01$ with a p-value of less than $1 \times 10^{-17}$. The validity of the t-test depends on how close the samples are to a normal distribution. The skewness and kurtosis of the two samples are 1.19, 2.14 and 1.21, 2.09, respectively, which are considered acceptable for assuming a normal distribution [34].

Hence, we confirm the existence of homophily in Zhihu.
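The one-sided test described above can be illustrated with a short sketch. It assumes Welch’s form of the t statistic and a normal approximation for the p-value, which is reasonable only for large samples; the text does not state which t-test variant was used. The two samples are synthetic stand-ins, not the Zhihu data, and the function names are hypothetical.

```python
import math
import random
from statistics import mean, variance

def welch_t(sample0, sample1):
    """Welch's t statistic for H0: mean0 == mean1 (unequal variances allowed)."""
    n0, n1 = len(sample0), len(sample1)
    se = math.sqrt(variance(sample0) / n0 + variance(sample1) / n1)
    return (mean(sample0) - mean(sample1)) / se

def one_sided_p_lower(t_stat):
    """P(Z <= t) for a standard normal Z; approximates the t distribution for large n."""
    return 0.5 * math.erfc(-t_stat / math.sqrt(2))

random.seed(0)
# Synthetic topic interest differences; NOT the data from the study.
follow = [random.uniform(0.0, 0.4) for _ in range(2000)]    # stand-in for Sample 0
unfollow = [random.uniform(0.1, 0.5) for _ in range(2000)]  # stand-in for Sample 1

t_stat = welch_t(follow, unfollow)
p_value = one_sided_p_lower(t_stat)  # small when mean(follow) < mean(unfollow)
reject_h0 = p_value < 0.01
```

The alternative hypothesis $\mu_{follow} < \mu_{unfollow}$ is one-sided, so only the lower tail is used for the p-value.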

QARank Algorithm. Based on the above process, we propose a novel topic-sensitive influence measure algorithm called QARank, which incorporates three factors:

• Network structure: A user’s influence is propagated to other users through following links between them in SCQA. Hence, QARank considers the link structure, similar to the authority measure of a web page.

• Topic interest: Based on homophily, a user’s topical influence on a follower is stronger when their interests in the topic are more similar, and weaker otherwise. A user therefore has different influence in different topics within the same social network.

• Knowledge authority: Generally, a user’s opinion is more likely to be accepted by his followers when his answers obtain many votes. Hence, the knowledge authority of a user plays an important role in his influence. Specifically, the more votes a user receives, the more authoritative his followers consider him.

The proposed QARank, as an extension of TwitterRank, is modeled as a random surfer model.

[Figure 2.8: Example of transition probability calculation in QARank. One followee receives 400 votes in topic T and the other receives 200 votes in topic T.]

Let $G$ be a directed graph where each node represents a user and each directed edge denotes a “following” relationship between two users. A random surfer on $G$ visits each user with a certain probability by following the corresponding edge. QARank differs from TwitterRank in two ways: it incorporates topical knowledge authority into the transition probability from one user to another, and it can measure multi-topic influence by using the Euclidean distance to measure the topic interest difference. Hence, each element of the transition matrix $P_T$ for the topic set $T$ ($|T| \geq 1$) is calculated as:

$$P_T(i, j) = \frac{|V_{j,T}|}{\sum_{k:\, u_i \text{ follows } u_k} |V_{k,T}|} \cdot sim_T(i, j) \qquad (2.3)$$

where

$$sim_T(i, j) = 1 - \sqrt{\sum_{t \in T} \left(CM'_{i,t} - CM'_{j,t}\right)^2} \qquad (2.4)$$

Here $P_T(i, j)$ is the transition probability from follower $u_i$ to followee $u_j$ in the random surfer model, $|V_{j,T}| = \sum_{t \in T} V_{j,t}$ is the number of votes received by user $u_j$ in topic set $T$, and $\sum_{k:\, u_i \text{ follows } u_k} |V_{k,T}|$ is the total number of votes received by all of $u_i$’s followees in topic set $T$. In the model, the number of topic-related votes received is regarded as the topical knowledge authority of a user. Figure 2.8 shows an example with three users: $u_c$ follows $u_a$ and $u_b$, who received 400 and 200 votes in topic $T$, respectively. In this case, $u_a$’s influence on $u_c$ is twice that of $u_b$ when the topic interest similarity among the three users is not considered.

Of course, the influence of $u_a$ and $u_b$ on $u_c$ also depends on the topic interest similarity between them.
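The Figure 2.8 example can be reproduced with a small sketch of Equations 2.3 and 2.4, reading Equation 2.4 as one minus the Euclidean distance between the two users’ interest vectors restricted to $T$. The function names, the data layout, and the handling of a follower whose followees have no votes in $T$ (all transition values set to 0) are assumptions for illustration. The interest distributions are hypothetical and set equal so that $sim_T = 1$.

```python
import math

def sim(cm_i, cm_j, topics):
    """sim_T(i, j) = 1 - Euclidean distance between interests restricted to T (Eq. 2.4)."""
    return 1 - math.sqrt(sum((cm_i[t] - cm_j[t]) ** 2 for t in topics))

def transition_row(i, followees, votes, cm, topics):
    """Eq. 2.3 for one follower i: {j: P_T(i, j)} over the users that i follows."""
    def votes_in(j):
        return sum(votes[j][t] for t in topics)

    total = sum(votes_in(k) for k in followees)
    if total == 0:
        # Assumption: with no votes among the followees, all transitions are 0.
        return {j: 0.0 for j in followees}
    return {j: votes_in(j) / total * sim(cm[i], cm[j], topics) for j in followees}

# Figure 2.8 setting: u_c follows u_a (400 votes) and u_b (200 votes) in one topic.
topics = [0]
votes = {"a": [400], "b": [200]}
cm = {"a": [1.0], "b": [1.0], "c": [1.0]}  # hypothetical, identical interests
row_c = transition_row("c", ["a", "b"], votes, cm, topics)
# row_c["a"] is 2/3 and row_c["b"] is 1/3, so u_a weighs twice as much as u_b.
```

When the interest vectors differ, each entry is scaled down by its own similarity term, so the entries of a row no longer need to sum to 1.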

In addition, to handle dangling nodes, which have no outgoing edges, and cyclic loops in the network, we apply a random jump [38] by adding a teleportation vector $E_T$:

$$E_T = A''_{\cdot,T} \qquad (2.5)$$

where $A_{\cdot,T} = \sum_{t \in T} A_{\cdot,t}$, and $A''_{\cdot,T}$ is the column-normalized version of $A_{\cdot,T}$ such that $\|A''_{\cdot,T}\|_1 = 1$.

Given the transition probability matrix and the teleportation vector, the topical influence scores of users in topic set $T$, denoted $Inf_T$, can be calculated iteratively as:

$$Inf_T = \lambda P_T \times Inf_T + (1 - \lambda) E_T \qquad (2.6)$$
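A minimal sketch of the iteration in Equation 2.6 follows. It assumes the random-surfer reading in which the score of followee $u_j$ accumulates $P_T(i, j) \cdot Inf_T(i)$ over its followers $u_i$; the value `lam=0.85`, the stopping tolerance, the function name `qarank_scores`, and the toy inputs are placeholders chosen for illustration, not values taken from the text.

```python
def qarank_scores(P, E, lam=0.85, tol=1e-12, max_iter=1000):
    """Iterate Inf <- lam * (P applied to Inf) + (1 - lam) * E until the L1 change < tol.

    P[i][j] is the transition value from follower i to followee j, and
    E is the teleportation vector (non-negative, summing to 1).
    """
    n = len(E)
    inf = list(E)  # start from the teleportation vector
    for _ in range(max_iter):
        new = [
            lam * sum(P[i][j] * inf[i] for i in range(n)) + (1 - lam) * E[j]
            for j in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(new, inf)) < tol:
            return new
        inf = new
    return inf

# Toy graph in index order (a, b, c): user c follows a and b, as in Figure 2.8.
P = [
    [0.0, 0.0, 0.0],      # a follows nobody (dangling node)
    [0.0, 0.0, 0.0],      # b follows nobody (dangling node)
    [2 / 3, 1 / 3, 0.0],  # c -> a and c -> b
]
E = [0.5, 0.25, 0.25]  # hypothetical column-normalized answer counts
scores = qarank_scores(P, E)
```

In this toy graph the teleportation term is the only source of score for $u_c$, which has no followers, while $u_a$ and $u_b$ additionally receive a share of $u_c$’s score in proportion to their transition values.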
