
4.1 The topic model: validation, interpretation, and analysis

4.1.1 The first validation step: The word-over-topic distribution

Table 1: The 15 Topics and their Top 50 terms

Table 1 depicts all the topics and their top 50 defining terms. Every word scores with some probability on every topic; however, words tend to concentrate on one or a few topics. This means that some terms can rank highly on more than one topic. See, for example, the term “social”: it is the top term for both the Social Theory and the Micro-Individual topics; the third-highest term on the Law-Crime topic; the fourth on both the Ethnicity-Race and Politics-State topics; the seventh on the Education topic; the sixteenth on the Gender-Family topic; the nineteenth on the Analytics-Quant topic; and the twenty-sixth on the Culture-Generic topic. The algorithm allocates “social” to the top 50 terms of every topic except Religion, where it ranks 7,328th. Its low rank on the Religion topic most likely reflects the fact that social relationships are there described through membership in various groups, using terms such as israel, palestinian, cathol (viz. catholic), christian, jewish, islam, movements, and member. For a discipline that is mainly about the social world, it is reassuring that this term is important for almost all topics of sociology.

16 However, removing omnipresent or very sparse terms may improve the topic mixture but does not usually distort the topic-model outcome. We therefore tested the sensitivity of the topic mixture by removing both omnipresent and very sparse terms. The resulting distributions of topics across articles and of terms across topics were very similar to the final output presented in Table 1.

Most topics are straightforward to interpret qualitatively, and we have stipulated the topic titles accordingly; the topic-model output itself merely numbers the topics. The following topics need little explanation as to why they are named as they are: Education, Law-Crime, Gender-Family, Work-Labor, Economic, Organization, Religion, Ethnicity-Race, Politics-State, and Analytics-Quant. Topics that are less obvious needed further validation:

Social Theory, Micro-Individual, and Global Issues. For these, we read a selection of the articles and authors clustering in each topic to determine the appropriateness of the label (see the next two sections). The least obvious topics are Public and Culture-Generic, in that order. Public is the least obvious since it contains several general terms: will, one, may, can, unit, new, must, and so on. However, after further analysis of the articles and scholars clustering in this topic, we settled on calling it Public, as the topic also contains the terms public, state, govern, and interest. The same logic applies to the Culture-Generic topic. We added the suffix Generic to both these topics to indicate this uncertainty.

In this instance, we should flag the fact that topic-model algorithms can sometimes create so-called residual topics: the algorithm collects in one or more topics the terms it has difficulty allocating to more meaningful topics. As the large JSTOR dataset covers diverse time periods, we expect such topics to arise. That the Public-Generic topic captures both a meaningful and a less meaningful (residual) topic is further suggested by the size of its topic proportion during the period from 1890 to at least 1950. It is not ideal to have such a large topic proportion, but our sensitivity analysis did not manage to decrease the size of this topic. Additionally, this topic has a relatively high correlation with so-called garbage terms, i.e. terms that do not add any substantive meaning. Even though we removed many stop words, very sparse terms, and other common terms, some such terms are difficult to eliminate completely because they can be created by the transformation of more substantive terms. Nevertheless, we are confident that, given the assumptions of this paper and our sensitivity checks, the issues of residual topics and garbage terms do not affect the main conclusions of the paper.
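One simple diagnostic for flagging a residual topic, sketched here with hypothetical top-term lists and a hand-made function-word list (neither is from the paper), is the share of a topic's top terms that carry no substantive meaning:

```python
# Hypothetical top-term lists for two topics (illustrative only,
# loosely echoing the terms discussed in the text).
top_terms = {
    "Public-Generic": ["will", "one", "may", "can", "public", "state",
                       "govern", "interest", "unit", "new", "must", "also"],
    "Religion":       ["church", "religi", "cathol", "jewish", "islam",
                       "member", "movement", "christian", "israel", "belief"],
}

# A crude, hand-made list of low-content function words ("garbage terms").
garbage = {"will", "one", "may", "can", "unit", "new", "must", "also"}

# Share of each topic's top terms that are low-content: a high share
# suggests the topic is partly residual.
for name, terms in top_terms.items():
    share = sum(t in garbage for t in terms) / len(terms)
    print(f"{name}: {share:.0%} of top terms are low-content")
```

A topic whose top terms are dominated by such words, as Public-Generic is in this toy example, warrants the kind of qualitative re-reading and sensitivity testing described above.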

One of the most important terms of the new economic sociology is embeddedness. Understanding this term's topic distribution is a prerequisite for determining which economic topics are meaningful and deserve our focus. Figure 2 depicts this distribution: the x-axis represents the 15 topics and the y-axis the probability of the term belonging to each topic. Here we see that the term scores highest on Organization, followed by Social Theory and Culture-Generic. This is a reassuring result, because we know from theory that Granovetter's work has had a major impact on organization research.11 Observe that the term has a very low probability score on the Economic and Work-Labor topics. One likely reason is that the Economic topic represents economic work of a more classical type, while the Work-Labor topic captures the general field of industrial relations and political economy. Both are certainly important in their own right for determining the economic orientation of sociology. Hence, we will include all three topics in our further analysis.

11 It also corresponds well to Granovetter's topic mix as represented in the heat map (Figure 4).

Figure 2: The topic distribution of the term “embeddedness”
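A distribution of this kind can be read off a fitted model's topic–term matrix. As an illustration only (the weight matrix below is random, not the paper's fitted model, and a uniform prior over topics is assumed), the probability of each topic given a term is the term's renormalised column:

```python
import numpy as np

# Random stand-in for a fitted (15 topics x V terms) weight matrix;
# column 0 plays the role of the term "embeddedness".
rng = np.random.default_rng(0)
weights = rng.random((15, 40))

# P(topic | term) under a uniform prior over topics: renormalise the
# term's column so the 15 probabilities sum to one.
col = weights[:, 0]
p_topic_given_term = col / col.sum()

print(p_topic_given_term.round(3))
print("highest-scoring topic index:", int(p_topic_given_term.argmax()))
```

Plotting `p_topic_given_term` as a bar per topic reproduces the layout of Figure 2: topics on the x-axis, belongingness probability on the y-axis.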

In summary, we are confident that the algorithm is capturing at least 13 meaningful topics, three of which relate directly to the economic orientation of sociology: Economic, Work-Labor, and Organization. It may appear remarkable to a social scientist unfamiliar with machine learning that a computer algorithm is able to order and allocate terms in the manner shown in Table 1 and the following tables (see the next sections).

However, there is as little, or as much, mystery about this algorithm as there is about the Google Search algorithm. They belong to the same family of algorithms; and who does not use Google?


Nevertheless, even if you use and trust the Google Search algorithm, you might not trust the outcome of our particular model: in the following two sections we therefore conduct further robustness analyses of the topic-model output to ensure that it is validly and reliably measuring what it is intended to measure.