
applied to various scientific disciplines (McFarland et al. 2013, Teich et al. 2015, Argamon, Dodick, and Chase 2008), but also to newspaper and historical-document corpora (McCallum and Corrada-Emmanuel 2007, Block 2006), social media (Zhao et al. 2011), and fictional texts (Blevins 2010). Closest in content is a methods article that investigates 100 years of German sociology through the yearly proceedings of the German Sociological Society (Bleier and Strotmann 2013). The purpose of many of these analyses, however, lies more in introducing and probing a new method than in contributing to existing debates in the sociology of science.

Our contribution sits at the intersection of these two strands of literature. On the one hand, we rely on the most recent developments in automated content analysis, namely topic modeling and related techniques. Contrary to much of the technical literature, we do not use these techniques as l'art pour l'art, to make methodological claims or to create topics inductively without further questioning. Rather, on the other hand, we address concrete claims made about the development of one of sociology's main sub-disciplines, namely economic sociology. The self-description of this discipline has up to now not been subject to quantitative text analysis, but has relied on classical content-analysis techniques or on the treatment of individual texts thought to be hermeneutically crucial for the discipline. Our contribution thus lies in the use of a new technique to answer existing questions of a sociological sub-discipline using unique and extensive data.

3 Data and Methodology

3.1 Data: the full JSTOR sociology data, 1890 to 2014

Our original data consist of 142,040 full-text articles from 157 journals, all written in English. The time range is 1890 to 2014. These articles are the full sociology 3 journal coverage of JSTOR (see pp. 1-3 in the appendix), provided by its service Data for Research (http://dfr.jstor.org/). These data were accessed by agreement with JSTOR on 12 December 2014.

The full-text articles were cleaned, organized, and analyzed using the R programming language together with a variety of packages, most notably: the tm (text mining) package for management of the text corpus (Meyer, Hornik, and Feinerer 2008); the dplyr package for general data management (Wickham and François 2005); the topicmodels package for estimating latent topics (Hornik and Grün 2011); ggplot2 for graphical output; and R2MLwiN together with the MLwiN software to fit multilevel growth models (Rasbash et al. 2015).
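For orientation, a minimal sketch of the package setup described above (the exact package versions used are not reported here and may differ):

```r
# Packages used in the workflow described above; the selection follows the
# list in the text, not a verbatim script from the authors.
library(tm)           # corpus management and text preprocessing
library(dplyr)        # general data management
library(topicmodels)  # Latent Dirichlet Allocation
library(ggplot2)      # graphical output
library(R2MLwiN)      # interface to MLwiN for multilevel growth models
```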

We cleaned and organized the data in the following steps. First, the data were downloaded as string objects from the HTML files provided by JSTOR. Second, we extracted meta-data about the

3 What counts as a "sociology" journal is in fact also determined by automated content analysis procedures (personal communication from JSTOR, 21 August 2015).

articles using regular expressions and stored everything as a large matrix (one row per article, and as many columns as needed to store the meta-data). Third, we created another large matrix, called a document-term matrix, which consists of all the words (terms) contained in the articles.

We then went on to remove common words (so-called stop-words, such as "and", "is", "or"), numbers, whitespace, and punctuation from this matrix. We also stemmed all the terms. This means that different grammatical forms of a term are reduced to their common root component: for example, the words "industry, industrial, industrious, industrialist" are all reduced to "industri". These are all common procedures in text mining. After applying these procedures, we still had about 3.7 million terms. We went a step further and removed terms that are sparse, i.e. not shared by many documents: we removed terms with less than a 0.001% prevalence rate, which resulted in 216,406 remaining terms. 4
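The preprocessing steps just described can be sketched with the tm package roughly as follows; the object name `texts` is hypothetical, and the sparsity threshold is our reading of the 0.001% prevalence criterion, not a value taken from the authors' code:

```r
library(tm)
library(SnowballC)  # provides the stemmer used by stemDocument()

# 'texts' is assumed to be a character vector with one full-text article per element
corpus <- VCorpus(VectorSource(texts))

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # stop-words: and, is, or, ...
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)                       # industry, industrial, ... -> industri

dtm <- DocumentTermMatrix(corpus)

# Drop sparse terms: keep terms appearing in at least roughly 0.001% of documents,
# i.e. allow at most 99.999% sparsity (our interpretation of the threshold).
dtm <- removeSparseTerms(dtm, sparse = 0.99999)
```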

Figure 1: Flow-chart of how we transformed the data

4 We experimented with different thresholds and found a 10% prevalence rate to provide enough words to distinguish variety between articles. When we applied the topic modeling algorithm to the full 3.7 million terms, we never managed to get the model to converge, even after running it for about six weeks.


3.2 Limitations of the data

There are several data limitations that need to be mentioned. First, we have analyzed only articles written in English, which means that we do not cover developments in some important journals written in other languages, such as, in French, Actes de la recherche en sciences sociales, Revue française de socio-économie, and Revue française de sociologie; in German, Kölner Zeitschrift für Soziologie und Sozialpsychologie and Zeitschrift für Soziologie; and in Italian, Stato e mercato. The reason for focusing on English only is that the quantitative text mining technique we use is most effective when dealing with one language at a time.

Second, even in the English-speaking world, we do not cover all journals and all years. The problems of compiling a complete account of all sociology journals are well known: no index seems to cover all self-declared sociology journals, while the indexes include many journals that do not declare themselves sociological; the journal population is constantly changing; and sociology is also published outside of sociology journals proper (Bell 1967, Hardin 1977: 32f). Our corpus essentially covers the major journals of the discipline, which have been used as representative of American sociology for various periods in preceding studies (Sieg 2002: 111f, Abend, Petre, and Sauder 2013, Abend 2006). Beyond this common basis, there are divergences from article collections other than JSTOR. SocIndex is probably the most encompassing sociological research database, with almost 900 full-text journals and 700,208 English articles for the period 1895-2015 (SocIndex, 13 July 2015). A closer look reveals that this larger number of journals and articles is mostly achieved by extending sociology into the neighboring fields of psychology, criminology, regional studies, etc. The well-known Web of Science Core Collection, in turn, lists 139,773 articles for "sociology" from the 19th century to 2015, which closely resembles our data volume. The Social Science Citation Index (SSCI) lists 142 journals in sociology (2015), only 31 of which intersect with our corpus, not least because the SSCI also includes many non-English journals. The intersection set includes well-known, highly ranked sociology journals, while no clear topic-related pattern can explain coverage by one listing but not the other. A shortcoming of our corpus is the absence of some newer journals associated with economic sociology, 5 such as Socio-Economic Review. However, as other non-economic-sociology journals of recent vintage are also missing, and as this concerns only the most recent period, one cannot speak of a systematic distortion of the entire corpus.

Finally, the Scopus database lists 1.7 million articles in "social sciences" since 1960, but no further refinement by discipline is possible.

Third, while we go beyond bibliometric analyses that do not use the textual body of articles, we do not analyze books, which means that any conclusions reached in this paper could be distorted by developments in the book domain. Most of the sociology classics, for instance, wrote in book form. Since their influence carried a certain time-lag, this might work somewhat against our hypothesis about the U-shaped curve of the economic orientation of sociology articles. Still, it is rather unlikely that the language, the vocabulary, and the

5 See: http://econsoc.mpifg.de/links.asp

research interests in sociology books are disconnected from developments in sociology articles.

Accordingly, analyzing journal articles only will give a reasonably good picture of developments in sociology.

These three limitations (English-centeredness, selective journal coverage, no books) should be kept in mind when interpreting the results. They are not unsolvable restrictions in principle, but confront any data project of this size. In the Discussion section we will point to some ways in which future research will be able to deal with them.

3.3 Topic modeling

As pointed out by DiMaggio et al. (2013), sociologists analyze text in one of three approaches: qualitative reading of text, semi-structured qualitative reading with a coding sheet, or fully automated algorithmic analyses. The main limitation of the first approach is that it hardly produces reproducible results. The second approach has two limitations. It is impractical for large corpora: we have about 140,000 articles, and if one spent two hours reading each article without doing anything else (eating, sleeping, publishing, etc.), it would take about 32 years to get through our corpus, without any analysis of the articles. Moreover, it would be difficult to achieve a reasonable degree of inter-coder reliability if one employed several coders instead. The main limitation of the third approach is that the meaning of a text (an article) is reduced to its constitutive words (keywords), without necessarily looking at the discursive, contextual, and linguistic relations between these words. Frequency-based content analysis is an example of such an approach (Jockers, 2014, p. 73ff.; Stone, 1966). What we need, as DiMaggio et al. (2013, p. 577) argue, are approaches that satisfy four desiderata: first, they must be explicit, meaning that data and estimation methods are reproducible and transparent; second, they must be automated in order to allow for the analysis of large corpora; third, they must be inductive to allow for the discovery of underlying structures and thus for (qualitative or quantitative) hypothesis testing; fourth, they must account for the relationality of meaning across varying discursive and linguistic contexts. Topic modeling fulfills all these conditions (Blei et al., 2003; Blei and Lafferty, 2007).

The basic idea behind topic modeling is that of a bag of words. 6 The main assumption is that there are certain latent topics that inform a given field (e.g. sociology) and condition the writing of documents (e.g. articles). Each topic (the bag) is associated with the list of all terms (words) that exist in that field, each loading on the topic with a certain probability; each document, in turn, loads on each topic with a certain probability. The "writing process of a document" can then be described in the following steps. Assume that we have 20 topics: first,

6 There are several pedagogical and technical introductions to how topic models work and examples of their application (DiMaggio et al., 2013; Fligstein et al., 2014; Newman et al., 2006); here we give only a brief primer.

take a random document in the field of sociology and roll a 20-sided die (one side per topic) in which the likelihood of each side equals the document's topic probabilities; in other words, it is a weighted die. Imagine that our die shows topic 5. Second, go to topic 5 and roll another die, now with as many sides as there are terms (assume that we have 3,000 terms), weighted according to the probability distribution of words in that topic. Picture that we roll the 3,000-sided die and get the word "market". Third, we assign the term "market" to our document and repeat the whole process until we fill up all the so-called tokens of that document. Tokens are the sum of all the randomly assigned words for each document: a word may be re-used in the process described above. Accordingly, each document scores on all topics (the document-topic matrix) with a certain probability, summing to 1 for each document; each topic scores on all words (the topic-word matrix) with a certain probability, summing to 1 for each topic.
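To make the dice-rolling metaphor concrete, here is a toy simulation of this generative process in R (all numbers and distributions are invented for the illustration; this is not the estimation procedure itself):

```r
set.seed(1)
n_topics <- 20    # the 20-sided topic "die"
n_terms  <- 3000  # the 3000-sided word "die"
n_tokens <- 200   # number of tokens to fill in one document

vocab <- paste0("term", seq_len(n_terms))

# Toy probabilities: a document-topic vector (sums to 1) and a
# topic-word matrix (each row sums to 1).
doc_topic  <- prop.table(rgamma(n_topics, shape = 0.1))
topic_word <- t(replicate(n_topics, prop.table(rgamma(n_terms, shape = 0.05))))

document <- character(n_tokens)
for (i in seq_len(n_tokens)) {
  z <- sample.int(n_topics, 1, prob = doc_topic)            # roll the weighted topic die
  document[i] <- sample(vocab, 1, prob = topic_word[z, ])   # roll the weighted word die
}

head(sort(table(document), decreasing = TRUE))  # most frequent tokens in the simulated document
```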

The essential task of topic modeling is to estimate these probabilities (in the example above, all parameters were assumed). It does so by back-tracking the whole process. There are several estimation algorithms, but the most common one, and the one we use, is called Latent Dirichlet Allocation (LDA), which is underpinned by Bayesian statistical theory. LDA takes a relational, machine-learning approach to modeling language. The algorithm seeks structure in the corpus through the co-occurrence of words with respect to how they cluster in documents. The only observed data are words and documents, whereas topics are estimated.

As DiMaggio et al. (2013, p. 578) describe:

"LDA trades off two goals: first, for each document, allocate its observed words to few topics; second, for each topic, assign high probability to few words from the vocabulary. Notice that these goals are at odds. Consider a document that exhibits one topic. Its observed words must all have probability under that topic, making it harder to give few words high probability. Now consider a set of topics, each of which has very few words with high probability; documents must be allocated to several topics to explain those observations, making it harder to assign documents to few topics. LDA finds good topics by trading off these goals."

An important premise to bear in mind is that the number of topics has to be specified manually by the researchers, which some have suggested is problematic (Schmidt 2013).

However, we argue that this manual specification does not pose a problem per se. We regard topic modeling as a way of solving a jigsaw puzzle: whether the puzzle consists of 20 pieces or 2,000 pieces, it will always reconstruct the same picture. To ensure interpretability, and in line with DiMaggio et al. (2013), who define 12 topics, and Fligstein et al. (2014), who specify 15 topics, we kept our number of topics at 15.

3.4 Multilevel modeling

While topic modeling measures the topical orientation of sociology, it lacks standardized mechanisms for testing hypotheses. We therefore use multilevel modeling (also known as random-effects, mixed, or hierarchical modeling) as a supplementary method.

The main advantage of multilevel modeling is that it allows us to capture the time-trend of the economic orientation of sociology as well as control for the journal clustering of articles (Singer and Willett 2003, Steele 2014). We chose multilevel models over fixed-effects models because we do want to estimate both the within- and between-journal variation to determine their relative importance. When the model is defined correctly, it can properly estimate what a fixed-effects model can (time-varying effects) as well as other, more useful quantities (time-invariant effects, the partition of lower- and higher-level variance, clustering effects, etc.) (Bell and Jones 2014).

We define the following baseline multilevel model and vary it according to the hypotheses we test (formulated in the following sections) and the dependent variables we define. It has the following fixed part 7 :

$y_{aj} = \beta_0 + \beta_1 x_{1aj} + \beta_2 x_{2aj}$

The dependent variables are captured by the term $y_{aj}$, which measures the economic orientation of a particular article and will be derived from the topic model output: in this paper we will devise two different dependent variables that follow directly from the topic

7 Observe the terminological difference between a fixed-effects model and the fixed part of a multilevel model. A fixed-effects model is a type of regression, whereas the fixed part of a multilevel model captures the average effect of the specified variables.

8 Whereas the fixed part captures the average estimated effect, the random part captures how the effect is distributed (deviates) for each and every case (articles and journals).

model analysis. The index a is the article identifier: it runs from 1 to 136,843, the total number of valid articles 9 in our sample. The index j is the journal identifier: it runs from 1 to 143, covering all the journals in our sample. This also means that the 136,843 articles are hierarchically nested in 143 journals. Moreover, the values of $y_{aj}$ run from 0 to 1, since it is a proportion variable: higher values indicate that article a in journal j has a stronger economic orientation.

The focal independent variables are the era variables. Three of them are defined as the mean (intercept) economic orientation during the eras 1890to1920, 1921to1984, and 1985to2014, respectively; the other three are defined as the trend (slope) of the economic orientation of articles during these eras: 1890to1920slope, 1921to1984slope, and 1985to2014slope. The variables estimating the means are defined as dummy variables; the variables estimating the slopes are timer variables counting, in decimal years, when the article was published within the relevant era. For example, an article published in mid-1986 will have a counter value of 1.5 (one and a half years): the counter measures the difference between the start of the era and the date at which the article was published within that era.
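As an illustration of how such era dummies and timer variables might be constructed (the column names, and the assumption that publication time is available as a decimal year, are ours):

```r
library(dplyr)

# 'articles' is assumed to contain a column 'pubyear' with the publication
# time in decimal years (e.g. mid-1986 = 1986.5).
articles <- articles %>%
  mutate(
    era1890to1920   = as.integer(pubyear >= 1890 & pubyear <= 1920),
    era1921to1984   = as.integer(pubyear >= 1921 & pubyear <= 1984),
    era1985to2014   = as.integer(pubyear >= 1985),
    # timer (slope) variables: decimal years since the start of the relevant era
    slope1890to1920 = ifelse(era1890to1920 == 1, pubyear - 1890, 0),
    slope1921to1984 = ifelse(era1921to1984 == 1, pubyear - 1921, 0),
    slope1985to2014 = ifelse(era1985to2014 == 1, pubyear - 1985, 0)
  )
# An article from mid-1986 then gets era1985to2014 = 1 and slope1985to2014 = 1.5.
```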

If the general story described by the standard narrative of economic sociology is correct, the classical era should have a high 1890to1920 value with an increasing 1890to1920slope, the intermediary era should have a lower 1921to1984 value with a decreasing 1921to1984slope, and the new economic sociology era should again have a high 1985to2014 value and an increasing 1985to2014slope. Besides the era variables, which are all based on time, we controlled for the page length of an article, that is, whether a greater page length is also associated with a stronger economic orientation.
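A minimal sketch of such a baseline model, here fitted with lme4 rather than the authors' R2MLwiN/MLwiN setup, and with hypothetical variable names (`econ_orientation` stands for the article-level topic proportion derived from the topic model):

```r
library(lme4)

# Articles (level 1) nested in journals (level 2); the era dummies act as
# era-specific intercepts (hence no global intercept), the slope variables
# as era-specific time trends, and 'pages' as the page-length control.
m_baseline <- lmer(
  econ_orientation ~ 0 + era1890to1920 + era1921to1984 + era1985to2014 +
    slope1890to1920 + slope1921to1984 + slope1985to2014 + pages +
    (1 | journal_id),
  data = articles
)

summary(m_baseline)
```

The random intercept for `journal_id` is what captures the between-journal variation discussed above, while the residual term captures the article-level variation.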

All our main results are based on linear multilevel models. The ideal case would be to use a Tobit model, since the dependent variable is censored, or Dirichlet regression models, since we are modeling proportions. However, we prefer not to add further complexity to an already complicated regression model, and a multilevel Tobit or Dirichlet regression would force us to do just that. Still, single-level Tobit regression models show that the coefficients are relatively robust compared to the multilevel model. A Dirichlet model, in addition, requires the assumption of a negative correlation between all topics, which does not hold for our model.
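The single-level Tobit robustness check mentioned here could be implemented, for example, with the AER package; the authors do not state which implementation they used, and the variable names follow the hypothetical sketch above:

```r
library(AER)

# Single-level Tobit regression with the proportion outcome censored at 0 and 1.
m_tobit <- tobit(
  econ_orientation ~ 0 + era1890to1920 + era1921to1984 + era1985to2014 +
    slope1890to1920 + slope1921to1984 + slope1985to2014 + pages,
  left = 0, right = 1, data = articles
)

summary(m_tobit)
```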

Another approach would be to use logit or probit models. This approach has its own limitation: we would have to come up with mostly arbitrary thresholds to define a dichotomous dependent variable. Since we are primarily interested in the general trend of the coefficients (positive, negative, or zero) and only secondarily in the magnitude of this trend, we chose to work with linear models.

9 Several articles lack page numbers and had to be dropped from the multilevel analysis. These articles are still used in the topic modeling.


4 Analysis and Results

The main aim of this section is to analyze the functional shape (trend) of the economic topics in sociology with the help of the JSTOR data. This analysis involves the following elements: (1) the implementation of a topic model in order to measure the development of 15 topics in sociology over the last 124 years in the data sample described above; (2) the use of the topic modeling output as input for the multilevel regression, which serves as a formal test of our hypotheses; and (3) further sensitivity analyses, mainly of the topic model but also of the multilevel model.

4.1 The topic model: validation, interpretation, and analysis

As described in the methodology section, we applied an unsupervised machine-learning algorithm named Latent Dirichlet Allocation. Our model used a predefined number of topics: 15. This number merely defines the number of clusters into which we want the algorithm to order the terms (words) of the articles, and hence the topic distribution (over 15 topics) across each article. We also set the α-parameter to 0.01; this defines the prior Dirichlet distribution the model should assume. The lower the α-parameter, the more concentrated a topic distribution the model will assume and thus generate: the model will try to assign the most probable topics with even higher probability. Conversely, the higher the α-parameter, the more uniform the topic distribution will be across each article. We experimented with various topic numbers and α-parameters and still found the results to be robust. 10
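With the topicmodels package, the estimation with these settings can be sketched as follows; the seed and the choice to keep α fixed rather than estimated are our assumptions, and `dtm` refers to the document-term matrix built earlier:

```r
library(topicmodels)

lda_fit <- LDA(
  dtm,
  k = 15,                                 # predefined number of topics
  control = list(alpha = 0.01,            # concentration of the Dirichlet prior
                 estimate.alpha = FALSE,  # keep alpha fixed at 0.01
                 seed = 1234)             # for reproducibility (our choice)
)
```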

Before presenting the results, we will perform some validity checks (DiMaggio, Nag, and Blei 2013, Grimmer 2010) to show that the model measures what knowledgeable readers would expect it to measure. In the first step and section, we ask: is the term-by-topic distribution intelligible? Do the words cluster in a way that describes a substantive sociological topic? From this we will also stipulate the names for the 15 topics, since the topic model will only number the topics one through fifteen. In this step, we also check the topic distribution of the term "embeddedness", the hallmark concept of the new economic sociology, to see whether it loads on the topics one would expect.
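These two checks can be sketched as follows with the fitted topicmodels object (here called `lda_fit`); note that after stemming, "embeddedness" most likely appears under a stem such as "embedded", which should be verified against the actual vocabulary:

```r
library(topicmodels)

# Most probable terms per topic, as a basis for naming the 15 topics.
terms(lda_fit, 20)

# Distribution of a single (stemmed) term across the topics.
post <- posterior(lda_fit)   # $terms: topic-by-term probabilities
stemmed <- "embedded"        # assumed stem of "embeddedness"; check colnames(post$terms)
p_term_given_topic <- post$terms[, stemmed]
p_topic_given_term <- p_term_given_topic / sum(p_term_given_topic)
round(sort(p_topic_given_term, decreasing = TRUE), 3)
```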
