Extracting Topics - User’s Guide

The Topic Extraction feature of WordStat attempts to uncover the hidden thematic structure of a text collection by combining natural language processing and statistical analysis. The main statistical procedure used for topic extraction in WordStat is factor analysis. Technically, the extraction starts by computing a word x document frequency matrix or, alternatively, by segmenting documents into smaller chunks and computing a word x segment frequency matrix. Once this matrix is obtained, a factor analysis with Varimax rotation is computed in order to extract a small number of factors.
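As an illustration only, the frequency matrix described above can be sketched in a few lines of Python. This is not WordStat's implementation: the `tokenize` and `frequency_matrix` helpers and the sample texts are hypothetical, and the tokenization rule is a simplifying assumption.

```python
import re
from collections import Counter

def tokenize(text):
    # Simplifying assumption: a word is a run of letters or apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def frequency_matrix(units):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[i] in units[j]."""
    counts = [Counter(tokenize(unit)) for unit in units]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[word] for c in counts] for word in vocabulary]
    return vocabulary, matrix

# Hypothetical two-document collection.
documents = [
    "Taxes fund schools. Schools need teachers.",
    "Lower taxes help business.",
]
vocabulary, matrix = frequency_matrix(documents)
for word, row in zip(vocabulary, matrix):
    print(word, row)
```

Passing a list of paragraphs or sentences instead of whole documents gives the word x segment variant of the same matrix.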

All words with a factor loading higher than a specific criterion are then retrieved as part of the extracted topic. Whereas in hierarchical cluster analysis a word may appear in only one cluster, topic modeling using factor analysis may associate a word with more than one factor, a characteristic that more realistically represents the polysemous nature of some words as well as the multiple contexts in which words are used.

The current implementation of the topic modeling procedure has a limit of 2500 words or content categories (we are working on ways to increase this capacity to at least twice this amount). To ensure the stability of the factoring solution, low-frequency items should preferably be excluded. It is thus strongly recommended to remove any word occurring fewer than 10 times on smaller data sets, and ideally fewer than 30 to 50 times on larger ones. Stemming, lemmatization or the creation of a categorization dictionary may also be used to group words or phrases, including less frequent ones, prior to the topic extraction.
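The frequency recommendation can be pictured with a small, hypothetical Python helper. The function name `select_vocabulary` and its parameters are illustrative; in particular, keeping the most frequent items when the limit is exceeded is an assumption, not documented WordStat behavior.

```python
from collections import Counter

def select_vocabulary(tokens, min_count=10, max_items=2500):
    """Keep items occurring at least min_count times, most frequent first,
    and cap the list at max_items (assumed handling of the 2500 limit)."""
    counts = Counter(tokens)
    kept = [word for word, n in counts.most_common() if n >= min_count]
    return kept[:max_items]

# Hypothetical token stream: "levy" is too rare to be retained.
tokens = ["tax"] * 12 + ["school"] * 10 + ["levy"] * 3
selected = select_vocabulary(tokens)
print(selected)
```

On a larger data set, the same call with `min_count=30` or `min_count=50` would apply the stricter threshold suggested above.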

WordStat provides the following analysis options to control the topic modeling process:

Segmentation - This option allows one to specify whether the data used for topic modeling will be based on the co-occurrence of words in the same document, or on co-occurrence within paragraphs or sentences. The choice of segmentation should ideally reflect how topics are distributed in a typical document and across documents, as well as the objective of the analysis. When the text collection consists of long documents containing multiple topics (such as long political speeches) and one needs to identify all topics in order to compare their relative frequencies, then performing a segmentation by paragraph or by sentence may be more sensitive than computing co-occurrences by document. Alternatively, if one attempts to differentiate documents by identifying domains or disciplines, or to identify the dominant issue of documents, then performing the analysis at the document level may be more appropriate. When analyzing responses to open-ended questions, which may include several topics listed in a single paragraph, segmenting by sentence may also result in a more precise extraction of the various topics they contain.
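To make the three segmentation levels concrete, here is a minimal Python sketch. The `segment` function is hypothetical, and its splitting rules (blank lines for paragraphs, end punctuation for sentences) are naive assumptions rather than the rules WordStat applies.

```python
import re

def segment(document, level="document"):
    """Split one document into analysis units.
    level is 'document', 'paragraph' or 'sentence'."""
    if level == "document":
        return [document.strip()]
    # Assumption: paragraphs are separated by at least one blank line.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    if level == "paragraph":
        return paragraphs
    if level == "sentence":
        sentences = []
        for paragraph in paragraphs:
            # Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
            parts = re.split(r"(?<=[.!?])\s+", paragraph)
            sentences.extend(s.strip() for s in parts if s.strip())
        return sentences
    raise ValueError(f"unknown level: {level}")

# Hypothetical speech covering several topics.
speech = "We will cut taxes. We will build schools.\n\nCrime is down."
for level in ("document", "paragraph", "sentence"):
    print(level, len(segment(speech, level)))
```

The finer the level, the more units there are and the fewer words each contains, which is why a sentence-level analysis can separate topics that share a single paragraph.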

No. of Topics - Setting this option allows one to specify how many topics to extract.

Loading - This option allows one to set the minimum factor loading an item (typically a word) should reach in order to be retained in the factor solution. By default, this value is set to 0.4. Increasing the cutoff value will reduce the number of words, keeping only the more representative ones, while reducing it may include words that are somewhat less characteristic of the extracted topic.
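The effect of the cutoff can be illustrated with a short Python sketch. The loadings below are invented for the example, the `topic_keywords` helper is hypothetical, and treating the cutoff as "greater than or equal to" on positive loadings is an assumption.

```python
def topic_keywords(loadings, cutoff=0.4):
    """loadings maps each word to its loadings on factor 1, factor 2, ...
    Returns {factor number: words at or above cutoff, highest loading first}."""
    n_factors = len(next(iter(loadings.values())))
    topics = {}
    for f in range(n_factors):
        hits = [(word, row[f]) for word, row in loadings.items() if row[f] >= cutoff]
        hits.sort(key=lambda pair: pair[1], reverse=True)
        if hits:  # a factor with no qualifying word is left out
            topics[f + 1] = [word for word, _ in hits]
    return topics

# Invented loadings on three factors.
loadings = {
    "bank":  [0.62, 0.45, 0.10],
    "loan":  [0.71, 0.05, 0.12],
    "river": [0.02, 0.58, 0.20],
    "fish":  [0.10, 0.08, 0.30],
}
default_topics = topic_keywords(loadings)       # cutoff of 0.4
strict_topics = topic_keywords(loadings, 0.6)   # higher cutoff, fewer words
print(default_topics)
print(strict_topics)
```

With the default cutoff, "bank" is retained in two topics, while the third factor has no word reaching the cutoff and is dropped. Raising the cutoff to 0.6 leaves only the first topic.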

Once the options have been set, click the button to perform the analysis. Please note that extracting topics on more than a few hundred words can take several minutes. Once extracted, the Topic Modeling page should look like this:

The table located on the left contains the following information:

NO The factor number. Please note that some factor numbers may be omitted if none of their items attained the factor loading cutoff criteria. When factors are being merged by the user, this column displays the numbers of all factors that have been merged together.

NAME WordStat uses an algorithm to automatically provide a label for the extracted topic. This label may be edited by clicking the button.

KEYWORDS This column lists all keywords meeting the factor loading cutoff criterion, in descending order of factor loading.

% VAR This column shows the percentage of variance explained. Please note that the smaller the segment one chooses, the lower the percentage.

FREQ This column displays the total frequency of all items listed in the keywords column.

CASES This column displays the number of cases containing at least one of the items listed in the keywords column.

% CASES This column displays the percentage of cases with at least one of the items listed in the keywords column.

Topic modeling buttons:

This button allows you to delete the topic on the selected row.

Click this button to merge a topic into another one. You first need to select the row containing the first topic you would like to merge and then click this button. A dialog box will appear with a list of all other topics. Select the second topic and click OK.

To rename a topic, select it first and then click this button. Type the new name and click OK.

To retrieve segments associated with a topic, select it and click this button. All text segments containing at least two keywords of the selected topic will be retrieved and presented in a table format. You may, however, change either the type of segments retrieved (paragraphs, sentences or full documents) or the minimum number of topic words needed for retrieval.
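The retrieval rule can be sketched in Python as follows. The `retrieve_segments` helper and the sample segments are hypothetical, and the sketch assumes that "at least two keywords" means two distinct keywords rather than two occurrences of the same one.

```python
import re

def retrieve_segments(segments, keywords, min_keywords=2):
    """Return (segment, matched keywords) for each segment containing at
    least min_keywords distinct topic keywords (assumed interpretation)."""
    keywords = set(keywords)
    results = []
    for seg in segments:
        found = keywords & set(re.findall(r"[a-z']+", seg.lower()))
        if len(found) >= min_keywords:
            results.append((seg, sorted(found)))
    return results

# Hypothetical sentence-level segments and topic keywords.
segments = [
    "The bank approved the loan.",
    "The bank was closed.",
    "Interest on the loan is paid to the bank.",
]
topic = ["bank", "loan", "interest"]
retrieved = retrieve_segments(segments, topic)
for text, matched in retrieved:
    print(matched, text)
```

Lowering `min_keywords` to 1 would also return the second segment, at the cost of retrieving text that is less clearly about the topic.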

This button allows one to perform co-occurrence analysis of all the extracted topics, including clustering and multidimensional scaling, and to create proximity plots as well as link charts. For more information on the various features available, see the Co-Occurrence Page topic.

This button allows one to perform full crosstabulation analysis of all the displayed topics with structured data, apply statistical analysis and create various charts such as correspondence plots, heatmaps, bubble charts and bar charts. For more information on the various features available for crosstabulation analysis, see the Crosstab Page topic.

This button stores the extracted topics currently displayed into a new categorization dictionary where folders at the first level correspond to different topics, and where each of those folders contains the associated words. A dialog box allows one to save

Press this button to append a copy of the topic table in the Report Manager. A descriptive title will be provided automatically. To edit this title or to enter a new one, hold down the SHIFT keyboard key while clicking this button (for more information on the Report Manager, see the Report Management Feature topic).

This button allows one to store the topic table to disk in various formats, including Excel, tab and comma delimited files, plain text, HTML, XML, SPSS or Stata files.

Clicking this button allows you to print a copy of the displayed chart.

Using the right panel

On the right of this table, a panel allows one to look at the distribution of the selected topic among values of up to two structured variables. One may choose to display this distribution using a vertical bar chart, a horizontal bar chart or a line chart by clicking the corresponding button. Four statistics may also be represented on those charts:

· Case Occurrence - number of cases in this subgroup containing at least one of these words.

· Category percent - percentage of cases in this subgroup containing at least one of these words.

· Word Frequency - total number of these words in this subgroup.

· Rate per 10,000 words - rate of words in this subgroup per 10,000 words.
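As a rough illustration of how these four statistics relate to one another, the following Python sketch computes them for each subgroup. The `subgroup_stats` helper and the sample cases are hypothetical, and the sketch assumes the rate is computed against the total number of words in the subgroup.

```python
import re
from collections import defaultdict

def subgroup_stats(cases, keywords):
    """cases is a list of (subgroup, text) pairs.
    Returns {subgroup: dict holding the four statistics}."""
    keywords = set(keywords)
    totals = defaultdict(lambda: {"cases": 0, "hit_cases": 0, "hits": 0, "words": 0})
    for group, text in cases:
        tokens = re.findall(r"[a-z']+", text.lower())
        hits = sum(1 for tok in tokens if tok in keywords)
        row = totals[group]
        row["cases"] += 1
        row["hit_cases"] += 1 if hits else 0
        row["hits"] += hits
        row["words"] += len(tokens)
    return {
        group: {
            "case_occurrence": r["hit_cases"],
            "category_percent": 100.0 * r["hit_cases"] / r["cases"],
            "word_frequency": r["hits"],
            # Assumption: denominator is every word in the subgroup.
            "rate_per_10000": 10000.0 * r["hits"] / r["words"] if r["words"] else 0.0,
        }
        for group, r in totals.items()
    }

# Hypothetical cases tagged with one structured variable.
cases = [
    ("urban", "bank loan bank"),
    ("urban", "no topic words here"),
    ("rural", "loan"),
]
stats = subgroup_stats(cases, ["bank", "loan"])
for group, values in stats.items():
    print(group, values)
```

The first two statistics count cases and the last two count words, so a subgroup with a few very long cases can rank high on word frequency while ranking low on category percent.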

Right-clicking anywhere in the chart area displays a popup menu that allows one to edit the chart, save it to disk or to the Report Manager, or copy it to the clipboard. Clicking a specific bar or a data point of a line chart also allows one to retrieve text segments associated with the selected class and containing words of the selected topic.
