Classification Page - User's Guide

Automated text categorization is a supervised machine-learning task in which new documents are assigned to one or several predefined category labels, based on an inductive learning process performed on a set of previously classified documents. This machine-learning approach to classification has been known to achieve accuracy comparable, if not superior, to that of human coders, at a very low cost in manpower. It has been used to automatically classify documents into proper categories or to find relevant keywords describing the content and nature of a document. It has also been used to automatically file or re-route documents or messages to their appropriate destinations, to classify newspaper articles into proper sections or conference papers into relevant sessions, to filter emails or documents (as in spam filtering), or to route a specific request in an organization to the appropriate department. Automated text categorization may also be used to identify the author of a document of unknown or disputed authorship.

For a good overview of automated text classification, see Sebastiani (1999).

The automated text categorization module in WordStat allows one to apply either Naive Bayes or K-Nearest Neighbors learning algorithms on an existing textual database in order to develop a categorization model (or classifier). The program also provides features to test the accuracy of the classification and to optimize the various parameters. Once optimized, the obtained classification model may be used immediately to classify uncategorized documents or may be saved on disk to be applied later outside WordStat using the WordStat Document Classifier utility program. The classifier may also be incorporated into a desktop or web application or within a document management system using the WordStat Software Developer's Kit.

The development and application of a text classifier often involve the following steps:

1. Removal of function words, words that appear in only a few documents and words that appear too often.

2. Dimension reduction, through lemmatization, stemming, categorization, word clustering or other dimension-reduction techniques.

3. Feature selection, which consists of a selection of terms based on their capability to discriminate between categories of documents.

4. Training the classifier on the training set.

5. Testing the accuracy of the classification on a test set.

6. Applying the classifier to new documents.
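The six steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not WordStat's implementation: the stop list, the plural-stripping "stemmer", the class-spread feature score, the tiny data set and all function names are hypothetical stand-ins chosen to keep the example self-contained. The classifier is a plain multinomial Naive Bayes with add-one smoothing.

```python
import math
from collections import Counter, defaultdict

# Illustrative stop list only; a real exclusion list is far longer.
STOP_WORDS = {"a", "and", "as", "in", "of", "the", "to", "with"}

def tokenize(text):
    """Steps 1-2 (simplified): drop function words, then strip plural 's'."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    words = [w for w in words if w and w not in STOP_WORDS]
    # Crude plural stripping stands in for real lemmatization or stemming.
    return [w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words]

def select_features(docs, labels, top_n):
    """Step 3: keep the terms whose document rate differs most across classes."""
    class_sizes = Counter(labels)
    doc_freq = defaultdict(Counter)  # term -> class -> number of documents
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term][label] += 1

    def spread(term):
        rates = [doc_freq[term][c] / class_sizes[c] for c in class_sizes]
        return max(rates) - min(rates)

    ranked = sorted(doc_freq, key=lambda t: (-spread(t), t))
    return set(ranked[:top_n])

def train(docs, labels, vocab):
    """Step 4: collect the counts a multinomial Naive Bayes model needs."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(t for t in tokens if t in vocab)
    return class_docs, term_counts

def classify(tokens, model, vocab):
    """Steps 5-6: return the class with the highest log-probability."""
    class_docs, term_counts = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        total_terms = sum(term_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for t in tokens:
            if t in vocab:  # terms outside the selected features are ignored
                score += math.log((term_counts[label][t] + 1)
                                  / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

TRAIN = [
    ("The match ended with a late goal", "sports"),
    ("The striker scored two goals in the match", "sports"),
    ("The team won the final match", "sports"),
    ("The bank raised interest rates", "finance"),
    ("Markets fell as interest rates rose", "finance"),
    ("The bank reported higher profits", "finance"),
]
TEST = [
    ("A late goal won the match", "sports"),
    ("Interest rates hurt bank profits", "finance"),
]

train_docs = [tokenize(text) for text, _ in TRAIN]
train_labels = [label for _, label in TRAIN]
vocab = select_features(train_docs, train_labels, top_n=5)
model = train(train_docs, train_labels, vocab)

# Step 5: accuracy on a separate test set.
predictions = [classify(tokenize(text), model, vocab) for text, _ in TEST]
accuracy = sum(p == label for p, (_, label) in zip(predictions, TEST)) / len(TEST)
print(sorted(vocab), predictions, accuracy)

# Step 6: apply the classifier to a new, unclassified document.
print(classify(tokenize("The bank cut rates"), model, vocab))
```

The order of the functions mirrors the order of the steps: anything done in `tokenize` and `select_features` shrinks the vocabulary before the learning algorithm ever sees the data.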

While the basic content analysis features of WordStat may be used to deal with the first two steps, the Automated Text Categorization dialog box allows one to accomplish tasks related to the last four steps. This dialog box consists of four pages:

· The Select Features page allows one to apply various feature selection methods to select a subset of terms to be used by the classifier.

· The Learn & Test page is the location where machine-learning algorithms are set and tested. This page also allows the storing of classification models to disk.

· The History & Experiment page keeps track of every learning test performed during a session, allowing one to choose the best setting and algorithm for a specific classification task. It also gives access to a batch experiment dialog box that may be used to define numerous tests and perform them all at once.

· The Apply page is used to apply a classifier to an external document, a list of documents or to the current data file.

Accessing the Automated Text Classification dialog box

To develop a classifier, you first need to select the categorical variable containing the values you want to predict. While this can be done in WordStat, the most common way is to select it in QDA Miner or SimStat when calling WordStat. In QDA Miner, choose this categorical variable in the In Relation with section. In SimStat, assign this variable to the independent list box, and assign the text variables on which the prediction should be based to the dependent list box. Once in WordStat, set the various text processing options (such as lemmatization, the exclusion and categorization lists, and any other analysis options needed) to obtain the desired list of keywords or content categories. Then move to the Classification page.

Settings

The Settings page allows one to select which variable contains the values to predict and choose a validation method.

Selecting the variable to predict

To select the variable containing the values to predict, set the first list box to the name of this variable. If its name is not listed, choose the <Select Variables> item to display a list of all available variables, select the variable to predict and click OK. Then set the drop-down list to this newly added variable.

Selecting the validation method

The evaluation of a classifier consists of measuring its effectiveness at classifying documents that have already been classified. Those documents should not, however, be part of the training set used to develop the classification model, since testing on them would likely overestimate the real performance of the classifier. Yet training a classifier on only a portion of the available training set may result in a less than optimal classifier.

Cross-validation methods have been proposed as a compromise solution that allows one to develop a classification model on all the available documents in the training set yet provide a somewhat more realistic estimate of the classifier performance. WordStat offers three broad types of validation methods:

Leave-one-out - This cross-validation method consists of 1) removing a document from the training set, 2) developing a classification model on the remaining documents, 3) applying this model to predict the membership of this single document and 4) comparing the decision made by the classifier to the actual class to which this document belongs. This procedure is then repeated for each document in the training set and the different decisions are combined to estimate the performance of the classifier. While this method logically involves the computation of a large number of models and may seem to be time consuming, in practice the classification model is computed only once but adjusted analytically to remove the contribution of the test document prior to its classification. This cross-validation method will often overestimate the performance of a classifier if the training set includes duplicate documents or if included documents are not totally independent from one another.
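The analytic adjustment described above can be illustrated with a count-based model. The sketch below assumes a multinomial Naive Bayes classifier with add-one smoothing; how WordStat actually performs the adjustment is not documented here, and the function names and toy data are hypothetical. The counts are computed once on the full training set, and each document's own contribution is subtracted just before that document is classified.

```python
import math
from collections import Counter

def fit_counts(docs, labels):
    """Compute the Naive Bayes counts once, on the full training set."""
    class_docs = Counter(labels)
    term_counts = {label: Counter() for label in class_docs}
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
    vocab = {t for tokens in docs for t in tokens}
    return class_docs, term_counts, vocab

def predict_without(tokens, own_label, class_docs, term_counts, vocab):
    """Classify one document after removing its own counts from the model."""
    held_out = Counter(tokens)
    n_docs = sum(class_docs.values()) - 1
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        is_own = label == own_label
        n_class = class_docs[label] - (1 if is_own else 0)
        if n_class == 0:
            continue  # the class vanishes when its only document is held out
        counts = term_counts[label]
        total = sum(counts.values()) - (len(tokens) if is_own else 0)
        score = math.log(n_class / n_docs)
        for t in tokens:
            c = counts[t] - (held_out[t] if is_own else 0)
            # Simplification: the smoothing term keeps the full vocabulary size.
            score += math.log((c + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def loo_accuracy(docs, labels):
    """Leave-one-out accuracy without refitting the model for each document."""
    class_docs, term_counts, vocab = fit_counts(docs, labels)
    correct = sum(
        predict_without(tokens, label, class_docs, term_counts, vocab) == label
        for tokens, label in zip(docs, labels)
    )
    return correct / len(docs)

DOCS = [
    ["goal", "match", "team"],
    ["match", "goal", "score"],
    ["team", "match", "win"],
    ["bank", "rate", "loan"],
    ["rate", "market", "bank"],
    ["loan", "bank", "profit"],
]
LABELS = ["sports"] * 3 + ["finance"] * 3

print(loo_accuracy(DOCS, LABELS))
```

The same subtraction also shows why duplicates inflate the estimate: if an identical copy of the held-out document remains in the training set, its counts stay in the model and the held-out copy is classified almost as if it had never been removed.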

n-folds - This method consists of splitting the training set into smaller partitions and testing, on each partition, the classification performance obtained by a model developed on the remaining ones. For example, when using a five-fold cross-validation method, the training set is divided randomly into five subsets, each containing approximately 20% of the documents. For each subset, the program tests the accuracy obtained by a classification model developed on the remaining 80% of the original training set. The performances obtained on all five classifiers are then used to estimate the performance of the classifier computed on the full training set. WordStat provides a choice between five-fold, 10-fold and 20-fold cross-validation.
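As a rough sketch of the n-fold procedure, the Python code below shuffles the document indices, deals them into k folds, and scores each fold with a model built on the others. Averaging the per-fold accuracies is an assumption, since the text does not say how the fold results are combined. The function names, the fixed random seed and the majority-class baseline used as a stand-in classifier are all hypothetical.

```python
import random
from collections import Counter

def k_fold_indices(n_docs, k, seed=0):
    """Shuffle document indices and deal them into k near-equal folds."""
    indices = list(range(n_docs))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(n_docs, k, evaluate, seed=0):
    """Average the score of `evaluate(train_idx, test_idx)` over the k folds."""
    scores = []
    for test_idx in k_fold_indices(n_docs, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(n_docs) if i not in held_out]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k

# Toy data: 100 documents represented only by their class labels.
labels = ["a"] * 60 + ["b"] * 40

def majority_baseline(train_idx, test_idx):
    """Stand-in classifier: always predict the most frequent training class."""
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)

folds = k_fold_indices(len(labels), 5)
print([len(fold) for fold in folds])  # five folds of 20 documents each
print(cross_validate(len(labels), 5, majority_baseline))
```

Any real classifier can be plugged in through the `evaluate` callback, as long as it trains only on `train_idx` and is scored only on `test_idx`.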

External file - A more conventional method for assessing the performance of a classifier is to test the accuracy of the classifier on an entirely different set of documents that have also been classified but are totally independent of the training set on which the categorization model is based. To perform such a test, WordStat requires the test set to be stored in a different data file. When this option is selected, an Open File dialog box is displayed allowing one to identify the file containing the external set. WordStat then displays a dialog box like the one below allowing one to choose the text variable containing the documents to be used for classification and the numerical variable containing the class to which this document belongs. Once set, click OK to return to the classification page.

Once the variable and the validation method have been set, click the button to continue.

WordStat will compute all statistics needed, and will then automatically move to the Select Features page.
