
Learn & Test Page


The Learn & Test page is where machine-learning algorithms are chosen and tested, and from which the resulting classifier may be exported to disk.

The panel at the top of the page allows one to select the machine-learning algorithm to use, set various analysis options and choose how the classification model will be tested.

Learning options

METHODS - Two machine-learning algorithms are available for text classification:

The Naive Bayes algorithm classifies text by estimating the probability of each class given the presence or absence of specific words or keywords in the document to be classified. It first computes, for each term, the probability of occurring in documents of each class in the training set. It then combines the probabilities associated with the words found in the document being classified to estimate the probability that this document belongs to each class. Finally, it assigns the document to the class with the highest probability. A multinomial Naive Bayes model has been chosen to handle both binomial and multinomial classification tasks as well as binary and numerical item weights.
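
For illustration only, a minimal Python sketch of this procedure (multinomial Naive Bayes with Laplace smoothing, on invented documents and class names) might look as follows; it is not WordStat's actual implementation:

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """docs: list of token lists; labels: class name of each document.
    Returns log priors and Laplace-smoothed per-class log term probabilities."""
    classes = sorted(set(labels))
    vocab = {t for d in docs for t in d}
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, c in zip(docs, labels):
        counts[c].update(d)
    logprob = {}
    for c in classes:
        total = sum(counts[c].values()) + alpha * len(vocab)
        logprob[c] = {t: math.log((counts[c][t] + alpha) / total) for t in vocab}
    return prior, logprob

def classify_nb(doc, prior, logprob):
    """Combine per-word probabilities and pick the class with the highest score."""
    scores = {c: prior[c] + sum(logprob[c][t] for t in doc if t in logprob[c])
              for c in prior}
    return max(scores, key=scores.get)

# Invented two-class example.
docs = [["refund", "order", "late"], ["great", "service"], ["late", "delivery"]]
labels = ["complaint", "praise", "complaint"]
prior, logprob = train_nb(docs, labels)
print(classify_nb(["order", "late"], prior, logprob))  # -> complaint
```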

The k-Nearest Neighbor classification method compares a document to be classified with all documents in the training set, retrieves the k most similar documents, and then assigns the new document to the most common class in this retrieved set. This method is known to provide accurate classification when the training set is large enough, yet it can be very time-consuming because the entire training set must be compared and ranked for similarity with the test document. It also usually requires more storage space, since it must keep frequency information for all documents in the training set rather than just a few classification rules or mathematical formulas, as many other machine-learning methods do. However, WordStat uses a very efficient K-NN algorithm that drastically improves computing speed and reduces disk space and memory requirements. When this method is chosen, a NO edit box appears below the Method list box, allowing one to set the number of similar documents on which the classification will be based. Values no higher than 20 or 30 are typically used in text classification tasks.
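
The basic, unoptimized form of the procedure can be sketched as follows; cosine similarity over term-frequency vectors is assumed here, and WordStat's efficient K-NN algorithm is not reproduced:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def knn_classify(doc, train_docs, train_labels, k=10):
    """Rank every training document by similarity to the test document,
    keep the k most similar, and return the most common class among them."""
    vec = Counter(doc)
    sims = sorted(((cosine(vec, Counter(d)), lab)
                   for d, lab in zip(train_docs, train_labels)), reverse=True)
    top_labels = [lab for _, lab in sims[:k]]
    return Counter(top_labels).most_common(1)[0][0]
```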

USE - This option selects the item statistic to be used in training and classification. Choosing Case Occurrence results in binary weights, indicating whether or not a word or keyword occurs in the document. Selecting Keyword Frequency uses the additional information of how often the item occurs in each document. Percentage of Words and Percentage of Keywords provide two ways to normalize the obtained frequency for document length: the frequency is divided either by the total number of words found in the document or by the total number of keywords extracted by WordStat.
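
The four statistics can be sketched as follows, assuming each document is represented by the list of keywords extracted from it and by its total word count (the function name and dictionary keys are illustrative):

```python
def item_statistics(keywords, item, total_words):
    """Return the four possible weights for one item in one document.
    keywords: list of keywords extracted from the document."""
    freq = keywords.count(item)
    return {
        "case_occurrence": 1 if freq > 0 else 0,  # binary: present or absent
        "keyword_frequency": freq,                # raw count in the document
        "pct_of_words": freq / total_words if total_words else 0.0,
        "pct_of_keywords": freq / len(keywords) if keywords else 0.0,
    }

# Example: "price" occurs twice among 8 extracted keywords in a 200-word document.
print(item_statistics(["price", "service", "price", "delay",
                       "refund", "order", "late", "staff"], "price", 200))
```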

FEATURE WEIGHTING - Feature weighting has been presented as an alternative to feature selection, or as a way to further improve classification accuracy from a selected item set. It consists of giving more weight to items that are good at differentiating documents from distinct classes, and negligible weight to those that are distributed evenly among classes. The most frequently used weight in information retrieval is the TF*IDF measure, where the frequency of an item is adjusted to take into account the number of documents containing it. However, such a weighting is only a crude approximation of an item's capacity to differentiate documents from distinct classes. Better classifier performance can be expected from a weight based on a more direct indicator of this discriminative capability, such as the Global Chi-square or the Max Chi² described previously.
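
For illustration, a minimal TF*IDF computation might look as follows; a chi-square-based weight would replace the IDF factor with a per-item discrimination score computed against the class variable:

```python
import math

def tf_idf(freq, n_docs, doc_freq):
    """freq: item frequency in the document; n_docs: documents in the corpus;
    doc_freq: number of documents containing the item (assumed >= 1)."""
    idf = math.log(n_docs / doc_freq)  # rarer items receive higher weight
    return freq * idf

# An item occurring 3 times, found in 10 of 1000 documents:
print(tf_idf(3, 1000, 10))  # 3 * ln(100) ~ 13.8
```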

Results

A common way of assessing the accuracy of a classifier is to compare predicted class membership against actual membership. This information is provided by the Confusion Matrix, where each predicted class is plotted against the actual class. Accurate predictions fall on the diagonal running from the top left to the bottom right of the table; values on this diagonal are printed in bold characters for easy identification, while values in cells below or above it represent classification errors. Besides the actual number of documents in each cell, the table shows the row, column and total percentages. Row percentages represent the proportion of documents in a class that have been classified in a specific way, while column percentages express the proportion of documents given a specific prediction that actually belong to a known class. This table may be used to identify which classes are the easiest or hardest to predict, as well as which classification errors are the most common. To facilitate comparisons across the classes of the categorical variable, two related statistics are printed to the right of the table: Precision is the probability that documents identified as belonging to a class are correctly classified, and Recall is the probability that documents in a class are correctly identified.
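
Both statistics can be read directly from the matrix, as in this sketch (rows are actual classes, columns are predictions; the counts are invented):

```python
def precision_recall(matrix, classes):
    """matrix[i][j]: number of documents of actual class i predicted as class j."""
    stats = {}
    for i, c in enumerate(classes):
        col_total = sum(row[i] for row in matrix)  # everything predicted as c
        row_total = sum(matrix[i])                 # everything actually c
        stats[c] = {
            "precision": matrix[i][i] / col_total if col_total else 0.0,
            "recall": matrix[i][i] / row_total if row_total else 0.0,
        }
    return stats

# 40 of 50 class-A documents correctly predicted; 45 of 50 for class B.
m = [[40, 10],
     [5, 45]]
print(precision_recall(m, ["A", "B"]))
# A: precision 40/45 ~ 0.89, recall 40/50 = 0.80
```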

Several statistics are provided to assess the global performance of the classifier. The Nominal Accuracy measure is the proportion of documents correctly classified. It is considered a micro-average statistic since it gives equal weight to documents regardless of how they are distributed among the classes of the categorical variable. The Average Precision and Average Recall measures are macro-average statistics obtained by computing the mean precision and recall across all classes. The Ordinal Accuracy measure weights disagreements so that prediction errors count more heavily when the predicted value is far from the original value, while predictions closer to the original value are counted as partial disagreements.
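
A sketch of these three kinds of averages follows; the linear distance weighting used for ordinal accuracy is an assumption here (WordStat's exact weighting scheme may differ), and the class order is taken from the ordered_classes list:

```python
def global_stats(actual, predicted, ordered_classes):
    """actual, predicted: parallel lists of class labels.
    ordered_classes: classes in their ordinal order (at least two)."""
    n = len(actual)
    # Nominal Accuracy: micro-average, every document counts equally.
    nominal = sum(a == p for a, p in zip(actual, predicted)) / n
    # Macro averages: mean of per-class precision and recall.
    precisions, recalls = [], []
    for c in ordered_classes:
        tp = sum(a == c and p == c for a, p in zip(actual, predicted))
        pred_c = sum(p == c for p in predicted)
        act_c = sum(a == c for a in actual)
        precisions.append(tp / pred_c if pred_c else 0.0)
        recalls.append(tp / act_c if act_c else 0.0)
    avg_precision = sum(precisions) / len(ordered_classes)
    avg_recall = sum(recalls) / len(ordered_classes)
    # Ordinal Accuracy: errors count more the farther the prediction lies
    # from the actual value on the ordered scale (linear weighting assumed).
    rank = {c: i for i, c in enumerate(ordered_classes)}
    max_dist = len(ordered_classes) - 1
    ordinal = sum(1 - abs(rank[a] - rank[p]) / max_dist
                  for a, p in zip(actual, predicted)) / n
    return nominal, avg_precision, avg_recall, ordinal
```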

The Confusion List page presents the information already found in the confusion matrix, but in the form of a single list that makes it easier to identify the most common errors. The table may be sorted on the actual class of the documents, the predicted classification, the number of times a given classification error occurred, or the proportion of documents misclassified in this specific way. By default, the table is sorted in descending order of frequency. To sort the table on the values in another column, simply click that column header; clicking the same column header a second time sorts its content in descending order.

The Review Errors window displays a list of all documents that have been misclassified, allowing one to examine, for each document, the classification error made by the classifier as well as the computed values associated with every class of the categorical variable. A text window at the bottom of the list also allows one to review the text on which the classification was based and potentially identify some of the reasons why the document was misclassified.
