Document classification

This page describes some of the techniques used for document classification.

Let's suppose that you have gathered a collection of training documents in .txt or .html format. These documents are separated into three folders named "ham", "spam", and "auto_responses". You'd like to train some classification algorithm to decide in which folder new documents probably belong.

Since most classification algorithms are designed to operate on structured tables of information, the first step is to convert the collection of documents into a table format. This is typically done by assigning an attribute (column) to each possible word, and a row to each document. Each element in the table then holds a frequency-based value indicating how often a certain word occurs in a certain document.

As you might predict, this is going to be a humongous table! Fortunately, it is also going to be very sparse. That is, most elements in the table are going to have a value of 0. If we represent the table sparsely, storing only the non-zero elements, it stays small enough to work with.
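
To make the sparse idea concrete, here is a minimal C++ sketch (not part of the waffles API) of this kind of representation: each document becomes a map from column index to count, so zero entries are simply never stored. The two example documents are made up for illustration.

	#include <cstddef>
	#include <iostream>
	#include <map>
	#include <sstream>
	#include <string>
	#include <vector>

	int main()
	{
		std::map<std::string, size_t> vocab;         // word -> column index
		std::vector<std::map<size_t, double> > rows; // one sparse row per document
		const char* docs[] = { "cheap pills cheap", "meeting at noon" };
		for(const char* doc : docs)
		{
			std::map<size_t, double> row;
			std::istringstream words(doc);
			std::string w;
			while(words >> w)
			{
				// look up the word, assigning a fresh column if it is new
				size_t col = vocab.insert(std::make_pair(w, vocab.size())).first->second;
				row[col] += 1.0; // zero counts are never stored
			}
			rows.push_back(row);
		}
		std::cout << rows.size() << " documents, " << vocab.size() << " distinct words\n";
	}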

The "waffles_sparse" application contains a tool that will convert a bunch of documents to a sparse matrix in this manner:
	waffles_sparse docstosparsematrix ham spam auto_responses
This creates two files, named "features.sparse" and "labels.arff". features.sparse is a sparse feature matrix, where each row represents a document, and each element represents the occurrence of a particular word in that document. labels.arff is a single-column matrix that specifies the class (folder) of each document.

The values in each element of the matrix are computed as a/b*log(c/d), where 'a' is the number of times the word occurs in this document, 'b' is the max number of times this word occurs in any document, 'c' is the total number of documents, and 'd' is the number of documents that contain the word.
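
Here is a small C++ sketch of that arithmetic. The particular counts are hypothetical; in practice they would be gathered while scanning the document collection.

	#include <cmath>
	#include <iostream>

	int main()
	{
		double a = 3.0;   // times the word occurs in this document
		double b = 9.0;   // max times this word occurs in any one document
		double c = 500.0; // total number of documents
		double d = 20.0;  // number of documents that contain the word
		double weight = a / b * std::log(c / d);
		std::cout << weight << "\n"; // words rare across the collection get larger weights
	}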

By default, the words are stemmed using the Porter stemming algorithm. If you're not using English, this might just make a mess of your words. You can suppress this feature with the "-nostem" flag:

	waffles_sparse docstosparsematrix -nostem ham spam auto_responses

Now that you've got your data into a sparse matrix, you might want to shuffle the rows. This command will shuffle features.sparse and labels.arff in a manner that preserves corresponding rows:

	waffles_sparse shuffle features.sparse -labels labels.arff l_shuffled.arff > f_shuffled.sparse
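
If you are curious what "preserves corresponding rows" means, the following C++ sketch (not the waffles implementation) shows the usual approach: draw one random permutation of the row indices and apply it to both matrices.

	#include <algorithm>
	#include <cstddef>
	#include <iostream>
	#include <numeric>
	#include <random>
	#include <string>
	#include <vector>

	int main()
	{
		// stand-ins for the rows of the feature and label matrices
		std::vector<std::string> features = { "f0", "f1", "f2", "f3" };
		std::vector<std::string> labels = { "l0", "l1", "l2", "l3" };
		std::vector<size_t> perm(features.size());
		std::iota(perm.begin(), perm.end(), 0); // 0, 1, 2, 3
		std::shuffle(perm.begin(), perm.end(), std::mt19937(42));
		for(size_t i : perm) // one permutation, applied to both matrices
			std::cout << features[i] << " <-> " << labels[i] << "\n";
	}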

Next, let's use cross-validation to decide which model is the best. The following commands will divide f_shuffled.sparse into two files named train.sparse and test.sparse, and will divide l_shuffled.arff into two files named train.arff and test.arff:

	waffles_sparse splitfold f_shuffled.sparse 0 10
	waffles_transform splitfold l_shuffled.arff 0 10
This example shows how to divide the data for fold 0. To do ten-fold cross-validation, we would repeat these commands for each fold from 0 to 9.
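
If it helps to see the row arithmetic, this C++ sketch shows how fold f of n could carve a test block out of m rows. (That splitfold uses contiguous blocks is an assumption made for illustration.)

	#include <cstddef>
	#include <iostream>

	int main()
	{
		size_t m = 103; // total rows (hypothetical)
		size_t n = 10;  // number of folds
		for(size_t f = 0; f < n; f++)
		{
			size_t begin = f * m / n;     // first test row of this fold
			size_t end = (f + 1) * m / n; // one past the last test row
			std::cout << "fold " << f << ": test rows [" << begin
				<< ", " << end << "), all other rows train\n";
		}
	}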

Now, it's time to train a classifier. First, you need to select a classification algorithm that can be trained with a sparse matrix. These are the algorithms that inherit from GIncrementalLearner. To see the full list of choices, go to the API docs, open up the "class hierarchy", then navigate to GTransducer->GSupervisedLearner->GIncrementalLearner. GKNN and GNaiveBayes are probably the most common choices for this task.

Let's train a 7-nn instance learner that uses the cosine method to evaluate the similarity between sparse vectors:

	waffles_sparse train train.sparse train.arff knn 7 -cosine > model.json
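
The cosine method measures the angle between two sparse vectors: dot(a, b) / (|a||b|). Here is a C++ sketch (again, not the waffles implementation) that iterates only over the stored non-zero elements:

	#include <cmath>
	#include <cstddef>
	#include <iostream>
	#include <map>

	double cosine(const std::map<size_t, double>& a, const std::map<size_t, double>& b)
	{
		double dot = 0.0, na = 0.0, nb = 0.0;
		for(const auto& e : a)
		{
			na += e.second * e.second;
			auto it = b.find(e.first); // only words present in both contribute
			if(it != b.end())
				dot += e.second * it->second;
		}
		for(const auto& e : b)
			nb += e.second * e.second;
		return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-12); // guard against empty vectors
	}

	int main()
	{
		std::map<size_t, double> a = { { 0, 1.0 }, { 3, 2.0 } };
		std::map<size_t, double> b = { { 0, 2.0 }, { 5, 1.0 } };
		// a 7-nn learner would rank training documents by this score
		std::cout << cosine(a, b) << "\n";
	}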

And finally, let's test our trained model to see how well it classifies the documents in the test portion of our data:

	waffles_sparse test model.json test.sparse test.arff
This will print a predictive accuracy score to stdout.
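
Predictive accuracy is just the fraction of test rows whose predicted class matches the true label, as in this small sketch (the predictions shown are made up):

	#include <cstddef>
	#include <iostream>
	#include <vector>

	int main()
	{
		std::vector<int> predicted = { 0, 1, 2, 1, 0 }; // hypothetical model outputs
		std::vector<int> actual = { 0, 1, 1, 1, 0 };    // true labels from test.arff
		size_t correct = 0;
		for(size_t i = 0; i < predicted.size(); i++)
		{
			if(predicted[i] == actual[i])
				correct++;
		}
		std::cout << double(correct) / predicted.size() << "\n"; // prints 0.8
	}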

