Examples of supervised learning

This document shows examples of using supervised learning algorithms.

Train a decision tree

First, let's train a decision tree model using the zoo dataset. (You can get zoo.arff and many other datasets at MLData.org.) With this dataset, attribute 0 contains enough information to fully solve the problem. That's no fun, so we will ignore attribute 0 during training.

waffles_learn train zoo.arff -ignore 0 decisiontree > dt.json

Now, let's take a look at the decision tree that it has built.

waffles_plot printdecisiontree dt.json zoo.arff -ignore 0
It will output the following:
|
milk?
   |
   +false->feathers?
   |   |
   |   +false->backbone?
   |   |   |
   |   |   +false->airborne?
   |   |   |   |
   |   |   |   +false->predator?
   |   |   |   |   |
   |   |   |   |   +false->Is legs < 3?
   |   |   |   |   |   |
   |   |   |   |   |   +Yes->type=invertebrate
   |   |   |   |   |   |
   |   |   |   |   |   +No->type=insect
   |   |   |   |   |
   |   |   |   |   +true->type=invertebrate
   |   |   |   |
   |   |   |   +true->type=insect
   |   |   |
   |   |   +true->fins?
   |   |       |
   |   |       +false->aquatic?
   |   |       |   |
   |   |       |   +false->type=reptile
   |   |       |   |
   |   |       |   +true->eggs?
   |   |       |       |
   |   |       |       +false->type=reptile
   |   |       |       |
   |   |       |       +true->type=amphibian
   |   |       |
   |   |       +true->type=fish
   |   |
   |   +true->type=bird
   |
   +true->type=mammal
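
The printed tree is just a cascade of attribute tests. To make that concrete, here is a hand-transcribed Python version of the tree above (an illustration only; the real model lives in dt.json):

```python
# A hand-coded transcription of the decision tree printed above,
# expressed as nested if/else tests over the zoo attributes.

def classify(a):
    """a: dict of the zoo attributes used by the tree (booleans, plus 'legs')."""
    if a["milk"]:
        return "mammal"
    if a["feathers"]:
        return "bird"
    if a["backbone"]:
        if a["fins"]:
            return "fish"
        if a["aquatic"]:
            return "amphibian" if a["eggs"] else "reptile"
        return "reptile"
    if a["airborne"]:
        return "insect"
    if a["predator"]:
        return "invertebrate"
    return "invertebrate" if a["legs"] < 3 else "insect"

# Example: any animal that gives milk is classified as a mammal.
animal = {"milk": True, "feathers": False, "backbone": True, "fins": False,
          "aquatic": True, "eggs": True, "airborne": False, "predator": False,
          "legs": 4}
print(classify(animal))  # mammal
```

Reading the model this way also makes it easy to sanity-check individual branches by hand.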

Cross validation

Now, let's test some supervised learning algorithms. We'll use 50x2 cross-validation to test the predictive accuracy of various models on the iris dataset. We'll do it with baseline (which always predicts the most common class), a decision tree, an ensemble of 30 decision trees, a 3-NN instance learner, a 5-NN instance learner, naive Bayes, a perceptron (a neural network with no hidden layers), and a neural network with one hidden layer of 4 nodes.

waffles_learn crossvalidate -reps 50 -folds 2 iris.arff baseline
waffles_learn crossvalidate -reps 50 -folds 2 iris.arff decisiontree
waffles_learn crossvalidate -reps 50 -folds 2 iris.arff bag 30 decisiontree end
waffles_learn crossvalidate -reps 50 -folds 2 iris.arff knn -neighbors 3
waffles_learn crossvalidate -reps 50 -folds 2 iris.arff knn -neighbors 5
waffles_learn crossvalidate -reps 50 -folds 2 iris.arff naivebayes
waffles_learn crossvalidate -reps 50 -folds 2 iris.arff neuralnet
waffles_learn crossvalidate -reps 50 -folds 2 iris.arff neuralnet -addlayer 4

As an example, the output with the 5-NN model is:

Rep: 0, Fold: 0, Accuracy: 0.97333333333333
Rep: 0, Fold: 1, Accuracy: 0.89333333333333
Rep: 1, Fold: 0, Accuracy: 0.98666666666667
Rep: 1, Fold: 1, Accuracy: 0.96
Rep: 2, Fold: 0, Accuracy: 0.96
Rep: 2, Fold: 1, Accuracy: 0.94666666666667

...

Rep: 47, Fold: 0, Accuracy: 0.94666666666667
Rep: 47, Fold: 1, Accuracy: 0.96
Rep: 48, Fold: 0, Accuracy: 0.98666666666667
Rep: 48, Fold: 1, Accuracy: 0.96
Rep: 49, Fold: 0, Accuracy: 0.98666666666667
Rep: 49, Fold: 1, Accuracy: 0.98666666666667
-----
Attr: 4, Mean predictive accuracy: 0.96226666666667, Deviation: 0.019238476224803

Incidentally, you might be wondering why I use 50x2 cross-validation instead of the more common 1x10 cross-validation. It turns out that the latter is more prone to type-II error. To prove this, try both of the following commands:

waffles_learn crossvalidate -reps 1 -folds 10 iris.arff baseline

waffles_learn crossvalidate -reps 50 -folds 2 iris.arff baseline
Since the baseline algorithm always predicts the most common class label, and since the iris dataset has three class labels in equal proportion, the correctly-measured accuracy of baseline should come out to about 0.33333. As you can see by trying those two commands, 10-fold cross-validation incorrectly measures the accuracy as being somewhat lower. 2-fold cross-validation is less prone to this error, but it is more volatile. Thus, I like to perform many reps of 2-fold cross-validation to get an accurate estimate of predictive accuracy.
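
The reason for the pessimism is mechanical: with unstratified folds drawn from a balanced dataset, whichever class is most common in the training portion is necessarily under-represented in the held-out fold, so a majority-class predictor is systematically penalized. A quick Python simulation (a toy sketch, not Waffles code) shows the effect:

```python
import random
from collections import Counter

def baseline_cv_accuracy(labels, reps, folds, rng):
    """Mean accuracy of a majority-class (baseline) predictor under
    reps x folds cross-validation with random, unstratified folds."""
    accs = []
    for _ in range(reps):
        data = labels[:]
        rng.shuffle(data)
        for f in range(folds):
            test = data[f::folds]
            train = [x for i, x in enumerate(data) if i % folds != f]
            majority = Counter(train).most_common(1)[0][0]
            accs.append(sum(1 for x in test if x == majority) / len(test))
    return sum(accs) / len(accs)

# A balanced 3-class dataset like iris: the true baseline accuracy is 1/3.
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
rng = random.Random(0)
# Both estimates are averaged over 50 repetitions here, to expose the
# systematic bias rather than fold-to-fold noise.
acc_10fold = baseline_cv_accuracy(labels, reps=50, folds=10, rng=rng)
acc_2fold = baseline_cv_accuracy(labels, reps=50, folds=2, rng=rng)
print(acc_10fold, acc_2fold)  # the 10-fold estimate falls noticeably farther below 1/3
```

Both estimates sit below 1/3, but the 10-fold estimate sits much farther below, which matches the behavior of the two commands above.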

Neural networks

Let's do an example of learning with a neural network, since this model has a lot of tunable parameters. I like to begin by entering an incomplete command.

waffles_learn crossvalidate lungCancer.arff neuralnet -asdf
What does the "-asdf" flag do? Nothing. It is an error. I added a bogus flag to the end to make sure that it would fail to parse my command, and would print out the usage information for the neuralnet model. Here is what it prints:
Invalid neuralnet option: -asdf                                              

Partial Usage Information:

neuralnet 
   A single or multi-layer feed-forward neural network. It is trained with
   online backpropagation. Only continuous values are supported, so it is 
   common to wrap it in a nominaltocat filter so it can handle discrete   
   attributes too. It is also common to wrap that in a normalizing filter, to
   ensure that any continuous inputs are within a reasonable range.          
                                                                    
      -addlayer [size]                                                       
         Add a hidden layer with "size" logistic units to the network. You may
         use this option multiple times to add multiple layers. The first layer
         added is adjacent to the input features. The last layer added is      
         adjacent to the output labels. If you don't add any hidden layers, the
         network is just a single layer of sigmoid units.                      
      -learningrate [value]                                                    
         Specify a value for the learning rate. The default is 0.1             
      -momentum [value]                                                        
         Specifies a value for the momentum. The default is 0.0                
      -windowepochs [value]                                                    
         Specifies the number of training epochs that are performed before the 
         stopping criteria is tested again. Bigger values will result in a more
         stable stopping criteria. Smaller values will check the stopping      
         criteria more frequently.                                             
      -minwindowimprovement [value]                                            
         Specify the minimum improvement that must occur over the window of    
         epochs for training to continue. [value] specifies the minimum        
         decrease in error as a ratio. For example, if value is 0.02, then     
         training will stop when the mean squared error does not decrease by   
         two percent over the window of epochs. Smaller values will typically  
         result in longer training times.                                      
      -dontsquashoutputs                                                       
         Don't squash the outputs values with the logistic function. Just      
         report the net value at the output layer. This is often used for      
         regression.                                                           
      -crossentropy                                                            
         Use cross-entropy instead of squared-error for the error signal.
      -activation [func]
         Specify the activation function to use with all subsequently added
         layers. (For example, if you add this option after all of the
         -addlayer options, then the specified activation function will only
         apply to the output layer. If you add this option before all of the
         -addlayer options, then the specified activation function will be used
         in all layers. It is okay to use a different activation function with
         each layer, if you want.)
         logistic
            The logistic sigmoid function. (This is the default activation
            function.)
         arctan
            The arctan sigmoid function.
         tanh
            The hyperbolic tangent sigmoid function.
         algebraic
            An algebraic sigmoid function.
         identity
            The identity function. This activation function is used to create a
            layer of linear perceptrons. (For regression problems, it is common
            to use this activation function on the output layer.)
         bidir
            A sigmoid-shaped function with a range from -inf to inf. It
            converges at both ends to -sqrt(-x) and sqrt(x). This activation
            function is designed to be used on the output layer with regression
            problems instead of identity.
         gaussian
            A gaussian activation function
         sinc
            A sinc wavelet activation function

To see full usage information, run:
        waffles_learn usage

For a graphical tool that will help you to build a command, run:
        waffles_wizard
(If intentionally causing an error in order to obtain useful info makes you feel uncomfortable, you can also find this same information by doing either of the two commands suggested at the end of that error message.)

So now, using the helpful usage information that was just printed, we put together a command.

waffles_learn crossvalidate -seed 0 lungCancer.arff neuralnet -activation arctan -addlayer 100\
               -learningrate 0.2 -momentum 0.8 -minwindowimprovement 0.00001 -windowepochs 500
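
The -windowepochs and -minwindowimprovement flags describe a simple window-based stopping rule. Here is a hedged Python sketch of that rule (the function and the fake_epoch training stub are made-up names for illustration, not the Waffles implementation):

```python
def train_with_window_stopping(step, window_epochs, min_improvement,
                               max_epochs=100000):
    """Repeatedly call step() (one training epoch, returning the current MSE)
    and stop when the error fails to shrink by the given ratio over a window
    of epochs."""
    prev_err = step()
    epochs = 1
    while epochs < max_epochs:
        for _ in range(window_epochs):
            err = step()
            epochs += 1
        # Stop when the decrease over the window is below the required ratio.
        if prev_err - err < min_improvement * prev_err:
            break
        prev_err = err
    return epochs, err

# Toy "training": the error approaches an asymptote of 0.1, so the
# per-window improvement eventually falls below the threshold and we halt.
state = {"err": 1.0}
def fake_epoch():
    state["err"] = 0.1 + (state["err"] - 0.1) * 0.99
    return state["err"]

epochs, final_err = train_with_window_stopping(fake_epoch, window_epochs=100,
                                               min_improvement=0.02)
print(epochs, final_err)
```

Larger windows make the test more stable; smaller improvement thresholds make training run longer, just as the usage text says.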

The mean accuracy that I get is 0.481. That might not sound very good, but this is a rather hard problem. For comparison, let's see how random forest does with this dataset:

waffles_learn crossvalidate -seed 0 lungCancer.arff bag 64 decisiontree -random 1 end
It gets 0.338. It looks like the neural net actually did pretty well after all. In the hands of an expert, neural networks can be extremely powerful models. Unfortunately, they have a lot of parameters. In the hands of a novice, ensembles of decision trees are often a very good choice.

So, now that we've found a pretty good set of parameters, let's train a generalizing model.

waffles_learn train lungCancer.arff neuralnet -activation arctan -addlayer 100 -learningrate 0.2\
              -momentum 0.8 -minwindowimprovement 0.00001 -windowepochs 500 > model.json

Transduction

Some learning algorithms do not have a model. Such algorithms cannot be trained, but they can still be used to transduce. That is, they follow the patterns established by a labeled set to predict suitable labels for an unlabeled set. Let's take a look at some of these algorithms. In this example, we'll use the sonar dataset, which you can also download from MLData.org.

First, let's shuffle the data, and then split it into two parts:

waffles_transform shuffle sonar.arff > s_shuffled.arff
waffles_transform split s_shuffled.arff 100 s1.arff s2.arff

Next, we'll test the transductive accuracy using three different transduction algorithms. (Transduction can also be done with regular supervised learning models, so we'll throw in a couple of those at the end too.)

waffles_learn transacc -seed 0 s1.arff s2.arff agglomerativetransducer
waffles_learn transacc -seed 0 s1.arff s2.arff graphcuttransducer -neighbors 5
waffles_learn transacc -seed 0 s1.arff s2.arff neighbortransducer -neighbors 5
waffles_learn transacc -seed 0 s1.arff s2.arff decisiontree
waffles_learn transacc -seed 0 s1.arff s2.arff knn -neighbors 5
The first three algorithms transduce directly from s1.arff to predict labels for s2.arff without ever creating a model. The "decisiontree" and "knn -neighbors 5" models just train on s1.arff, then test on s2.arff, and then throw away their models.
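
To give some intuition for neighbor-based transduction, here is a toy Python sketch (a simplified nearest-neighbor label propagation, not the actual neighbortransducer algorithm): the unlabeled point closest to any labeled point is labeled first, and newly labeled points then help label their own neighbors.

```python
from collections import Counter

def dist(a, b):
    # Squared Euclidean distance between two 2-D points.
    return (a[0] - b[0])**2 + (a[1] - b[1])**2

def transduce(labeled, unlabeled, k=1):
    """labeled: list of ((x, y), label); unlabeled: list of (x, y).
    Repeatedly label the unlabeled point closest to any labeled point,
    so labels propagate through dense regions of the data."""
    labeled = list(labeled)
    remaining = list(unlabeled)
    result = {}
    while remaining:
        best = min(remaining, key=lambda p: min(dist(p, q) for q, _ in labeled))
        neighbors = sorted(labeled, key=lambda ql: dist(best, ql[0]))[:k]
        label = Counter(l for _, l in neighbors).most_common(1)[0][0]
        labeled.append((best, label))
        result[best] = label
        remaining.remove(best)
    return result

# Two clusters; only one point in each cluster is labeled.
labeled = [((0.0, 0.0), "rock"), ((10.0, 10.0), "mine")]
unlabeled = [(1.0, 0.5), (0.5, 1.2), (9.0, 9.5), (9.5, 10.2)]
print(transduce(labeled, unlabeled))
```

Each cluster inherits the label of its one labeled member, even for points that are far from the original labeled point, because labels chain through intermediate neighbors.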

The accuracies that I get are 0.741, 0.75, 0.75, 0.667, and 0.722, respectively. It looks like "graphcuttransducer" and "neighbortransducer" (each with 5 neighbors) are tied for the most accurate algorithm. The latter was somewhat faster, however, so I think that makes it the winner. So now let's use that algorithm to transduce labels.

waffles_learn transduce s1.arff s2.arff neighbortransducer -neighbors 5

Ensembles

Let's start with a Random Forest example:

waffles_learn crossvalidate iris.arff randomforest 30
The output will be something like:
Rep: 0, Fold: 0, Mean squared error: 0.066666666666667
Rep: 0, Fold: 1, Mean squared error: 0.04
Rep: 1, Fold: 0, Mean squared error: 0.08
Rep: 1, Fold: 1, Mean squared error: 0.04
Rep: 2, Fold: 0, Mean squared error: 0.08
Rep: 2, Fold: 1, Mean squared error: 0.093333333333333
Rep: 3, Fold: 0, Mean squared error: 0.053333333333333
Rep: 3, Fold: 1, Mean squared error: 0.066666666666667
Rep: 4, Fold: 0, Mean squared error: 0.066666666666667
Rep: 4, Fold: 1, Mean squared error: 0.053333333333333
Misclassification rate: 0.064
Predictive accuracy: 0.936

You can achieve exactly the same results with a bagging ensemble of random trees (because that's what a Random Forest is):

waffles_learn crossvalidate iris.arff bag 30 decisiontree -random 1 end
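
Bagging itself is simple: train each ensemble member on a bootstrap sample (drawn with replacement) of the training set, then combine the members' predictions by majority vote. Here is a hedged Python sketch of the mechanics, using a trivial 1-NN base learner on one feature rather than random trees:

```python
import random
from collections import Counter

def bag_train(data, n_models, base_train, rng):
    """Train n_models copies of a base learner, each on a bootstrap
    sample of the data (same size, drawn with replacement)."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]
        models.append(base_train(sample))
    return models

def bag_predict(models, x):
    # Combine the members' predictions by majority vote.
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

def train_1nn(sample):
    """A trivial base learner: 1-nearest-neighbor on a single feature."""
    return lambda x: min(sample, key=lambda p: abs(p[0] - x))[1]

rng = random.Random(0)
data = [(0.1, "a"), (0.2, "a"), (0.3, "a"), (0.8, "b"), (0.9, "b"), (1.0, "b")]
models = bag_train(data, n_models=30, base_train=train_1nn, rng=rng)
print(bag_predict(models, 0.25), bag_predict(models, 0.85))  # a b
```

A Random Forest is this same recipe with randomized decision trees as the base learner, which is why the two commands above behave identically.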

Of course, you can also put other things in the bag besides decision trees:

waffles_learn crossvalidate iris.arff bag 1 decisiontree 1 naivebayes 1 knn 5 1 knn 3 end

Another popular ensemble is a bucket. A bucket uses cross-validation to select the best model in the bucket for the task at hand.

waffles_learn crossvalidate iris.arff bucket decisiontree meanmarginstree naivebayes knn 3 knn 5 naiveinstance 32 end
If you've got computing cycles to burn, you might as well throw lots of models in the bucket; that way, you're likely to get high accuracy on every problem.
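
The bucket idea can be sketched as model selection by cross-validation: score each candidate on held-out splits of the training data, keep whichever scores best, and train the winner on all of the data. A toy Python illustration (the candidate learners here are made up for the example):

```python
from collections import Counter

def bucket_select(data, candidates, folds=2):
    """candidates: list of (name, train_fn), where train_fn(data) returns a
    predictor function. Returns the name of the candidate with the best
    cross-validated accuracy, plus that candidate trained on all the data."""
    def cv_score(train_fn):
        correct = total = 0
        for f in range(folds):
            test = data[f::folds]
            train = [p for i, p in enumerate(data) if i % folds != f]
            model = train_fn(train)
            correct += sum(1 for x, y in test if model(x) == y)
            total += len(test)
        return correct / total

    name, train_fn = max(candidates, key=lambda c: cv_score(c[1]))
    return name, train_fn(data)

# Two toy candidates: a majority-class predictor and 1-nearest-neighbor.
def train_majority(train):
    m = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: m

def train_1nn(train):
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

data = [(0.1, "a"), (0.2, "a"), (0.9, "b"), (1.0, "b"), (0.15, "a"), (0.95, "b")]
name, model = bucket_select(data, [("majority", train_majority), ("1nn", train_1nn)])
print(name, model(0.12))  # 1nn a
```

Here 1-NN wins the internal cross-validation easily, so the bucket returns it, just as a Waffles bucket picks whichever of its members measures best.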

You can even put bags in your buckets, and vice versa. The following model seems to be quite powerful.

waffles_learn crossvalidate iris.arff bucket bag 64 decisiontree end\
    bag 64 decisiontree -random 1 end bag 64 meanmarginstree end end

Despite the popularity of bagging, there are newer ensemble methods that usually outperform it. One such method is Bayesian Model Combination:

waffles_learn crossvalidate iris.arff bmc 30 decisiontree -random 1 end
BMC requires more computation than bagging, but I find that it is usually worth it.

Coding

The examples on this page are about using the Waffles command-line tools. When you are ready to integrate our learning algorithms into your code, see this document.

