Random Forest

  1. Skim Sections 2.3 and 2.5 of this book.


  2. Download my starter kit for Java or C++. Look at Main.java (or main.cpp) and make sure you understand what it is doing at a high level. Also, look at the data in the "data" folder. Note that this data contains both continuous and categorical values. The Matrix.loadARFF method loads these datasets. It converts categorical values to enumerated values. The Matrix.valueCount method returns the number of categories in a particular column. It returns 0 if the column holds continuous values.
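    For reference, a minimal sketch of loading and inspecting a dataset might look like the following. The file name is made up, and the exact Matrix signatures may differ slightly from the starter kit, so adjust as needed:
    Matrix features = new Matrix();
    features.loadARFF("data/some_features.arff"); // hypothetical file name
    for(int c = 0; c < features.cols(); c++)
    {
    	if(features.valueCount(c) == 0)
    		System.out.println("Column " + c + " is continuous");
    	else
    		System.out.println("Column " + c + " is categorical with " + features.valueCount(c) + " values");
    }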


  3. Implement a decision tree learner. I recommend starting by implementing the basic data structures you will need:
    abstract class Node
    {
    	abstract boolean isLeaf();
    }
    
    class InteriorNode extends Node
    {
    	int attribute; // which attribute to divide on
    	double pivot; // which value to divide on
    	Node a; // branch for samples that satisfy the division (e.g. value < pivot)
    	Node b; // branch for the remaining samples
    
    	boolean isLeaf() { return false; }
    }
    
    class LeafNode extends Node
    {
    	double[] label; // the label vector this leaf predicts
    
    	boolean isLeaf() { return true; }
    }
    
    class DecisionTree extends SupervisedLearner
    {
    	Node root;
    
    	// override train and predict from SupervisedLearner here
    }
    
    • Your DecisionTree class should inherit from the SupervisedLearner class.
    • The train method should build a decision tree. This is easiest to implement recursively.
    • Interior nodes will split on feature values. Leaf nodes will store a label vector.
    • The predict method should use the tree to make a prediction.
    • Training is done by recursively dividing the data until fewer than k samples remain, where k is a small constant you choose.
    • Your leaf nodes should store a label vector. Use the same approach that Baseline uses to combine the remaining label vectors into a single label vector.
    • Your tree should be able to handle both continuous and categorical attributes. (Note that they must be handled differently. You can use the Matrix.valueCount method to determine whether an attribute is continuous or categorical. When you divide into two Matrices, be sure to preserve the Matrix meta-data too, so it will know which attributes are continuous or categorical.)
    • Your interior nodes need only make binary divisions, so each interior node should store an attribute index and a value on which it divides. For continuous values, all samples less than the value split one way, and all samples greater-than-or-equal-to the value split the other way. For categorical values, all samples equal to the value split one way, and all samples not equal to the value split the other way.
    • Choose each division by picking a random attribute and a random sample from the remaining training data, and using that sample's value for the chosen attribute as the pivot. (A sketch of this recursion appears after this list.)
    • Make sure your tree is robust against bad splits. In other words, if all the data goes to the same side of the split, don't make an interior node for it; just try to find a better split. If you cannot split the data after several attempts, give up and make a leaf node.
    • You may assume there are no missing values in the data. (Lots of real-world data has missing values, but you don't have to handle that in this assignment.)
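    Here is a minimal sketch of the recursive training logic described above. It uses plain List<double[]> rows rather than the starter kit's Matrix class so it is self-contained, and buildTree and makeLeaf are illustrative names, not part of the starter kit:
    // A sketch only. Assumes: import java.util.ArrayList; import java.util.List; import java.util.Random;
    // valueCounts[j] holds Matrix.valueCount(j): 0 for continuous, otherwise the number of categories.
    Node buildTree(List<double[]> feats, List<double[]> labs, int[] valueCounts, Random rand, int k)
    {
    	if(feats.size() < k)
    		return makeLeaf(labs); // hypothetical helper: combine the remaining labels the way Baseline does
    
    	// Make several attempts to find a split that actually divides the data
    	for(int attempt = 0; attempt < 10; attempt++)
    	{
    		int attr = rand.nextInt(feats.get(0).length); // random attribute
    		double pivot = feats.get(rand.nextInt(feats.size()))[attr]; // a random sample's value for that attribute
    
    		List<double[]> fa = new ArrayList<>(), fb = new ArrayList<>();
    		List<double[]> la = new ArrayList<>(), lb = new ArrayList<>();
    		for(int i = 0; i < feats.size(); i++)
    		{
    			double v = feats.get(i)[attr];
    			boolean toA = (valueCounts[attr] == 0) ? (v < pivot) : (v == pivot); // continuous vs. categorical
    			if(toA) { fa.add(feats.get(i)); la.add(labs.get(i)); }
    			else { fb.add(feats.get(i)); lb.add(labs.get(i)); }
    		}
    
    		if(fa.size() > 0 && fb.size() > 0) // a useful split, so recurse on both halves
    		{
    			InteriorNode n = new InteriorNode();
    			n.attribute = attr;
    			n.pivot = pivot;
    			n.a = buildTree(fa, la, valueCounts, rand, k);
    			n.b = buildTree(fb, lb, valueCounts, rand, k);
    			return n;
    		}
    	}
    	return makeLeaf(labs); // could not find a useful split after several attempts
    }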


  4. Implement a RandomForest class.
    • It should inherit from the SupervisedLearner class.
    • The train method should instantiate and train n DecisionTree instances.
    • Generate new training data for each DecisionTree by sampling from the original training data with replacement. (This is called bootstrapping.) Then, train each DecisionTree on its bootstrapped data, as sketched below.
    • Give each DecisionTree an equal vote in the predictions. This is called aggregating.
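    A minimal sketch of this bagging logic, assuming SupervisedLearner declares train(Matrix, Matrix) and predict(double[], double[]); copyMetaData, addRowCopy, and combineVotes are placeholders for whatever your Matrix and label-combining code actually provide:
    // A sketch only. Assumes: import java.util.Random;
    class RandomForest extends SupervisedLearner
    {
    	DecisionTree[] trees;
    	Random rand = new Random();
    
    	RandomForest(int n)
    	{
    		trees = new DecisionTree[n];
    	}
    
    	void train(Matrix features, Matrix labels)
    	{
    		for(int t = 0; t < trees.length; t++)
    		{
    			// Bootstrap: draw features.rows() rows with replacement,
    			// preserving the Matrix meta-data in the copies
    			Matrix bf = new Matrix(); bf.copyMetaData(features);
    			Matrix bl = new Matrix(); bl.copyMetaData(labels);
    			for(int i = 0; i < features.rows(); i++)
    			{
    				int r = rand.nextInt(features.rows());
    				bf.addRowCopy(features, r);
    				bl.addRowCopy(labels, r);
    			}
    			trees[t] = new DecisionTree();
    			trees[t].train(bf, bl);
    		}
    	}
    
    	void predict(double[] in, double[] out)
    	{
    		// Aggregate: every tree gets an equal vote, and the votes are
    		// combined the same way Baseline combines label vectors
    		double[][] votes = new double[trees.length][out.length];
    		for(int t = 0; t < trees.length; t++)
    			trees[t].predict(in, votes[t]);
    		combineVotes(votes, out); // hypothetical helper
    	}
    }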


  5. Have your code evaluate the accuracy of Baseline, DecisionTree, and RandomForest with 30 trees against each of the three datasets provided in the starter kit. If your implementation is correct, DecisionTree should beat Baseline, and RandomForest should beat DecisionTree.
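    If you need to count misclassifications yourself, a loop like the following sketch is sufficient. The Matrix and SupervisedLearner method names assumed here (rows, cols, row, predict) may differ slightly from the starter kit, and it assumes a single categorical label column:
    int countMisclassifications(SupervisedLearner learner, Matrix testFeatures, Matrix testLabels)
    {
    	int miss = 0;
    	double[] pred = new double[testLabels.cols()];
    	for(int i = 0; i < testFeatures.rows(); i++)
    	{
    		learner.predict(testFeatures.row(i), pred);
    		if(pred[0] != testLabels.row(i)[0])
    			miss++;
    	}
    	return miss;
    }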


  6. Click on the "Project submission" link on the main class page to submit your archive file. If you have any difficulties with the submission server, please e-mail your submission directly to the grader.


Rubric:

Your results should be at least as good as these:
Misclassifications by Baseline at hep = 14/78
Misclassifications by Baseline at vow = 417/495
Misclassifications by Baseline at soy = 278/312
Misclassifications by DecisionTree at hep = 21/78
Misclassifications by DecisionTree at vow = 272/495
Misclassifications by DecisionTree at soy = 157/312
Misclassifications by RandomForest at hep = 16/78
Misclassifications by RandomForest at vow = 123/495
Misclassifications by RandomForest at soy = 52/312
If yours are slightly worse, try a few different random seeds. If yours are much worse, you have a bug to hunt down.



FAQ:

  1. Q: Are there any good ways to debug this?
    A: Yes. Try training and testing on the same data. You should get very high accuracy. The only patterns it should misclassify are ones where the same feature vector maps to different labels.


  2. Q: I'm still having trouble debugging this. How can I make it easier?
    A: Try training and testing with this simple data. If it fails to yield perfect accuracy, you have a bug, and the resulting decision tree should be small enough to be easy to debug.
    @relation simple_features
    @attribute x1 real
    @attribute x2 {a,b,c}
    @data
    1,b
    2,a
    3,c
    5,b
    
    @relation simple_labels
    @attribute class {x,y,z}
    @data
    x
    y
    x
    z
    


  3. Q: My trees cannot even predict the training data because they keep picking random splits that do not separate anything. How do I fix that?
    A: Make a few attempts to split the data before you give up and make a leaf node. Example:
    // Try up to 10 times to find a good split
    for(int attempts = 0; attempts < 10; attempts++)
    {
    	randomly_split_the_data(); // pseudocode: pick a random attribute and pivot, then divide into a and b
    	if(a.rows() > 0 && b.rows() > 0) // both halves are non-empty, so keep this split
    		break;
    }
    


  4. Q: Mine does well with some datasets, but not others. What does that mean?
    A: These datasets have different properties. For example, Hep contains mostly binary attributes, Vowel is mostly continuous, and Soy has categorical attributes with many categories. Considering these properties might help you identify where your bug is located.