Random Forest are one of the most powerful, fully automated, machine learning techniques. With almost no data preparation or modeling expertise, analysts can effortlessly obtain surprisingly effective models. “Random Forests” is an essential component in the modern data scientist’s toolkit and in this brief overview we touch on the essentials of this ground breaking methodology.
Random Forests was originally developed by UC Berkeley visionary Leo Breiman in a paper he published in 1999, building on a lifetime of influential contributions including the CART decision tree. As he continued to perfect random random forest he worked with his long time collaborator and former student Adela to develop final form of Random Forests including the sophisticated graphics permitting deeper data understanding.
Random Forest is a tool that leverages the power of many decision trees, judicious randomization, and ensemble learning to produce astonishingly accurate predictive models, insightful variable importance rankings, missing value imputations novel segmentation and laser -sharp reporting on a record-by record basis for deep understanding.
We start with a suitable collection of data including variables we would like to predict or understanding and relevant predictors.
Random Forest can be used to predict continuous variables such as sales of a product on a website or the loss predicted if an insured files a claim.
Random Forest can also be used to estimate the probability that a particular outcome occurs.
Outcomes can be “yes/no” events or one of several possibilities, such as which models of cell phone a customer will buy.
Could have many possible outcomes but typically multi-class problems have 8 or fewer outcomes.
Random Forest are collections of decisions trees that together produce predictions and deep insights into the structure of data.
The core building block of a Random Forest is a CART inspired decision tree.
Leo-Breiman’s earliest version of the random forest was the “bagger”
Imagine drawing a random sample from your main data base and building a decision tree of this random sample.
The “sample” typically would use half of the available data although it could be a different fraction of the master data base.
Now repeat the process. Draw a second different random sample and grow a second decision tree.
The predictions made by this second decision tree will typically be different (at least a little) than those of the first tree.
Continue generating more trees each built on a slightly different sample and generating at least slightly different predictions each time.
This process could be continued indefinitely but we typically grow 200 to 500 trees.
- Each of our trees will generate its own specific predictions for each record in the data base.
- To combine all of these separate predictions we can use either averaging or voting.
- For predicting an item such as sales volume we would average the predictions made by our trees.
- To predict a classification outcome such as “click| no-click” we can collect counts of votes. How many trees voted “click” vs how many “no-click” will determine prediction
- For classification we can also produce a predicted probability of each possible outcomes based on relative share of votes for each outcome.
Weakness of Bagger
- The process just described is known as the “bagger”
- The bagger represented a distinct advance in machine learning when it was first introduced in 1994.
- Brieman discovered that the bagger was a good machine learning method but not as accurate as he had hoped for.
- Analyzing the details of many models he came to the conclusion that the trees in the bagger were too similar to each other
- His repair was to find a way to make the trees dramatically more different.
Putting Randomness into Random Forest
- Breiman’s key new idea was to introduce randomness not just into the training samples but also into the actual tree growing as well
- In growing a decision tree we normally conduct exhaustive searches across all possible predictors to find the best possible partition of data in each node of the tree.
- Suppose that instead of always picking the best splitter we picked the splitter at random.
- This would guarantee that different trees would be quite dissimilar to each other.
How Random Should a Random Forest Be?
- At one extreme, if we pick every splitter at random we obtain randomness everywhere in the tree.
- This usually does not perform very well
- A less extreme method is to first select a subset of candidate predictors at random and then produce the split by selecting the best splitter actually available.
- If we had 1,000 predictors we might select a random set of 30 in each node and then split using the best predictor among the 30 available instead of the best among the full 1,000.
- Beginners often assume that we select a random subset of predictors once at the start of the analysis and then grow the whole tree using this subset.
- This is how random forest work
- In Random Forest we select a new random subset of predictors in each node of a tree.
- Completely different subsets of predictors may be considered in different nodes.
- If the tree grows large then by the end of the process a rather large number of predictions have had a chance to influence the tree.
Controlling Degree Of Randomness
- If we always search all predictors in every node of every tree we are building bagger models typically not so impressive in their performance.
- Models will usually improve if we search fewer than all the variables in each node restricting attention to a random subset.
- How many variables to consider is a key control and we need to experiment to learn the best value
- Breiman suggested starting with the square root of the number of available predictors.
- Allowing just one variable to be searched in each node almost always yields inferior results but often allowing 2 or 3 instead can yield impressive results.
How Many Predictors in Each Node
|N Predictors||Sqrt||.5 Sqrt||2*sqrt||ln2|
In the table above we show some of the values Breiman and Cutler advised. They suggested four possible rules: Square root of the total number of predictors, or one half or twice the square root and log base2.
We recommend experimenting with some other values. The value chosen remains in effect for the entire forest and remains the same for every node of every tree grown.
Random Forests Predictions
- Predictions will be generated for a forest as we did for the bagger, that is, by averaging or voting.
- If you want you can obtain the prediction made by each tree and save it to a database or spreadsheet.
- You could then create your own custom weighted averages or make use of the variability of the individual tree predictions
- Example a record that is predicted to show a sales volume in a relatively narrow range across all trees is less uncertain than one that has the same average prediction but a large variation in individual tree predictions.
- It is easiest to just let Random Forests do the work for you and save final predictions.
Out Of Bag (OOB ) Data
- If we sample from our available training data before growing a tree then we automatically have hold out data available (for that tree)
- In random forests this holdout data is known as “Out Of Bag” data
- There is no need to be concerned about the rationale for this terminology at this point
- Every tree we grow has a different holdout sample associated with it because every tree has a different training sample
- Alternatively every record in the master database will be “in bag” for some trees (used to train that tree) and “out of bag” for other trees (not used to grow the tree)
Testing And Evaluating
- Keeping track of for which trees a specific record was OOB allows us to easily and effectively evaluate forest performance
- Suppose that a given record was in bag for 250 trees and out-of- bag for another 250 trees.
- We could generate predictions for this specific record using just the OOB trees.
- The result would give us an honest assessment of the reliability of the forest since the record was never used to generate any of the 250 trees.
- Always having OOB data means that we can effectively work with relatively small records
More Testing & Evaluation
- We can use the OOB idea for every record in the data
- Note that every record is evaluated on its own specific subset of OOB trees and typically no two records would share the identical pattern of in-bag versus out of bag trees.
- We would always reserve some additional data as a traditional hold out sample but this is not necessary for Random Forests
- The idea of OOB testing is an essential component of Random Forest data analytics
Testing Vs … Scoring
- For model Assessment using OOB data we use a subset of trees (the OOB trees) to make predictions for each record.
- When Forecasting or scoring new data we would make use of every tree in the forest as no tree would have been built using the new data
- Typically this means that scoring yields better performance than indicated by the internal OOB results
- The reason is that in scoring we can leverage the full forest and thus benefit from averaging the predictions of a much large number of trees.
Random Forest & Segmentation
- Another essential idea in Random Forest analysis is that of “proximity” or the closeness of one data record to another
- Consider two records of data selected from our database. We would like to know how similar these records are to each other
- Drop the pair of records down each tree and note if the two end up in the same terminal node or not
- Count the number of times the record “match” and divided by the number of trees tested.
- We can count the number of matches found in this way for every pair of records in the data
- This produces a possibly very large matrix . A 1000 record data base would produce a 1000*1000 matrix with 1 million elements.
- Each entry in the matrix displays how close two data records are to each other
- Need to keep the size of this matrix in mind if we wish to leverage the insights into the data it provides.
- To keep our measurements honest we can be selective in how we use the trees. Instead of using every tree for every pair of records we could use only trees in which one or both records are OOB.
- This does not affect matrix size but affects the reliability of the proximity measures.
Characteristics of the Proximity Matrix
- The random Forests proximity matrix has some important advantages fewer traditional near neighbor measurements.
- Random Forest naturally handles mixtures of continuously and categorical data.
- There is no need to come up with a measurement of nearness that applies to a specific variable. The forest works with all variables together to measure nearness or distance directly.
- Missing values are also no problem as they are handled automatically in tree construction.
- Breiman and Cutler made use of the proximity matrix in various ways
- One use is in the identification of “outliers “
- An outlier is a data value that is noticeably different than we would expect given all other relevant information
- As such an outlier would be distant from records that to which we would expect it to be close
- We would expect records that are “events” to be closer to other “events” than to non-events
- Records that do not have any appropriate neighbors are natural candidates to be outliers
- Random Forests produces an “outliers score” for each record
- Ideally we would like to plot the records in our data to reveal clusters and outliers
- Might also have a cluster of outliers which would be best detected visually.
- In random forests we do this by plotting projections of the proximity matrix to a 3D approximation.
- These graphs can suggest how many clusters appear naturally in the data (at least if there are only a few)
- We show such a graph later in there notes
- Classic random forests offer two approaches to missing values
- The simple method and default is to just fill in the missing values with overall simple for means or most common values for categorical predictors.
- In this approach eg, all records with missing “AGE” would be filled in with the same average value
- While CRUD, simple method works well due to enormous amount of randomization and averaging associated with any forest.
Proximity and Missing Values
- The second “Advanced” way to deal with missing values involves several repetitions of forest building
- We start with the simple method and produce the proximity matrix
- We than replace the simple imputations in the data with the new imputations
- Instead of using unweighted averages to calculate the imputation we weight the data by proximity
- To impute a missing value for x for a specific record we essentially look at the good values of x among the records closest to the one with the missing value.
- Each record of data could thus obtain a unique imputed value.
- The advanced method is actually very common sense
- Suppose we are missing the age of a specific customer
- We use the forest to identify how close the record in question is to all other records
- Impute by producing a weighted average of the age of other customers most “like” the one needing imputation.
- Random forest include an innovative method to measure the relative importance of any predator
- Method is based on measuring the damage that would be done to our predictive models if we lost access to true values of a given variable.
- To simulate losing access to a prediction we randomly scramble its values in the data i.e we move the value belonging to a specific row of data to another row
- We scramble just one predictor at a time and measure the consequential loss in predictive accuracy
Variable Importance Details
- If we scramble the value of the variables just once and than measure the damage done to predictive performance we would be reliant on a single randomization.
- In random forests we re-scramble the data anew for the predictor being tested in every tree in forest.
- We therefore free ourselves from the dependence on the luck of single draw.
- If we re-scramble a predictor 500 times in front of 500 trees the results should be highly reliable
Variable importance issues
- If our data includes several alternative measure of the same concept than scrambling just one of these at a time might result in very little damage to modal performance
- For example if we have several credit risk scores, we might be fooled into thinking a single one of them is unimportant
- repeating the scrambling test separately for each credit score could yield the conclusion that each considered separately is unimportant.
- It may thus be important to eliminate this kind of redundancy in the predictors used before putting too much faith in the importance ranking.
Variable Importance : Final Observation
- The data scrambling approach to measuring variable importance is based on the impact of losing access to information on modal performance
- But a variable is not necessarily unimportant just because we can do well without it.
- Need to be aware that a prediction, if available, will be used by models, but if not available then substitute variables can be used instead.
- The ‘gini’ measure is based on the actual role of a predictor & offers an alternative importance assessment based on the role the predictor plays in the data
- In our discussion so far we have suggested that the sampling technique underpinning Random Forest is the drawing of a random 50% of the available data for each tree
- This style of sampling is very easy to understand and is a reasonable way to develop a random forest
- However, bootstrap sampling and random half sampling are similar enough that we do not need to delve into the details here.
The technical Algorithm
- Let the number of training cases be N, and the number of variables in the classifier be M
- We are told the number n of input variables to be used to determine the decision at a node of the tree; m should be less and even much less than M
- Choose a training set for this tree by choosing n times with replacement from all N available training cases( i.e take a bootstrap sample). Use the rest of the cases to estimate the error of the tree by predicting their classes (OOB Data)
- For each node of the tree , randomly choose n variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set.
- Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier)
- For predicting a new sample is pushed down the tree . It is assigned the label of the training sample in the terminal node ends up in.
This procedure is iterated over all trees in the ensemble and the mode vote of all the trees is reported as the random forest prediction.