Random Forest – Beginners

Random Forests are one of the most powerful, fully automated machine learning techniques. With almost no data preparation or modeling expertise, analysts can obtain surprisingly effective models. Random Forests are an essential component of the modern data scientist’s toolkit, and in this brief overview we touch on the essentials of this groundbreaking methodology.

Random Forests was originally developed by UC Berkeley visionary Leo Breiman in a paper he published in 1999, building on a lifetime of influential contributions including the CART decision tree. As he continued to refine random forests he worked with his long-time collaborator and former student Adele Cutler to develop the final form of Random Forests, including the sophisticated graphics that permit deeper data understanding.

Random Forest is a tool that leverages the power of many decision trees, judicious randomization, and ensemble learning to produce astonishingly accurate predictive models, insightful variable importance rankings, missing value imputations, novel segmentations, and laser-sharp record-by-record reporting for deep understanding.


We start with a suitable collection of data, including the variables we would like to predict or understand and the relevant predictors.

Random Forest can be used to predict continuous variables, such as the sales of a product on a website or the expected loss if an insured files a claim.

Random Forest can also be used to estimate the probability that a particular outcome occurs.

Outcomes can be “yes/no” events or one of several possibilities, such as which model of cell phone a customer will buy.

A problem could have many possible outcomes, but typical multi-class problems have eight or fewer.

The Essentials

Random Forests are collections of decision trees that together produce predictions and deep insights into the structure of data.

The core building block of a Random Forest is a CART-inspired decision tree.

Leo Breiman’s earliest version of the random forest was the “bagger”.

Imagine drawing a random sample from your main database and building a decision tree on this random sample.

The sample would typically use half of the available data, although it could be a different fraction of the master database.

Now repeat the process. Draw a second different random sample and grow a second decision tree.

The predictions made by this second decision tree will typically differ (at least a little) from those of the first tree.

Continue generating more trees, each built on a slightly different sample and each generating at least slightly different predictions.

This process could be continued indefinitely, but we typically grow 200 to 500 trees.


  • Each of our trees will generate its own specific predictions for each record in the database.
  • To combine all of these separate predictions we can use either averaging or voting.
  • For predicting an item such as sales volume we would average the predictions made by our trees.
  • To predict a classification outcome such as “click / no-click” we can collect counts of votes: how many trees voted “click” versus how many voted “no-click” determines the prediction.
  • For classification we can also produce a predicted probability of each possible outcome based on the relative share of votes for each outcome.
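The averaging and voting steps above can be sketched in Python (stdlib only; the per-tree predictions below are toy values, not output from real trees):

```python
from collections import Counter

def average_predictions(per_tree_preds):
    """Combine regression predictions by averaging across trees."""
    n_trees = len(per_tree_preds)
    n_records = len(per_tree_preds[0])
    return [sum(tree[i] for tree in per_tree_preds) / n_trees
            for i in range(n_records)]

def vote_predictions(per_tree_preds):
    """Combine classification predictions by majority vote, and report
    the vote share (predicted probability) for each outcome."""
    results = []
    n_trees = len(per_tree_preds)
    for i in range(len(per_tree_preds[0])):
        votes = Counter(tree[i] for tree in per_tree_preds)
        winner, _ = votes.most_common(1)[0]
        shares = {label: c / n_trees for label, c in votes.items()}
        results.append((winner, shares))
    return results

# Three toy "trees", each predicting two records.
reg_preds = [[10.0, 3.0], [12.0, 5.0], [11.0, 4.0]]
print(average_predictions(reg_preds))          # [11.0, 4.0]

cls_preds = [["click", "no-click"],
             ["click", "no-click"],
             ["no-click", "no-click"]]
print(vote_predictions(cls_preds))
```

Note that the vote shares double as the predicted class probabilities mentioned in the last bullet.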

Weakness of Bagger

  • The process just described is known as the “bagger”
  • The bagger represented a distinct advance in machine learning when it was first introduced in 1994.
  • Breiman discovered that the bagger was a good machine learning method but not as accurate as he had hoped.
  • Analyzing the details of many models, he came to the conclusion that the trees in the bagger were too similar to each other.
  • His repair was to find a way to make the trees dramatically more different.

Putting Randomness into Random Forest

  • Breiman’s key new idea was to introduce randomness not just into the training samples but also into the actual tree growing.
  • In growing a decision tree we normally conduct exhaustive searches across all possible predictors to find the best possible partition of data in each node of the tree.
  • Suppose that instead of always picking the best splitter we picked the splitter at random.
  • This would guarantee that different trees would be quite dissimilar to each other.

How Random Should a Random Forest Be?

  • At one extreme, if we pick every splitter at random we obtain randomness everywhere in the tree.
  • This usually does not perform very well.
  • A less extreme method is to first select a subset of candidate predictors at random and then produce the split by selecting the best splitter actually available.
  • If we had 1,000 predictors we might select a random set of 30 in each node and then split using the best predictor among the 30 available instead of the best among the full 1,000.
  • Beginners often assume that we select a random subset of predictors once at the start of the analysis and then grow the whole tree using this subset.
  • This is not how Random Forests work.
  • In Random Forest we select a new random subset of predictors in each node of a tree.
  • Completely different subsets of predictors may be considered in different nodes.
  • If the tree grows large, then by the end of the process a rather large number of predictors will have had a chance to influence the tree.
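The per-node randomization can be sketched as follows (stdlib Python; the score function standing in for a real impurity-based split search is an assumption of this toy):

```python
import random

def choose_splitter(node_data, predictors, mtry, score_fn, rng=random):
    """At each node draw a FRESH random subset of `mtry` predictors and
    return the best splitter among that subset only (not among all)."""
    candidates = rng.sample(predictors, mtry)
    return max(candidates, key=lambda p: score_fn(node_data, p))

# Toy illustration: 1,000 predictors, each with a fixed "split quality".
rng = random.Random(0)
predictors = [f"x{i}" for i in range(1000)]
quality = {p: rng.random() for p in predictors}
best = choose_splitter(None, predictors, mtry=30,
                       score_fn=lambda data, p: quality[p], rng=rng)
print(best)  # best predictor among the 30 sampled in this node
```

Calling this function again for the next node draws a completely different candidate subset, which is exactly what makes the trees dissimilar.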

Controlling Degree Of Randomness

  • If we always search all predictors in every node of every tree we are building bagger models, which are typically not so impressive in their performance.
  • Models will usually improve if we search fewer than all the variables in each node restricting attention to a random subset.
  • How many variables to consider is a key control, and we need to experiment to learn the best value.
  • Breiman suggested starting with the square root of the number of available predictors.
  • Allowing just one variable to be searched in each node almost always yields inferior results, but allowing 2 or 3 can often yield impressive results.

How Many Predictors in Each Node

N Predictors     Sqrt    0.5*Sqrt    2*Sqrt    Log2
100                10           5        20       6
1,000              31        15.5        62       9
10,000            100          50       200      13
100,000           316         158       632      16
1,000,000       1,000         500     2,000      19

In the table above we show some of the values Breiman and Cutler advised. They suggested four possible rules: the square root of the total number of predictors, one half the square root, twice the square root, and log base 2.

We recommend experimenting with other values as well. The value chosen remains in effect for the entire forest, the same for every node of every tree grown.
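The four starting rules are easy to compute (a small stdlib helper; rounding down to whole variables is our own choice, so a few entries differ slightly from the table):

```python
import math

def mtry_candidates(n_predictors):
    """Candidate node-search sizes: sqrt, half sqrt, twice sqrt, log2."""
    root = math.sqrt(n_predictors)
    return {
        "sqrt": int(root),
        "half_sqrt": int(0.5 * root),
        "twice_sqrt": int(2 * root),
        "log2": int(math.log2(n_predictors)),
    }

for n in (100, 1000, 10000, 100000, 1000000):
    print(n, mtry_candidates(n))
```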

Random Forests Predictions

  • Predictions will be generated for a forest as we did for the bagger, that is, by averaging or voting.
  • If you want you can obtain the prediction made by each tree and save it to a database or spreadsheet.
  • You could then create your own custom weighted averages or make use of the variability of the individual tree predictions
  • For example, a record whose predicted sales volume falls in a relatively narrow range across all trees is less uncertain than one that has the same average prediction but large variation in individual tree predictions.
  • It is easiest to just let Random Forests do the work for you and save final predictions.

Out Of Bag (OOB) Data

  • If we sample from our available training data before growing a tree, then we automatically have holdout data available (for that tree).
  • In random forests this holdout data is known as “Out Of Bag” data
  • There is no need to be concerned about the rationale for this terminology at this point
  • Every tree we grow has a different holdout sample associated with it because every tree has a different training sample
  • Conversely, every record in the master database will be “in bag” for some trees (used to train those trees) and “out of bag” for others (not used to grow the tree).

Testing And Evaluating 

  • Keeping track of the trees for which a specific record was OOB allows us to easily and effectively evaluate forest performance.
  • Suppose that a given record was in bag for 250 trees and out-of-bag for another 250 trees.
  • We could generate predictions for this specific record using just the 250 OOB trees.
  • The result gives us an honest assessment of the reliability of the forest, since the record was never used to build any of those 250 trees.
  • Always having OOB data means that we can work effectively even with relatively small datasets.
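The in-bag/out-of-bag bookkeeping can be sketched as follows (stdlib Python; the “trees” are reduced to their training samples, since only the sampling matters here):

```python
import random

def oob_evaluate(n_records, n_trees, sample_fraction=0.5, seed=42):
    """For each tree, record which records were NOT in its training
    sample, then count how many OOB trees each record has available
    for honest evaluation."""
    rng = random.Random(seed)
    oob_counts = [0] * n_records
    for _ in range(n_trees):
        in_bag = set(rng.sample(range(n_records),
                                int(n_records * sample_fraction)))
        for i in range(n_records):
            if i not in in_bag:
                oob_counts[i] += 1  # this tree may score record i honestly
    return oob_counts

counts = oob_evaluate(n_records=10, n_trees=500)
# With half-sampling, each record is OOB for roughly half the trees.
print(counts)
```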

More Testing & Evaluation

  • We can use the OOB idea for every record in the data
  • Note that every record is evaluated on its own specific subset of OOB trees and typically no two records would share the identical pattern of in-bag versus out of bag trees.
  • We could still reserve some additional data as a traditional holdout sample, but this is not necessary for Random Forests.
  • The idea of OOB testing is an essential component of Random Forest data analytics

Testing vs. Scoring

  • For model assessment using OOB data we use a subset of trees (the OOB trees) to make predictions for each record.
  • When forecasting or scoring new data we make use of every tree in the forest, as no tree will have been built using the new data.
  • Typically this means that scoring yields better performance than indicated by the internal OOB results.
  • The reason is that in scoring we can leverage the full forest and thus benefit from averaging the predictions of a much larger number of trees.

Random Forest & Segmentation

  • Another essential idea in Random Forest analysis is that of “proximity”, or the closeness of one data record to another.
  • Consider two records of data selected from our database. We would like to know how similar these records are to each other
  • Drop the pair of records down each tree and note whether the two end up in the same terminal node.
  • Count the number of times the records match and divide by the number of trees tested.
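Given each record’s terminal-node (leaf) index in every tree, the proximity of a pair of records is just the fraction of trees in which they share a leaf (stdlib Python sketch; the leaf indices below are toy values rather than output of fitted trees):

```python
def proximity(leaves_a, leaves_b):
    """leaves_a[t] is the terminal node that record A reached in tree t.
    Proximity = fraction of trees in which both records share a leaf."""
    assert len(leaves_a) == len(leaves_b)
    matches = sum(1 for la, lb in zip(leaves_a, leaves_b) if la == lb)
    return matches / len(leaves_a)

# Two records dropped down 5 trees; they land in the same leaf 3 times.
a = [4, 7, 2, 9, 5]
b = [4, 7, 3, 9, 8]
print(proximity(a, b))  # 0.6
```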

Proximity Matrix

  • We can count the number of matches found in this way for every pair of records in the data
  • This produces a possibly very large matrix: a 1,000-record database would produce a 1,000 × 1,000 matrix with one million elements.
  • Each entry in the matrix displays how close two data records are to each other
  • We need to keep the size of this matrix in mind if we wish to leverage the insights into the data it provides.
  • To keep our measurements honest we can be selective in how we use the trees. Instead of using every tree for every pair of records we could use only trees in which one or both records are OOB.
  • This does not affect matrix size but affects the reliability of the proximity measures.

Characteristics of the Proximity Matrix

  • The Random Forests proximity matrix has some important advantages over traditional nearest-neighbor measurements.
  • Random Forest naturally handles mixtures of continuous and categorical data.
  • There is no need to come up with a measurement of nearness that applies to a specific variable. The forest works with all variables together to measure nearness or distance directly.
  • Missing values are also no problem as they are handled automatically in tree construction.

Proximity Insights

  • Breiman and Cutler made use of the proximity matrix in various ways
  • One use is in the identification of “outliers”.
  • An outlier is a data value that is noticeably different from what we would expect given all other relevant information.
  • As such, an outlier will be distant from the records to which we would expect it to be close.
  • We would expect records that are “events” to be closer to other “events” than to non-events.
  • Records that do not have any appropriate neighbors are natural candidates to be outliers.
  • Random Forests produces an “outlier score” for each record.

Proximity Visualization

  • Ideally we would like to plot the records in our data to reveal clusters and outliers
  • We might also have a cluster of outliers, which would be best detected visually.
  • In Random Forests we do this by plotting low-dimensional (e.g., 3D) projections of the proximity matrix.
  • These graphs can suggest how many clusters appear naturally in the data (at least if there are only a few).
  • We show such a graph later in these notes.

Missing Values

  • Classic random forests offer two approaches to missing values.
  • The simple and default method is to fill in missing values with the overall mean (for continuous predictors) or the most common value (for categorical predictors).
  • In this approach, e.g., all records with missing “AGE” would be filled in with the same average value.
  • While crude, the simple method works well due to the enormous amount of randomization and averaging associated with any forest.

Proximity and Missing Values

  • The second “Advanced” way to deal with missing values involves several repetitions of forest building
  • We start with the simple method and produce the proximity matrix
  • We then replace the simple imputations in the data with new imputations.
  • Instead of using unweighted averages to calculate the imputation, we weight the data by proximity.
  • To impute a missing value of x for a specific record, we essentially look at the known values of x among the records closest to the one with the missing value.
  • Each record of data could thus obtain a unique imputed value.

Missing Imputation 

  • The advanced method is really just common sense.
  • Suppose we are missing the age of a specific customer
  • We use the forest to identify how close the record in question is to all other records
  • We impute by producing a weighted average of the ages of the other customers most “like” the one needing imputation.
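The proximity-weighted imputation can be sketched as follows (a stdlib Python toy; the proximity values below are assumed for illustration, not computed from a real forest):

```python
def impute_weighted(missing_idx, values, proximities):
    """Impute a missing value as the proximity-weighted average of the
    observed values in all other records.  values[i] is None where
    missing; proximities[i] is the closeness of record i to the record
    being imputed."""
    num = den = 0.0
    for i, v in enumerate(values):
        if i == missing_idx or v is None:
            continue  # skip the target record and other missing values
        num += proximities[i] * v
        den += proximities[i]
    return num / den

# Record 0 is missing AGE; records 1-3 have ages 30, 40, 50.
ages = [None, 30.0, 40.0, 50.0]
prox = [1.0, 0.6, 0.3, 0.1]   # record 1 is most similar to record 0
print(impute_weighted(0, ages, prox))  # pulled toward the closest record
```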

Variable Importance 

  • Random Forests include an innovative method to measure the relative importance of any predictor.
  • The method is based on measuring the damage that would be done to our predictive models if we lost access to the true values of a given variable.
  • To simulate losing access to a predictor, we randomly scramble its values in the data, i.e., we move the value belonging to a specific row of data to another row.
  • We scramble just one predictor at a time and measure the consequent loss in predictive accuracy.

Variable Importance Details

  • If we scrambled the values of a variable just once and then measured the damage done to predictive performance, we would be reliant on a single randomization.
  • In Random Forests we re-scramble the data anew for the predictor being tested in every tree in the forest.
  • We therefore free ourselves from dependence on the luck of a single draw.
  • If we re-scramble a predictor 500 times across 500 trees, the results should be highly reliable.
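The scrambling test can be sketched as follows (stdlib Python; the one-line “model” is a stand-in for a fitted forest, and the repeated shuffling mirrors the per-tree re-scrambling described above):

```python
import random

def permutation_importance(model, X, y, col, n_repeats=100, seed=0):
    """Average accuracy drop when column `col` is shuffled."""
    rng = random.Random(seed)
    def accuracy(rows):
        return sum(model(r) == t for r, t in zip(rows, y)) / len(y)
    base = accuracy(X)
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[col] for row in X]
        rng.shuffle(shuffled)  # move each value to another row
        X_perm = [row[:col] + [v] + row[col + 1:]
                  for row, v in zip(X, shuffled)]
        drops.append(base - accuracy(X_perm))
    return sum(drops) / len(drops)

# Toy model that only looks at column 0: scrambling column 0 hurts,
# scrambling column 1 does not.
X = [[0, 9], [1, 9], [0, 9], [1, 9]] * 25
y = [row[0] for row in X]
model = lambda row: row[0]
print(permutation_importance(model, X, y, col=0))  # large drop
print(permutation_importance(model, X, y, col=1))  # no drop at all
```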

Variable Importance Issues

  • If our data includes several alternative measures of the same concept, then scrambling just one of them at a time might result in very little damage to model performance.
  • For example, if we have several credit risk scores, we might be fooled into thinking any single one of them is unimportant.
  • Repeating the scrambling test separately for each credit score could yield the conclusion that each, considered separately, is unimportant.
  • It may thus be important to eliminate this kind of redundancy among the predictors before putting too much faith in the importance ranking.

Variable Importance: Final Observation

  • The data-scrambling approach to measuring variable importance is based on the impact on model performance of losing access to information.
  • But a variable is not necessarily unimportant just because we can do well without it.
  • We need to be aware that a predictor, if available, will be used by the model, but if it is not available then substitute variables can be used instead.
  • The “Gini” measure offers an alternative importance assessment based on the actual role the predictor plays in the trees.

Bootstrap Re-sampling

  • In our discussion so far we have suggested that the sampling technique underpinning Random Forest is the drawing of a random 50% of the available data for each tree
  • This style of sampling is very easy to understand and is a reasonable way to develop a random forest
  • In fact, classic Random Forests uses bootstrap sampling: drawing N times with replacement from the N available records, which leaves roughly one third of the data out-of-bag for each tree. Bootstrap sampling and random half sampling are similar enough that we do not need to delve into the details here.

The Technical Algorithm

  • Let the number of training cases be N, and the number of variables in the classifier be M.
  • We choose a number m of variables to be used to determine the decision at each node of the tree; m should be less than (and often much less than) M.
  • Choose a training set for this tree by sampling N times with replacement from all N available training cases (i.e., take a bootstrap sample). Use the remaining cases to estimate the error of the tree by predicting their classes (the OOB data).
  • For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set.
  • Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier).
  • To predict, a new sample is pushed down each tree. It is assigned the label of the terminal node it ends up in.

This procedure is iterated over all trees in the ensemble, and the majority vote of all the trees is reported as the random forest prediction.
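The bulleted procedure above can be sketched end-to-end (stdlib Python only; to keep the sketch short, each “tree” here is a depth-1 stump rather than a fully grown tree, so this illustrates the bootstrap, the random feature subset, and the voting, not a production forest):

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 = pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def best_stump(X, y, feature_subset):
    """Exhaustively search (feature, threshold) pairs, restricted to the
    randomly chosen feature subset; keep the lowest weighted Gini."""
    best = None
    for f in feature_subset:
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue  # degenerate split
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, majority(left), majority(right))
    return best

def grow_forest(X, y, n_trees=50, mtry=1, seed=0):
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # N draws with replacement
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        feats = rng.sample(range(m), mtry)          # random predictor subset
        stump = best_stump(Xb, yb, feats)
        if stump is not None:
            forest.append(stump)
    return forest

def predict(forest, row):
    votes = Counter()
    for _, f, t, left_label, right_label in forest:
        votes[left_label if row[f] <= t else right_label] += 1
    return votes.most_common(1)[0][0]  # mode vote across trees

# Toy data: feature 0 determines the class, feature 1 is noise.
X = [[i % 2, (i // 2) % 2] for i in range(20)]
y = [row[0] for row in X]
forest = grow_forest(X, y, n_trees=25, mtry=2, seed=1)
print(predict(forest, [0, 1]), predict(forest, [1, 0]))
```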













Decision Tree Algorithm

Tree-based learning algorithms are considered to be among the best and most widely used supervised learning methods (surely debatable). Tree-based methods help build predictive models with high accuracy and stability, and they map non-linear relationships well. They are adept at solving both classification and regression problems.

Methods such as decision trees, random forests, and gradient boosting are popularly used in all kinds of data science problems. Random forest in particular is the algorithm which I intend to describe in detail. PS: if you already know this algorithm, maybe you should check out the Kaggle Titanic problem. Before we get into the details I will be covering this algorithm from the ground up. The complete algorithm will be divided into 4 posts.

  • Decision Trees
  • Random Forest Basic Overview
  • Mathematical interpretation of Random Forest – Leo Breiman’s original paper
  • Benchmarking Random forest

Well, this post is more like an introduction to what should be expected in the upcoming weeks. Please feel free to add comments and suggestions. The tone of the blog isn’t that of a professional blogger, since I want the posts to sound more like an exchange of thoughts and notes about these algorithms.

Decision Trees

Decision trees are produced by algorithms that identify various ways of splitting a data set into branch-like segments. They are supervised learning algorithms mostly used in classification problems.

Let’s say we have a sample of 50 students with three variables: Gender (Boy/Girl), Class (IX/X) and Height (5 to 6 ft). 35 of these 50 play cricket in their leisure time. Now, I want to create a model to predict who will play cricket during the leisure period. In this problem, we need to segregate the students who play cricket in their leisure time based on the most significant input variable among the three.

This is where a decision tree helps: it segregates the students based on all values of the three variables and identifies the variable that creates the best homogeneous sets of students. If we do the basic math, we will see that the variable Gender identifies the best sets compared to the other two variables.


Important terminology associated with decision trees:

  1. Root Node: It represents the entire population or sample, and this further gets divided into two or more homogeneous sets.
  2. Splitting: The process of dividing a node into two or more sub-nodes.
  3. Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
  4. Leaf/Terminal Node: Nodes that do not split are called leaf or terminal nodes.
  5. Pruning: When we remove sub-nodes of a decision node, the process is called pruning. You can think of it as the opposite of splitting.
  6. Parent and Child Node: A node which is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
  7. Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.

Decision Trees with package party

I would now like to explain the implementation of a decision tree using one of the most common data sets in R, the “iris” data. The aim is to predict the species of a flower using attributes such as petal length, width, etc. In the package, the function ctree() builds a decision tree, and predict() makes predictions for new data. Before modeling, the data is split below into two subsets: training (70%) and test (30%). If this concept is not clear, maybe you can check out the Stanford course on machine learning.

Please note that the random seed is set to a fixed value below to make the results reproducible.

We then load the package party, build a decision tree, and check the prediction result. The function ctree() provides some parameters, such as MinSplit, MinBucket, MaxSurrogate and MaxDepth, to control the training of decision trees. Currently we will use default settings to build the decision tree. In the code below, myFormula specifies that Species is the target variable and all other variables are independent variables.
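The reproducible 70/30 split described above can be sketched in Python as well (stdlib only; in R this role is played by set.seed() and sample(), and the 150 integers below stand in for the 150 iris records):

```python
import random

def train_test_split(rows, train_frac=0.7, seed=1234):
    """Shuffle reproducibly, then cut into training and test subsets."""
    rng = random.Random(seed)
    shuffled = rows[:]          # copy so the original order is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

data = list(range(150))         # stand-in for the 150 iris records
train, test = train_test_split(data)
print(len(train), len(test))    # 105 45
```

Fixing the seed is what makes the split, and therefore the fitted tree, reproducible from run to run.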


In the former plot, the barplot for each leaf node shows the probabilities of an instance falling into each of the three species. In the latter plot, they are shown as “y” in the leaf nodes. For example, node 2 is labeled “n=40, y=(1, 0, 0)”, which means that it contains 40 training instances, all of which belong to the first class “setosa”. After that, the built tree needs to be tested with the test data.

Issues with ctree()

  • First, the current version of ctree() (i.e. version 0.9-9995) does not handle missing values well: an instance with a missing value may sometimes go to the left sub-tree and sometimes to the right.
  • When a variable exists in the training data and is fed into ctree() but does not appear in the built decision tree, the test data must also have that variable to make predictions. Otherwise, a call to predict() would fail.
  • If the value levels of a categorical variable in the test data are different from those in the training data, prediction on the test data will also fail.

If you feel I should include a program or code snippet explaining the issues with ctree() and how we might fix them, drop me a message and I will do so.

Split Conditions in Trees

In most cases, the discrete splitting functions are univariate, meaning that an internal node is split according to the value of a single attribute. Consequently, the inducer searches for the best attribute upon which to split. There are various univariate criteria, which can be characterized in different ways: according to the origin of the measure (information theory, dependence, distance) or according to the measure’s structure (impurity-based criteria, normalized impurity-based criteria, and binary criteria).

Impurity-Based Criteria


Gini Index/ Gini Impurity 

Given a data table that contains attributes and the class of each record, we measure homogeneity on the basis of classes. We say a table/data set is pure or homogeneous if it contains only a single class. The Gini index helps us measure that.

The Gini index basically says that if we select two items from a population at random, they must be of the same class, and the probability of this is 1 if the population is pure.

Please note that Gini Impurity/Gini Index is not the same as the Gini coefficient (or at least I am of the opinion it is not). A detailed explanation is beautifully given on the Data Science Stack Exchange.

Steps to Calculate Gini for a split

  1. Calculate the Gini for each sub-node using the formula: the sum of squares of the probabilities of success and failure (p^2 + q^2), where p = probability of success and q = probability of failure
  2. Calculate Gini for split using weighted Gini score of each node of that split
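Applying the two steps above to the earlier 50-student example (the per-gender counts below are assumed for illustration; the text only states that 35 of 50 play cricket):

```python
def gini_node(p):
    """Node score per the formula above: p^2 + q^2, with q = 1 - p.
    A score of 1.0 means the node is pure."""
    q = 1.0 - p
    return p * p + q * q

def gini_split(groups):
    """Weighted Gini of a split; groups = [(n_in_group, p_success), ...]."""
    total = sum(n for n, _ in groups)
    return sum(n / total * gini_node(p) for n, p in groups)

# Hypothetical gender split of the 50 students (35 play overall):
# 30 boys of whom 25 play, 20 girls of whom 10 play.
boys = (30, 25 / 30)
girls = (20, 10 / 20)
print(round(gini_split([boys, girls]), 3))  # higher = more homogeneous split
```

Comparing this weighted score across candidate splits (Gender, Class, Height) is how the tree picks the best splitter.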

Information Gain

Information gain is an impurity-based criterion that uses the entropy measure (which originates in information theory) as the impurity measure (Quinlan, 1987). Entropy is nothing but the disorganization in a system. Putting the two together, information gain measures how much a split reduces the disorganization/entropy in the system.

Entropy can be calculated using the formula:

Entropy = -p*log2(p) - q*log2(q)

where p is the probability of success and q is the probability of failure.
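The entropy formula and the resulting information gain of a split can be sketched in Python (stdlib only; the student counts reuse the earlier 50-student example, with per-gender numbers assumed for illustration):

```python
import math

def entropy(p):
    """Entropy = -p*log2(p) - q*log2(q), with q = 1 - p; 0 when pure."""
    q = 1.0 - p
    terms = [x * math.log2(x) for x in (p, q) if x > 0]
    return -sum(terms)

def information_gain(parent_p, groups):
    """Gain = parent entropy minus the weighted child entropy.
    groups = [(n_in_group, p_success), ...]."""
    total = sum(n for n, _ in groups)
    child = sum(n / total * entropy(p) for n, p in groups)
    return entropy(parent_p) - child

# 35 of 50 students play overall; assume boys 25/30 and girls 10/20 play.
gain = information_gain(35 / 50, [(30, 25 / 30), (20, 10 / 20)])
print(round(gain, 3))  # a positive gain: the split reduces entropy
```

The split with the largest gain is preferred, mirroring the Gini-based choice above.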

A detailed diagram showing the representation of Machine Learning Algorithms.