What is out of bag error in Random Forests?

Language Agnostic · Machine Learning · Classification · Random Forest

Language Agnostic Problem Overview


What is out of bag error in Random Forests? Is it the optimal parameter for finding the right number of trees in a Random Forest?

Language Agnostic Solutions


Solution 1 - Language Agnostic

Let me attempt to explain:

Suppose our training data set is represented by T, and suppose the data set has M features (or attributes or variables).

T = {(X1,y1), (X2,y2), ... (Xn, yn)}

and

Xi is an input vector {xi1, xi2, ..., xiM}

yi is the label (or output or class). 

Summary of RF:

The Random Forests algorithm is a classifier based primarily on two methods:

  • Bagging
  • Random subspace method.

Suppose we decide to have S trees in our forest. We first create S datasets of the "same size as the original", produced by randomly resampling the data in T with replacement (n draws for each dataset). This results in the datasets {T1, T2, ..., TS}, each of which is called a bootstrap dataset. Because of the "with replacement" sampling, a dataset Ti can contain duplicate data records and can be missing several data records from the original dataset. This is called bootstrapping. (en.wikipedia.org/wiki/Bootstrapping_(statistics))
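As a concrete illustration, here is a minimal sketch of bootstrap resampling in Python (the helper name `make_bootstraps` is my own, not from the answer; it also keeps the sampled row indices, which become useful for the out-of-bag computation later):

```python
import numpy as np

def make_bootstraps(X, y, S, seed=0):
    """Draw S bootstrap datasets, each the same size n as the original, with replacement."""
    rng = np.random.default_rng(seed)
    n = len(X)
    bootstraps = []
    for _ in range(S):
        idx = rng.integers(0, n, size=n)        # sample n row indices with replacement
        bootstraps.append((X[idx], y[idx], idx))  # keep idx so we can find OOB rows later
    return bootstraps
```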

Bagging is the process of taking bootstrap samples and then aggregating the models learned on each of them.

Now, RF creates S trees and uses m (= sqrt(M) or = floor(ln M + 1)) random sub-features out of the M possible features to grow each tree. This is called the random subspace method.
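Purely as illustrative arithmetic for the two common choices of m mentioned above (the value M = 100 is made up):

```python
import math

M = 100                                  # total number of features (illustrative)
m_sqrt = int(math.sqrt(M))               # common default for classification: sqrt(M)
m_log = math.floor(math.log(M) + 1)      # the floor(ln M + 1) rule mentioned above
print(m_sqrt, m_log)                     # 10, 5
```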

So for each bootstrap dataset Ti you create a tree Ki. If you want to classify some input data D = {x1, x2, ..., xM}, you let it pass through each tree, producing S outputs (one per tree), which can be denoted by Y = {y1, y2, ..., yS}. The final prediction is a majority vote on this set.
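Here is a minimal sketch of that whole loop, using scikit-learn's DecisionTreeClassifier for the individual trees (max_features="sqrt" gives per-split random feature selection, which approximates the random subspace idea described above; the wrapper functions are my own toy code, not Breiman's reference implementation, and integer class labels are assumed):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, S=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(S):
        idx = rng.integers(0, n, size=n)                    # bootstrap dataset Ti
        tree = DecisionTreeClassifier(max_features="sqrt")  # random subspace: sqrt(M) candidates per split
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_majority(trees, D):
    # S outputs per input row (assumes integer class labels, so bincount works)
    votes = np.array([t.predict(D) for t in trees]).astype(int)
    # majority vote across the S trees, column by column
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```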

Out-of-bag error:

After creating the classifiers (S trees), for each (Xi, yi) in the original training set T, select all the Tk that do not include (Xi, yi). Note that this subset is a set of bootstrap datasets that do not contain a particular record from the original dataset. This set is called the out-of-bag examples. There are n such subsets (one for each data record in the original dataset T). The OOB classifier is the aggregation of votes ONLY over those trees whose bootstrap Tk does not contain (Xi, yi).

The out-of-bag estimate for the generalization error is the error rate of the out-of-bag classifier on the training set (compare its predictions against the known yi's).
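A from-scratch sketch of that estimate, under the same assumptions as above (my own helper name, integer class labels 0..K-1): each tree votes only on the rows that were left out of its bootstrap, and the error is measured on the aggregated OOB votes.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_error(X, y, S=100, seed=0):
    rng = np.random.default_rng(seed)
    n, K = len(X), int(y.max()) + 1
    votes = np.zeros((n, K))                     # OOB vote tally per training row
    for _ in range(S):
        idx = rng.integers(0, n, size=n)         # bootstrap sample Ti
        oob = np.setdiff1d(np.arange(n), idx)    # rows NOT in Ti: out-of-bag for this tree
        tree = DecisionTreeClassifier(max_features="sqrt").fit(X[idx], y[idx])
        if len(oob):
            pred = tree.predict(X[oob]).astype(int)
            votes[oob, pred] += 1                # this tree votes only on its OOB rows
    seen = votes.sum(axis=1) > 0                 # rows that were OOB for at least one tree
    oob_pred = votes[seen].argmax(axis=1)        # majority vote of the OOB classifier
    return np.mean(oob_pred != y[seen])          # error rate vs. the known yi's
```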

Why is it important?

> The study of error estimates for bagged classifiers in Breiman [1996b] gives empirical evidence to show that the out-of-bag estimate is as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate removes the need for a set-aside test set.

(Thanks to @Rudolf for corrections; see his comments on the original answer.)

Solution 2 - Language Agnostic

In Breiman's original implementation of the random forest algorithm, each tree is trained on about two-thirds of the total training data (with sampling with replacement, each example is left out of a given bootstrap with probability (1 - 1/n)^n ≈ 1/e ≈ 36.8%). As the forest is built, each tree can thus be tested (similar to leave-one-out cross-validation) on the samples not used in building that tree. This is the out-of-bag error estimate: an internal error estimate of a random forest as it is being constructed.
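In practice you rarely compute this by hand. For example, scikit-learn exposes the internal estimate via its oob_score option; a minimal usage sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
clf.fit(X, y)
print("OOB accuracy:", clf.oob_score_)   # OOB error estimate = 1 - clf.oob_score_
```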

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | csalive | View Question on Stackoverflow
Solution 1 - Language Agnostic | Manoj Awasthi | View Answer on Stackoverflow
Solution 2 - Language Agnostic | eagle34 | View Answer on Stackoverflow