Ensemble learning: when many are better than the one¶
In this notebook, we will go in depth into algorithms which combine several simple learners (e.g. decision tree, linear model, etc.). We will see that combining simple learners will result in a more powerful and robust learner. We will focus on two families of ensemble methods:
ensemble using bootstrap (e.g. bagging and random-forest);
ensemble using boosting (e.g. adaptive boosting and gradient-boosting decision tree).
Benefits of ensemble methods at a glance¶
In this section, we will give a quick demonstration of the power of combining several learners instead of fine-tuning a single learner.
We will start by loading the “California Housing” dataset.
In this dataset, we want to predict the median house value of districts in California based on demographic and geographic data.
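As a minimal sketch, the dataset can be loaded with scikit-learn as follows; the variable names `data` and `target` are our own choice and are reused in the snippets below:

```python
from sklearn.datasets import fetch_california_housing

# Load the data as pandas objects; the target is the median house value
# per district, expressed in hundreds of thousands of dollars.
data, target = fetch_california_housing(as_frame=True, return_X_y=True)
```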
We start by training a single decision tree regressor. As we previously presented in the “tree in depth” notebook, this learner needs to be tuned to overcome over- or under-fitting. Indeed, the default parameters will not necessarily lead to an optimal decision tree. Instead of using the default values, we should search via cross-validation for the optimal values of the important parameters such as `max_depth`, `min_samples_split`, or `min_samples_leaf`.
We recall that we need to tune these parameters, as decision trees tend to overfit the training data when grown deep, yet there is no general rule for how to set these parameters. Thus, skipping this search could leave us with a poorly fitted (under- or over-fitted) model.
First, let’s keep a set of data to test our final model.
We will first make a grid-search to fine-tune the parameters that we mentioned earlier.
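A sketch of both steps, the hold-out split and the grid-search; the exact parameter grid is an illustrative assumption, not the only reasonable choice:

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Keep a test set aside for the final evaluation.
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0
)

# Illustrative grid over the three parameters discussed above.
param_grid = {
    "max_depth": [5, 8, None],
    "min_samples_split": [2, 10, 30, 50],
    "min_samples_leaf": [1, 10, 30, 50],
}
tree = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid=param_grid)
tree.fit(data_train, target_train)

# Store the cross-validation results in a dataframe for inspection.
cv_results = pd.DataFrame(tree.cv_results_)
```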
We can create a dataframe storing the important information collected during the tuning of the parameters and investigate the results.
From these results, we can see that the best combination of parameters leaves the depth of the tree unlimited and keeps the minimum number of samples to create a leaf at 1 (the default values), while setting the minimum number of samples to make a split to 50 (much higher than the default value of 2).
It is interesting to look at the total amount of time it took to fit all these different models. In addition, we can check the performance of the optimal decision tree on the left-out testing data.
Hence, we have a model with an \(R^2\) score below 0.7. The amount of time needed to find the best learner depends on the number of folds used during the cross-validation in the grid-search multiplied by the number of parameter combinations. Therefore, the computational cost is quite high.
Now we will use an ensemble method called bagging. More details about this method will be discussed in the next section. In short, this method takes a base regressor (here, a decision tree regressor) and trains several instances of it on slightly modified versions of the training set. Then, the predictions of all these base regressors are combined by averaging.
Here, we will use 50 decision trees and check the fitting time as well as the performance on the left-out testing data. It is important to note that we are not going to tune any parameter of the decision tree.
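A sketch of this experiment with `BaggingRegressor`; note that recent scikit-learn releases name the base-model parameter `estimator` rather than `base_estimator`:

```python
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# 50 untuned decision trees, each trained on a bootstrap sample of the training set.
bagged_trees = BaggingRegressor(
    base_estimator=DecisionTreeRegressor(random_state=0),
    n_estimators=50,
    random_state=0,
)
bagged_trees.fit(data_train, target_train)
print(f"R2 score: {bagged_trees.score(data_test, target_test):.2f}")
```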
We can see that the computation time is much shorter for training the full ensemble than for the parameter search of a single tree. In addition, the score is significantly improved with an \(R^2\) close to 0.8. Furthermore, note that this result is obtained before any parameter tuning. This shows the motivation behind the use of an ensemble learner: it gives a relatively good baseline with decent performance without any parameter tuning.
Now, we will discuss in detail two ensemble families: bagging and boosting.
Bagging¶
Bagging stands for Bootstrap AGGregatING. It uses bootstrap samples to learn several models. At predict time, the predictions of each learner are aggregated to give the final predictions.
Let’s define a simple dataset (which we have used before in a previous notebook).
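As a sketch, such a dataset could be generated as follows; the exact functional form and noise level are illustrative assumptions, any noisy non-linear one-dimensional relationship would do:

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, size=(100, 1))  # single feature, kept 2D for scikit-learn
# Noisy cubic relationship between the feature and the target (illustrative choice).
y = x.ravel() ** 3 - 0.5 * x.ravel() ** 2 + rng.normal(scale=2, size=x.shape[0])
```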
The link between our feature and the target to predict is non-linear. However, a decision tree is capable of fitting such data.
Let’s see how we can use bootstrapping to learn several trees.
Bootstrap sample¶
A bootstrap sample corresponds to a resampling of the original dataset with replacement, producing a sample of the same size as the original dataset. Thus, the bootstrap sample will contain some data points several times while some of the original data points will not be present at all.
We will create a function that, given `x` and `y`, will return a bootstrap sample `x_bootstrap` and `y_bootstrap`.
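A possible implementation, assuming `x` and `y` are NumPy arrays; the random seed is an arbitrary choice:

```python
import numpy as np

rng = np.random.RandomState(0)

def bootstrap_sample(x, y):
    # Draw as many indices as there are samples, with replacement.
    bootstrap_indices = rng.choice(
        np.arange(y.shape[0]), size=y.shape[0], replace=True
    )
    x_bootstrap = x[bootstrap_indices]
    y_bootstrap = y[bootstrap_indices]
    return x_bootstrap, y_bootstrap
```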
We will generate 3 bootstrap samples and qualitatively check the difference with the original dataset.
We observe that the 3 generated bootstrap samples are all different. To confirm this intuition, we can check the number of unique samples in the bootstrap samples.
On average, 63.2% of the original data points will be present in a given bootstrap sample (the other original points are left out); the remaining entries of the bootstrap sample are repetitions of those selected points.
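This figure follows from the probability that a given point is never drawn in n draws with replacement, \((1 - 1/n)^n\), which tends to \(e^{-1} \approx 0.368\) as n grows. A quick numerical check (the value of n is arbitrary):

```python
import numpy as np

n = 10_000  # arbitrary dataset size for the check
proba_never_drawn = (1 - 1 / n) ** n
print(proba_never_drawn, np.exp(-1))  # both close to 0.368, i.e. ~63.2% are drawn
```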
So, we are able to generate many datasets, all slightly different. Now, we can fit a decision tree to each of these datasets and each decision tree shall be slightly different as well.
We can plot these decision functions on the same plot to see the difference.
Aggregating¶
Once our trees are fitted and we are able to get predictions for each of them, we also need to combine them. In regression, the most straightforward approach is to average the different predictions from all learners. We can plot the averaged predictions from the previous example.
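A minimal sketch of this aggregation step, assuming `forest` is the list of trees fitted on the bootstrap samples and `grid` is the array of points used for plotting (both names are placeholders):

```python
import numpy as np

# Predict with every tree of the 'bag' and average the results.
tree_predictions = [tree.predict(grid) for tree in forest]
bag_predictions = np.mean(tree_predictions, axis=0)
```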
The unbroken red line shows the averaged predictions, which would be the final predictions given by our ‘bag’ of decision tree regressors.
Random forest¶
A popular machine-learning algorithm is the random forest. A random forest is a modification of the bagging algorithm. In bagging, any classifier or regressor can be used as the base model. In a random forest, the base model must be a decision tree. In our previous example, we used a decision tree, but we could have used a linear model as the regressor in our bagging algorithm.
In addition, random forest is different from bagging when used with classifiers: when searching for the best split, only a subset of the original features is used. By default, the size of this subset is the square root of the total number of features. In regression, all available features are used.
We will illustrate the usage of a random forest and compare it with the bagging regressor on the “California housing” dataset.
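A sketch of this comparison; using 100 trees for each model is an arbitrary choice, and, as before, `base_estimator` is named `estimator` in recent scikit-learn releases:

```python
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

random_forest = RandomForestRegressor(n_estimators=100, random_state=0, n_jobs=-1)
random_forest.fit(data_train, target_train)

bagging = BaggingRegressor(
    base_estimator=DecisionTreeRegressor(random_state=0),
    n_estimators=100,
    random_state=0,
    n_jobs=-1,
)
bagging.fit(data_train, target_train)

print(f"Random forest R2: {random_forest.score(data_test, target_test):.2f}")
print(f"Bagging R2:       {bagging.score(data_test, target_test):.2f}")
```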
Notice that we don’t need to provide a `base_estimator` parameter to `RandomForestRegressor`: the base model is always a decision tree. Also note that the scores are almost identical. This is because our problem is a regression problem and, therefore, the number of features used in the random forest and in bagging is the same.
For classification problems, we would need to pass a tree model instance with the parameter `max_features="sqrt"` to `BaggingClassifier` if we wanted it to have the same behaviour as the random forest classifier.
Classifier details¶
Up to now, we have only focused on regression problems. There are a few differences between regression and classification.
First, the `base_estimator` should be chosen in line with the problem being solved: use a classifier for a classification problem and a regressor for a regression problem.
Then, the aggregation method differs between regression and classification: the average of the predictions is computed in regression, while the majority class (weighted by the predicted probabilities) is predicted in classification.
Summary¶
We saw in this section two algorithms that use bootstrap samples to create an ensemble of classifiers or regressors. These algorithms train several learners on different bootstrap samples. The predictions are then aggregated. This operation can be done in a very efficient manner since the training of each learner can be done in parallel.
Boosting¶
We recall that bagging builds an ensemble in a parallel manner: each learner is trained independently from the others. The idea behind boosting is different: the ensemble is a sequence of learners, where the N-th learner requires all previous learners, from 1 to N-1.
Intuitively, boosting adds learners to the ensemble to correct the mistakes of the previous learners. We will start with an algorithm named Adaptive Boosting (AdaBoost) to get some intuition about the main ideas behind boosting.
Adaptive Boosting (AdaBoost)¶
We will first focus on AdaBoost, which we will use for a classification problem. We will load the “penguin” dataset used in the “tree in depth” notebook. We will predict penguin species from the features culmen length and depth.
In addition, we reuse the function from the previous “tree in depth” notebook to plot the decision function of the tree.
We will purposely train a shallow decision tree. Since the tree is shallow, it is unlikely to overfit; in fact, it will even misclassify some of the training samples.
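A sketch of such a shallow tree, assuming `X` holds the two culmen features and `y` the species labels; `max_depth=2` is an arbitrary small depth:

```python
from sklearn.tree import DecisionTreeClassifier

# A deliberately shallow tree that cannot separate the classes perfectly.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)
```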
We observe that several samples have been misclassified by the classifier.
We mentioned that boosting relies on creating a new classifier which tries to correct these misclassifications. In scikit-learn, learners support a parameter `sample_weight` which forces the learner to pay more attention, during training, to samples with higher weights. This parameter is set when calling `classifier.fit(X, y, sample_weight=weights)`.
We will use this trick to create a new classifier by ‘discarding’ all correctly classified samples and only considering the misclassified ones. Thus, misclassified samples will be assigned a weight of 1 while well-classified samples will be assigned a weight of 0.
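A sketch of this reweighting scheme, continuing from the shallow tree above (the variable names are our own, chosen to match the snippet further below):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Indices of the samples misclassified by the first tree.
misclassified_samples_idx = np.flatnonzero(y != tree.predict(X))

# Weight of 1 for misclassified samples, 0 for correctly classified ones.
sample_weight = np.zeros(y.shape[0])
sample_weight[misclassified_samples_idx] = 1.0

tree_weighted = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_weighted.fit(X, y, sample_weight=sample_weight)

# Samples misclassified by this second tree, used to weight the learners below.
newly_misclassified_samples_idx = np.flatnonzero(y != tree_weighted.predict(X))
```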
We see that the decision function drastically changed. Qualitatively, we see that the previously misclassified samples are now correctly classified.
However, we are making mistakes on previously well classified samples. Thus, we get the intuition that we should weight the predictions of each classifier differently, most probably by using the number of mistakes each classifier is making.
So we could use the classification error to combine both trees.
```python
ensemble_weight = [
    (y.shape[0] - len(misclassified_samples_idx)) / y.shape[0],
    (y.shape[0] - len(newly_misclassified_samples_idx)) / y.shape[0],
]
ensemble_weight
```
The first classifier was 94% accurate and the second one 69% accurate. Therefore, when predicting a class, we should trust the first classifier slightly more than the second one. We could use these accuracy values to weight the predictions of each learner.
To summarize, boosting learns several classifiers, each of which will focus more or less on specific samples of the dataset. Boosting is thus different from bagging: here we never resample our dataset, we just assign different weights to the original dataset.
Boosting requires some strategy to combine the learners together:
one needs to define a way to compute the weights to be assigned to samples;
one needs to assign a weight to each learner when making predictions.
Indeed, we defined a really simple scheme to assign sample weights and learner weights. However, there is a statistical theory for how these sample and learner weights can be optimally calculated; see, for instance, Hastie, Tibshirani, and Friedman, “The Elements of Statistical Learning”, Chapter 10.
We will use the AdaBoost classifier implemented in scikit-learn and look at the underlying decision tree classifiers trained.
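A sketch mirroring the manual example above with three shallow trees; the depth is an illustrative choice, and `base_estimator` is named `estimator` in recent scikit-learn releases:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

adaboost = AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=3, random_state=0),
    n_estimators=3,
    algorithm="SAMME",
    random_state=0,
)
adaboost.fit(X, y)

# Weight and training error associated with each of the three learners.
print("Learner weights:", adaboost.estimator_weights_)
print("Learner errors: ", adaboost.estimator_errors_)
```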
We see that AdaBoost has learnt three different classifiers each of which focuses on different samples. Looking at the weights of each learner, we see that the ensemble gives the highest weight to the first classifier. This indeed makes sense when we look at the errors of each classifier. The first classifier also has the highest classification performance.
While AdaBoost is a nice algorithm to demonstrate the internal machinery of boosting algorithms, it is not the most efficient machine-learning algorithm. The most efficient algorithm based on boosting is the gradient-boosting decision tree (GBDT) algorithm, which we will discuss now.
Gradient-boosting decision tree (GBDT)¶
Gradient-boosting differs from AdaBoost in the following way: instead of assigning weights to specific samples, GBDT will fit a decision tree on the residuals (hence the name “gradient”) of the previous tree. Therefore, each new tree added to the ensemble predicts the error made by the previous learner instead of predicting the target directly.
In this section, we will provide some intuition about the way learners are combined to give the final prediction. In this regard, let’s go back to our regression problem which is more intuitive for demonstrating the underlying machinery.
As we previously discussed, boosting will be based on assembling a sequence of learners. We will start by creating a decision tree regressor. We will fix the depth of the tree so that the resulting learner will underfit the data.
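For instance, reusing the simple non-linear dataset (`x`, `y`) from the bagging section and taking `max_depth=3` as an arbitrary shallow depth:

```python
from sklearn.tree import DecisionTreeRegressor

# A shallow tree that underfits the data.
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(x, y)

# Residuals: the part of the target the first tree fails to predict.
residuals = y - tree.predict(x)
```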
Since the tree underfits the data, its accuracy is far from perfect on the training data. We can observe this in the figure by looking at the difference between the predictions and the ground-truth data. We represent these errors, called “Residuals”, by unbroken red lines.
Indeed, our initial tree was not expressive enough to handle the complexity of the data, as shown by the residuals. In a gradient-boosting algorithm, the idea is to create a second tree which, given the same data `x`, will try to predict the residuals instead of the target `y`. We would therefore have a tree that is able to predict the errors made by the initial tree.
Let’s train such a tree.
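A sketch of this second tree, fitted on the residuals computed above; its depth is again an arbitrary choice:

```python
from sklearn.tree import DecisionTreeRegressor

# Fit a second tree on the residuals left by the first one.
tree_residuals = DecisionTreeRegressor(max_depth=5, random_state=0)
tree_residuals.fit(x, residuals)
```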
We see that this new tree only manages to fit some of the residuals.
We will focus on the last sample in `x` and explain how the predictions of both trees are combined.
For our sample of interest, our initial tree is making an error (small residual). When fitting the second tree, the residual in this case is perfectly fitted and predicted. We will quantitatively check this prediction using the fitted tree. First, let’s check the prediction of the initial tree and compare it with the true value.
As we visually observed, we have a small error. Now, we can use the second tree to try to predict this residual.
We see that our second tree is capable of predicting the exact residual (error) of our first tree. Therefore, we can predict the value of `x` by summing the predictions of all the trees in the ensemble.
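A sketch of this combination for the sample of interest, assuming `x` is a NumPy array so that `x[-1:]` keeps the 2D shape scikit-learn expects:

```python
x_sample = x[-1:]  # the last sample, kept 2D
true_value = y[-1]

first_prediction = tree.predict(x_sample)               # prediction of the initial tree
residual_prediction = tree_residuals.predict(x_sample)  # predicted residual

# The ensemble prediction is the sum of the two trees' outputs.
print(true_value, first_prediction + residual_prediction)
```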
We chose a sample for which only two trees were enough to make the perfect prediction. However, we saw in the previous plot that two trees were not enough to correct the residuals of all samples. Therefore, one needs to add several trees to the ensemble to successfully correct the error.
We will compare the performance of random-forest and gradient boosting on the California housing dataset.
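A sketch of this comparison using cross-validation; 200 trees per model is an arbitrary choice:

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_validate

gradient_boosting = GradientBoostingRegressor(n_estimators=200)
forest = RandomForestRegressor(n_estimators=200, n_jobs=-1)

cv_results_gbdt = cross_validate(gradient_boosting, data, target, n_jobs=-1)
cv_results_forest = cross_validate(forest, data, target, n_jobs=-1)

print("GBDT mean R2:  ", cv_results_gbdt["test_score"].mean())
print("Forest mean R2:", cv_results_forest["test_score"].mean())
```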
In terms of computation performance, the forest can be parallelized and will benefit from having multiple CPUs. In terms of scoring performance, both algorithms lead to very close results.
Parameter considerations with random forest and gradient-boosting¶
In the previous section, we did not discuss the parameters of random forest and gradient-boosting. However, there are a couple of things to keep in mind when setting these parameters.
Random forest¶
The main parameter to tune with random forest is the `n_estimators` parameter. In general, the more trees in the forest, the better the performance will be. However, it will slow down the fitting and prediction time. So one has to balance compute time and performance when setting the number of estimators, in particular when putting such a learner into production.
The `max_depth` parameter could also be tuned. Sometimes, there is no need to have fully grown trees. However, be aware that with random forest, trees are generally deep since we are seeking to overfit each learner on its bootstrap sample; this overfitting will be mitigated by combining the trees. Assembling underfitted trees (i.e. shallow trees) might also lead to an underfitted forest.
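A sketch of such a grid-search on the California housing training set; the grid values below are illustrative assumptions:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [10, 20, 30],
    "max_depth": [3, 5, None],
}
grid_search = GridSearchCV(
    RandomForestRegressor(n_jobs=-1), param_grid=param_grid, n_jobs=-1
)
grid_search.fit(data_train, target_train)

# Rank the candidates by their mean cross-validated score.
pd.DataFrame(grid_search.cv_results_).sort_values("mean_test_score", ascending=False)
```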
We can observe that in our grid-search, the largest `max_depth` together with the largest `n_estimators` led to the best performance.
Gradient-boosting decision tree¶
For gradient-boosting, the parameters need to be tuned jointly rather than set independently one after the other. The important parameters are `n_estimators`, `max_depth`, and `learning_rate`.
Let’s first discuss the `max_depth` parameter. We saw in the section on gradient-boosting that the algorithm fits the error of the previous tree in the ensemble. Thus, fitting fully grown trees will be detrimental. Indeed, the first tree of the ensemble would perfectly fit (overfit) the data and thus no subsequent tree would be required, since there would be no residuals. Therefore, the tree used in gradient-boosting should have a low depth, typically between 3 and 8 levels.
With this consideration in mind, the deeper the trees, the faster the residuals will be corrected and the fewer learners are required. So `n_estimators` should be increased if `max_depth` is lower.
Finally, we have overlooked the impact of the `learning_rate` parameter until now. When fitting the residuals, one could choose whether each tree should try to correct the remaining error completely or only a fraction of it. The learning-rate allows you to control this behaviour. With a small learning-rate, each tree corrects only a small fraction of the remaining error, so more estimators are needed to correct the overall error. With a large learning-rate (e.g., 1), each tree fits the residuals completely. However, a too-large learning-rate tends to produce an overfitted ensemble, similar to having a too-large tree depth.
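A sketch contrasting a small and a large learning-rate; the values and the number of estimators are illustrative:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_validate

for learning_rate in (0.01, 1.0):
    gbdt = GradientBoostingRegressor(
        n_estimators=100, max_depth=3, learning_rate=learning_rate
    )
    cv_results = cross_validate(gbdt, data, target, n_jobs=-1)
    print(
        f"learning_rate={learning_rate}: "
        f"mean R2 = {cv_results['test_score'].mean():.2f}"
    )
```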
Accelerating gradient-boosting¶
We previously mentioned that random-forest is an efficient algorithm since each tree of the ensemble can be fitted at the same time independently. Therefore, the algorithm scales efficiently with both the number of CPUs and the number of samples.
In gradient-boosting, the algorithm is sequential: it requires the `N-1` previous trees to have been fit to be able to fit the tree at stage `N`. Therefore, the algorithm is quite computationally expensive. The most expensive part of this algorithm is the search for the best split in each tree, which uses a brute-force approach: all possible splits are evaluated and the best one is picked. We explained this process in the “tree in depth” notebook, which you can refer to.
To accelerate the gradient-boosting algorithm, one could reduce the number of splits to be evaluated. As a consequence, the performance of such a tree would be reduced. However, since we are combining several trees in a gradient-boosting, we can add more estimators to overcome this issue.
This algorithm is called `HistGradientBoostingClassifier` and `HistGradientBoostingRegressor`. Each feature in the dataset `X` is first binned by computing histograms, which are later used to evaluate the potential splits. The number of splits to evaluate is then much smaller. This algorithm becomes much more efficient than conventional gradient-boosting when the dataset has 10,000+ samples.
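A minimal usage sketch on the California housing data; `max_iter` (the number of boosting iterations) is set arbitrarily, and note that scikit-learn versions before 1.0 additionally required `from sklearn.experimental import enable_hist_gradient_boosting`:

```python
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_validate

histogram_gradient_boosting = HistGradientBoostingRegressor(max_iter=200, random_state=0)
cv_results = cross_validate(histogram_gradient_boosting, data, target, n_jobs=-1)

print("Mean R2:      ", cv_results["test_score"].mean())
print("Mean fit time:", cv_results["fit_time"].mean())
```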
Below, we give an example on a larger dataset and compare the computation time with the earlier experiment from the previous section.
The histogram gradient-boosting is the best algorithm in terms of score. It also scales much better as the number of samples increases, while the normal gradient-boosting does not.
Wrap-up¶
In this notebook we discussed ensemble learners, which combine several simpler learners. We saw two strategies: one based on bootstrap samples, which allows learners to be fitted in a parallel manner, and the other, called boosting, which fits learners in a sequential manner.
From these two families, we mainly focused on giving intuitions regarding the internal machinery of the random forest and gradient-boosting algorithms which are state-of-the-art methods.