Ensemble learning: when many are better than the one

In this notebook, we will go in depth into algorithms that combine several simple learners (e.g. decision trees, linear models). We will see that combining simple learners results in a more powerful and robust learner. We will focus on two families of ensemble methods:

  • ensemble using bootstrap (e.g. bagging and random-forest);

  • ensemble using boosting (e.g. adaptive boosting and gradient-boosting decision tree).

Benefits of ensemble methods at first glance

In this section, we will give a quick demonstration on the power of combining several learners instead of fine-tuning a single learner.

We will start by loading the “California Housing” dataset.

In this dataset, we want to predict the median house value in some district in California based on demographic and geographic data.
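A minimal sketch of how the data could be loaded with scikit-learn (the variable names are ours, not necessarily those of the original notebook):

```python
from sklearn.datasets import fetch_california_housing

# The target is the median house value per district, in units of $100,000.
data, target = fetch_california_housing(as_frame=True, return_X_y=True)
```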

We start by learning a single decision tree regressor. As we previously presented in the “tree in depth” notebook, this learner needs to be tuned to overcome over- or under-fitting. Indeed, the default parameters will not necessarily lead to an optimal decision tree. Instead of using the default value, we should search via cross-validation the optimal value of the important parameters such as max_depth, min_samples_split, or min_samples_leaf.

We recall that we need to tune these parameters because decision trees tend to overfit the training data when grown deep, and there is no general rule for how to limit them. Thus, not performing a search could leave us with a poorly fitted model.

First, let’s keep a set of data to test our final model.
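For instance, with a simple train/test split (the random_state value is arbitrary):

```python
from sklearn.model_selection import train_test_split

# Keep a left-out test set to evaluate the final models.
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0
)
```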

We will first make a grid-search to fine-tune the parameters that we mentioned earlier.
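A possible grid search; the exact grid values below are our assumption, not necessarily those used originally:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

param_grid = {
    "max_depth": [3, 5, 8, None],
    "min_samples_split": [2, 10, 30, 50],
    "min_samples_leaf": [1, 5, 10, 20],
}
grid_search = GridSearchCV(
    DecisionTreeRegressor(random_state=0), param_grid=param_grid, n_jobs=2
)
grid_search.fit(data_train, target_train)
```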

We can create a dataframe storing the important information collected during the tuning of the parameters and investigate the results.
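For example, by converting cv_results_ into a dataframe and keeping the most informative columns:

```python
import pandas as pd

cv_results = pd.DataFrame(grid_search.cv_results_)
columns_of_interest = [
    "param_max_depth",
    "param_min_samples_split",
    "param_min_samples_leaf",
    "mean_test_score",
    "rank_test_score",
    "mean_fit_time",
]
cv_results[columns_of_interest].sort_values(by="rank_test_score").head()
```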

From these results, we can see that the best combination of parameters corresponds to a tree whose depth is not limited, a minimum number of samples to create a leaf equal to 1 (both being the default values), and a minimum number of samples to make a split of 50 (much higher than the default value).

It is interesting to look at the total amount of time it took to fit all these different models. In addition, we can check the performance of the optimal decision tree on the left-out testing data.
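A rough way to estimate the total fitting time from the search results and to score the best tree on the test set:

```python
# Approximate total fitting time: mean fit time per candidate times the number
# of cross-validation splits, summed over all candidates.
total_fit_time = (cv_results["mean_fit_time"] * grid_search.n_splits_).sum()
print(f"Total fitting time during the search: {total_fit_time:.1f} seconds")
print(f"R2 of the best tree on the test set: "
      f"{grid_search.score(data_test, target_test):.3f}")
```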

Hence, we have a model with an \(R^2\) score below 0.7. The number of models fitted during the search is the number of cross-validation folds multiplied by the number of parameter combinations in the grid. Therefore, the computational cost is quite high.

Now we will use an ensemble method called bagging. More details about this method will be discussed in the next section. In short, this method uses a base regressor (here, decision tree regressors) and trains several of them on slightly different versions of the training set. Then, the predictions of all these base regressors are combined by averaging.

Here, we will use 50 decision trees and check the fitting time as well as the performance on the left-out testing data. It is important to note that we are not going to tune any parameter of the decision tree.
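A sketch of such an ensemble; note that the default base estimator of BaggingRegressor is already a decision tree regressor, so we only set the number of estimators:

```python
import time

from sklearn.ensemble import BaggingRegressor

bagging_regressor = BaggingRegressor(n_estimators=50, n_jobs=2, random_state=0)

start = time.time()
bagging_regressor.fit(data_train, target_train)
print(f"Fitting time: {time.time() - start:.1f} seconds")
print(f"R2 of bagging on the test set: "
      f"{bagging_regressor.score(data_test, target_test):.3f}")
```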

We can see that the computation time is much shorter for training the full ensemble than for the parameter search of a single tree. In addition, the score is significantly improved, with an \(R^2\) close to 0.8. Furthermore, note that this result is obtained before any parameter tuning. This shows the motivation behind the use of an ensemble learner: it gives a relatively good baseline with decent performance without any parameter tuning.

Now, we will discuss in detail two ensemble families: bagging and boosting.

Bagging

Bagging stands for Bootstrap AGGregatING. It uses bootstrap samples to learn several models. At predict time, the predictions of each learner are aggregated to give the final predictions.

Let’s define a simple dataset (which we have used before in a previous notebook).
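A plausible toy dataset; the exact generating process below (a noisy sine wave) is our assumption, not necessarily the one used in the original notebook:

```python
import numpy as np

rng = np.random.RandomState(0)

# A single feature (kept 2D for scikit-learn) with a non-linear, noisy target.
x = rng.uniform(low=-3, high=3, size=(200, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=200)
```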

The link between our feature and the target to predict is non-linear. However, a decision tree is capable of fitting such data.

Let’s see how we can use bootstrapping to learn several trees.

Bootstrap sample

A bootstrap sample corresponds to a resampling, with replacement, of the original dataset, producing a sample of the same size as the original dataset. Thus, the bootstrap sample will contain some data points several times while some of the original data points will not be present at all.

We will create a function that given x and y will return a bootstrap sample x_bootstrap and y_bootstrap.
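A minimal implementation of such a function:

```python
def bootstrap_sample(x, y, seed=0):
    """Resample (x, y) with replacement, keeping the original dataset size."""
    rng = np.random.RandomState(seed)
    bootstrap_indices = rng.choice(
        np.arange(y.shape[0]), size=y.shape[0], replace=True
    )
    return x[bootstrap_indices], y[bootstrap_indices]
```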

We will generate 3 bootstrap samples and qualitatively check the difference with the original dataset.

We observe that the 3 generated bootstrap samples are all different. To confirm this intuition, we can check the number of unique samples in the bootstrap samples.

On average, 63.2% of the distinct data points of the original dataset will be present in a given bootstrap sample; the remaining 36.8% of its entries are repetitions of already-drawn points. Indeed, the probability that a given original point is never selected in \(n\) draws with replacement is \((1 - 1/n)^n \approx e^{-1} \approx 0.368\).

So, we are able to generate many datasets, all slightly different. Now, we can fit a decision tree to each of these datasets and each decision tree shall be slightly different as well.

We can plot these decision functions on the same plot to see the difference.

Aggregating

Once our trees are fitted and we are able to get predictions for each of them, we also need to combine them. In regression, the most straightforward approach is to average the different predictions from all learners. We can plot the averaged predictions from the previous example.
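A sketch of the full procedure by hand, assuming the bootstrap_sample function and the toy dataset defined above (the tree depth is our choice):

```python
from sklearn.tree import DecisionTreeRegressor

# Fit one tree per bootstrap sample, then average their predictions on a grid.
grid = np.linspace(-3, 3, 300).reshape(-1, 1)

bag_of_trees = []
for seed in range(3):
    x_bootstrap, y_bootstrap = bootstrap_sample(x, y, seed=seed)
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    bag_of_trees.append(tree.fit(x_bootstrap, y_bootstrap))

tree_predictions = np.stack([tree.predict(grid) for tree in bag_of_trees])
averaged_predictions = tree_predictions.mean(axis=0)
```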

The unbroken red line shows the averaged predictions, which would be the final predictions given by our ‘bag’ of decision tree regressors.

Random forest

A popular machine-learning algorithm is the random forest. A random forest is a modification of the bagging algorithm. In bagging, any classifier or regressor can be used. In a random forest, the base classifier or regressor must be a decision tree. In our previous example, we used decision trees, but we could have used a linear model as the regressor for our bagging algorithm.

In addition, a random forest differs from bagging when used for classification: when searching for the best split, only a subset of the original features is considered. By default, the size of this subset is the square root of the total number of features. In regression, all available features are used.

We will illustrate the usage of a random forest and compare it with the bagging regressor on the “California housing” dataset.
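A minimal comparison, reusing the California housing split from the first section (the number of trees is our choice):

```python
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor

random_forest = RandomForestRegressor(n_estimators=100, n_jobs=2, random_state=0)
bagged_trees = BaggingRegressor(n_estimators=100, n_jobs=2, random_state=0)

for name, model in [("Random forest", random_forest), ("Bagged trees", bagged_trees)]:
    model.fit(data_train, target_train)
    print(f"{name} R2 on the test set: {model.score(data_test, target_test):.3f}")
```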

Notice that we don’t need to provide a base_estimator parameter to RandomForestRegressor: its base model is always a decision tree. Also note that the scores are almost identical. This is because our problem is a regression problem and therefore the number of features considered at each split is the same for the random forest and for bagging.

For classification problems, we would need to pass a tree model instance with the parameter max_features="sqrt" to BaggingClassifier if we wanted it to have the same behaviour as the random forest classifier.

Classifier details

Up to now, we have only focused on regression problems. There are a few differences between regression and classification.

First, the base_estimator should be chosen in line with the problem that is solved: use a classifier with a classification problem and a regressor with a regression problem.

Then, the aggregation method differs between regression and classification: in regression the predictions are averaged, while in classification the majority class (weighted by the predicted probabilities) is predicted.

Summary

We saw in this section two algorithms that use bootstrap samples to create an ensemble of classifiers or regressors. These algorithms train several learners on different bootstrap samples. The predictions are then aggregated. This operation can be done in a very efficient manner since the training of each learner can be done in parallel.

Boosting

We recall that bagging builds an ensemble in a parallel manner: each learner is trained independently from each other. The idea behind boosting is different. The ensemble is a sequence of learners where the Nth learner requires all previous learners, from 1 to N-1.

Intuitively, boosting adds learners to the ensemble to correct the mistakes of the previous learners. We will start with an algorithm named Adaptive Boosting (AdaBoost) to get some intuition about the main ideas behind boosting.

Adaptive Boosting (AdaBoost)

We will first focus on AdaBoost, which we will use for a classification problem. We will load the “penguin” dataset used in the “tree in depth” notebook. We will predict penguin species from the features culmen length and depth.

In addition, we also reuse the function from the previous “tree in depth” notebook to plot the decision function of the tree.

We will purposely train a shallow decision tree. Since the tree is shallow, it is unlikely to overfit and some of the training examples will even be misclassified on the training set.
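A sketch of these steps; the file path and column names below follow the “tree in depth” notebook but are assumptions here, and max_depth=2 is our choice of shallow tree:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Assumed location and column names of the penguins dataset.
penguins = pd.read_csv("../datasets/penguins_classification.csv")
culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_column = "Species"

data_clf, target_clf = penguins[culmen_columns], penguins[target_column]

# A purposely shallow tree so that some training samples remain misclassified.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data_clf, target_clf)
```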

We observe that several samples have been misclassified by the classifier.

We mentioned that boosting relies on creating a new classifier which tries to correct these misclassifications. In scikit-learn, learners support a parameter sample_weight which forces the learner to pay more attention to samples with higher weights during training.

This parameter is set when calling classifier.fit(X, y, sample_weight=weights). We will use this trick to create a new classifier by ‘discarding’ all correctly classified samples and only considering the misclassified samples. Thus, misclassified samples will be assigned a weight of 1 while well-classified samples will be assigned a weight of 0.
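A sketch of this weighting trick, reusing the shallow tree above:

```python
import numpy as np

# Indices of the samples misclassified by the first tree.
target_predicted = tree.predict(data_clf)
misclassified_samples_idx = np.flatnonzero(target_clf != target_predicted)

# Weight of 1 for misclassified samples, 0 for well-classified ones.
sample_weight = np.zeros(target_clf.shape[0])
sample_weight[misclassified_samples_idx] = 1.0

tree_corrector = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_corrector.fit(data_clf, target_clf, sample_weight=sample_weight)
```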

We see that the decision function drastically changed. Qualitatively, we see that the previously misclassified samples are now correctly classified.

However, we are making mistakes on previously well classified samples. Thus, we get the intuition that we should weight the predictions of each classifier differently, most probably by using the number of mistakes each classifier is making.

So we could use the classification error to combine both trees.

```python
# Misclassified samples of the second tree (the "corrector").
newly_misclassified_samples_idx = np.flatnonzero(
    target_clf != tree_corrector.predict(data_clf)
)

ensemble_weight = [
    (target_clf.shape[0] - len(misclassified_samples_idx)) / target_clf.shape[0],
    (target_clf.shape[0] - len(newly_misclassified_samples_idx)) / target_clf.shape[0],
]
ensemble_weight
```

The first classifier was 94% accurate and the second one 69% accurate. Therefore, when predicting a class, we should trust the first classifier slightly more than the second one. We could use these accuracy values to weight the predictions of each learner.

To summarize, boosting learns several classifiers, each of which will focus more or less on specific samples of the dataset. Boosting is thus different from bagging: here we never resample our dataset, we just assign different weights to the original dataset.

Boosting requires some strategy to combine the learners together:

  • one needs to define a way to compute the weights to be assigned to samples;

  • one needs to assign a weight to each learner when making predictions.

Indeed, we defined a really simple scheme to assign sample weights and learner weights. However, there is a statistical theory for how these sample and learner weights can be optimally computed (see, for instance, the chapter on boosting in The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman).

We will use the AdaBoost classifier implemented in scikit-learn and look at the underlying decision tree classifiers trained.
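A minimal sketch; the tree depth, the number of estimators, and the SAMME algorithm choice are ours, and the estimator parameter is named base_estimator in scikit-learn versions older than 1.2:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

adaboost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=3, random_state=0),
    n_estimators=3,
    algorithm="SAMME",
    random_state=0,
)
adaboost.fit(data_clf, target_clf)

# One weight and one weighted training error per boosting round.
print(f"Classifier weights: {adaboost.estimator_weights_}")
print(f"Classifier errors: {adaboost.estimator_errors_}")
```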

We see that AdaBoost has learnt three different classifiers each of which focuses on different samples. Looking at the weights of each learner, we see that the ensemble gives the highest weight to the first classifier. This indeed makes sense when we look at the errors of each classifier. The first classifier also has the highest classification performance.

While AdaBoost is a nice algorithm to demonstrate the internal machinery of boosting algorithms, it is not the most efficient one. The most efficient algorithm based on boosting is the gradient-boosting decision tree (GBDT) algorithm, which we will discuss now.

Gradient-boosting decision tree (GBDT)

Gradient-boosting differs from AdaBoost in the following way: instead of assigning weights to specific samples, GBDT fits a decision tree on the residuals of the previous tree (hence the name “gradient”: for the squared loss, these residuals are proportional to the negative gradient of the loss). Therefore, each new tree added to the ensemble predicts the error made by the previous learner instead of predicting the target directly.

In this section, we will provide some intuition about the way learners are combined to give the final prediction. In this regard, let’s go back to our regression problem which is more intuitive for demonstrating the underlying machinery.

As we previously discussed, boosting will be based on assembling a sequence of learners. We will start by creating a decision tree regressor. We will fix the depth of the tree so that the resulting learner will underfit the data.
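A sketch using the toy regression dataset from the bagging section; max_depth=3 is our choice to force underfitting:

```python
from sklearn.tree import DecisionTreeRegressor

first_tree = DecisionTreeRegressor(max_depth=3, random_state=0)
first_tree.fit(x, y)

# Residuals: the part of the target that the first tree fails to predict.
residuals = y - first_tree.predict(x)
```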

Since the tree underfits the data, its accuracy is far from perfect on the training data. We can observe this in the figure by looking at the difference between the predictions and the ground-truth data. We represent these errors, called “Residuals”, by unbroken red lines.

Indeed, our initial tree was not expressive enough to handle the complexity of the data, as shown by the residuals. In a gradient-boosting algorithm, the idea is to create a second tree which, given the same data x, will try to predict the residuals instead of the target y. We would therefore have a tree that is able to predict the errors made by the initial tree.

Let’s train such a tree.
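The second tree is fitted on the residuals rather than on the original target:

```python
second_tree = DecisionTreeRegressor(max_depth=3, random_state=0)
second_tree.fit(x, residuals)
```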

We see that this new tree only manages to fit some of the residuals. We will focus on the last sample in x and explain how the predictions of both trees are combined.

For our sample of interest, our initial tree is making an error (small residual). When fitting the second tree, the residual in this case is perfectly fitted and predicted. We will quantitatively check this prediction using the fitted tree. First, let’s check the prediction of the initial tree and compare it with the true value.

As we visually observed, we have a small error. Now, we can use the second tree to try to predict this residual.

We see that our second tree is capable of predicting the exact residual (error) of our first tree. Therefore, we can predict the target of our sample of interest by summing the predictions of all the trees in the ensemble.
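A sketch of this check on the last sample of our toy dataset (with our assumed data, the residual may only be approximately recovered):

```python
# Keep the sample 2D so that scikit-learn accepts it at predict time.
x_sample, y_true = x[-1:], y[-1]

first_prediction = first_tree.predict(x_sample)
predicted_residual = second_tree.predict(x_sample)

print(f"True target value: {y_true:.3f}")
print(f"Prediction of the first tree: {first_prediction[0]:.3f}")
print(f"Prediction of the two-tree ensemble: "
      f"{(first_prediction + predicted_residual)[0]:.3f}")
```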

We chose a sample for which only two trees were enough to make the perfect prediction. However, we saw in the previous plot that two trees were not enough to correct the residuals of all samples. Therefore, one needs to add several trees to the ensemble to successfully correct the error.

We will compare the performance of random-forest and gradient boosting on the California housing dataset.
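A possible comparison with cross-validation on the training split; the number of estimators is our choice:

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_validate

models = {
    "Gradient boosting": GradientBoostingRegressor(n_estimators=200, random_state=0),
    "Random forest": RandomForestRegressor(n_estimators=200, n_jobs=2, random_state=0),
}
for name, model in models.items():
    cv_results = cross_validate(model, data_train, target_train, n_jobs=2)
    print(f"{name}: mean R2 = {cv_results['test_score'].mean():.3f}, "
          f"mean fit time = {cv_results['fit_time'].mean():.1f} s")
```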

In terms of computational performance, the forest can be parallelized and benefits from having multiple CPUs. In terms of scoring performance, both algorithms lead to very close results.

Parameter consideration with random forest and gradient-boosting

In the previous section, we did not discuss the parameters of random forest and gradient-boosting. However, there are a couple of things to keep in mind when setting these parameters.

Random forest

The main parameter to tune for a random forest is the n_estimators parameter. In general, the more trees in the forest, the better the performance will be. However, adding trees slows down the fitting and prediction time. So one has to balance compute time and performance when setting the number of estimators, especially when putting such a learner into production.

The max_depth parameter could also be tuned. Sometimes, there is no need to have fully grown trees. However, be aware that with random forest, trees are generally deep since we are seeking to overfit the learners on the bootstrap samples because this will be mitigated by combining them. Assembling underfitted trees (i.e. shallow trees) might also lead to an underfitted forest.
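A sketch of such a search on the California housing split; the grid values are our assumption:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [10, 50, 100],
    "max_depth": [3, 5, 10, 20],
}
grid_search_rf = GridSearchCV(
    RandomForestRegressor(n_jobs=2, random_state=0), param_grid=param_grid, n_jobs=2
)
grid_search_rf.fit(data_train, target_train)
print(f"Best parameters: {grid_search_rf.best_params_}")
```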

We can observe that in our grid-search, the largest max_depth together with largest n_estimators led to the best performance.

Gradient-boosting decision tree

For gradient-boosting, the parameters are tuned jointly rather than one after the other. The important parameters are n_estimators, max_depth, and learning_rate.

Let’s first discuss the max_depth parameter. We saw in the section on gradient-boosting that the algorithm fits the errors of the previous tree in the ensemble. Thus, fitting fully grown trees would be detrimental: the first tree of the ensemble would perfectly fit (overfit) the data and no subsequent tree would be required, since there would be no residuals left. Therefore, the trees used in gradient-boosting should have a low depth, typically between 3 and 8 levels.

With this consideration in mind, the deeper the trees, the faster the residuals are corrected and the fewer learners are required. So n_estimators should be increased if max_depth is lowered.

Finally, we have overlooked the impact of the learning_rate parameter until now. When fitting the residuals, one can choose whether each tree should try to correct all of the remaining error or only a fraction of it. The learning rate controls this behaviour. With a very small learning rate, each tree only corrects a small fraction of the residuals, so more estimators are needed to correct the overall error. However, too large a learning rate (e.g., 1) tends to produce an overfitted ensemble, similar to having too large a tree depth.
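A small illustration of this trade-off on the California housing split (the values tried are arbitrary):

```python
from sklearn.ensemble import GradientBoostingRegressor

for learning_rate in (0.01, 0.1, 1.0):
    gbdt = GradientBoostingRegressor(
        n_estimators=100, max_depth=3, learning_rate=learning_rate, random_state=0
    )
    gbdt.fit(data_train, target_train)
    print(f"learning_rate={learning_rate}: "
          f"R2 on the test set = {gbdt.score(data_test, target_test):.3f}")
```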

Accelerating gradient-boosting

We previously mentioned that random-forest is an efficient algorithm since each tree of the ensemble can be fitted at the same time independently. Therefore, the algorithm scales efficiently with both the number of CPUs and the number of samples.

In gradient-boosting, the algorithm is sequential: it requires the first N-1 trees to have been fitted before it can fit the tree at stage N. Therefore, the algorithm is quite computationally expensive. The most expensive part of this algorithm is the search for the best split in each tree, which uses a brute-force approach: all possible splits are evaluated and the best one is picked. We explained this process in the “tree in depth” notebook, which you can refer to.

To accelerate the gradient-boosting algorithm, one could reduce the number of splits to be evaluated. As a consequence, the performance of such a tree would be reduced. However, since we are combining several trees in a gradient-boosting, we can add more estimators to overcome this issue.

These estimators are called HistGradientBoostingClassifier and HistGradientBoostingRegressor. Each feature in the dataset X is first binned by computing histograms, which are later used to evaluate the potential splits. The number of splits to evaluate is then much smaller. This algorithm becomes much more efficient than regular gradient-boosting when the dataset has 10,000+ samples.

Below we will give an example of a large dataset and we can compare computation time with the earlier experiment in the previous section.
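A minimal sketch, reusing the California housing split (the original notebook may rely on a larger or resampled dataset); max_iter is our choice:

```python
import time

from sklearn.ensemble import HistGradientBoostingRegressor

hist_gbdt = HistGradientBoostingRegressor(max_iter=200, random_state=0)

start = time.time()
hist_gbdt.fit(data_train, target_train)
print(f"Fitting time: {time.time() - start:.1f} seconds")
print(f"R2 on the test set: {hist_gbdt.score(data_test, target_test):.3f}")
```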

The histogram gradient-boosting is the best algorithm in terms of score. It will also scale when the number of samples increases, while the normal gradient-boosting will not.

Wrap-up

So, in this notebook we discussed ensemble learners, which are a type of learner that combines simpler learners together. We saw two strategies: one based on bootstrap samples, allowing learners to be fitted in a parallel manner, and the other, called boosting, which fits learners in a sequential manner.

From these two families, we mainly focused on giving intuitions regarding the internal machinery of the random forest and gradient-boosting algorithms which are state-of-the-art methods.