📃 Solution for Exercise M6.01

The aim of this notebook is to investigate if we can tune the hyperparameters of a bagging regressor and evaluate the gain obtained.

We will load the California housing dataset and split it into a training and a testing set.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data, target = fetch_california_housing(as_frame=True, return_X_y=True)
target *= 100  # rescale the target to k$
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0, test_size=0.5
)

Note

If you want a deeper overview regarding this dataset, you can refer to the Appendix - Datasets description section at the end of this MOOC.

Create a BaggingRegressor and pass a DecisionTreeRegressor as its estimator parameter. Train the regressor and evaluate its generalization performance on the testing set using the mean absolute error.

# solution
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

tree = DecisionTreeRegressor()
bagging = BaggingRegressor(estimator=tree, n_jobs=2)
bagging.fit(data_train, target_train)
target_predicted = bagging.predict(data_test)
print(
    "Basic mean absolute error of the bagging regressor:\n"
    f"{mean_absolute_error(target_test, target_predicted):.2f} k$"
)

Basic mean absolute error of the bagging regressor:
36.70 k$
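As a reminder of what the metric measures, here is a minimal sketch on made-up numbers (the two targets and predictions below are illustrative assumptions, not values from the housing dataset). The mean absolute error is the average of the absolute differences between the true and predicted targets, so it is expressed in the same unit as the target, here k$.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true prices and predictions, in k$ (illustrative values only).
y_true = np.array([100.0, 200.0])
y_pred = np.array([110.0, 180.0])

# Manual computation: mean of |y_true - y_pred| = (10 + 20) / 2.
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(f"manual: {mae_manual:.2f} k$, scikit-learn: {mae_sklearn:.2f} k$")
```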

Now, create a RandomizedSearchCV instance using the previous model and tune the important parameters of the bagging regressor. Find the best parameters and check whether any set of parameters improves on the default regressor, still using the mean absolute error as the metric.

Tip

You can list the bagging regressor’s parameters using the get_params method.

# solution
for param in bagging.get_params().keys():
    print(param)

bootstrap
bootstrap_features
estimator__ccp_alpha
estimator__criterion
estimator__max_depth
estimator__max_features
estimator__max_leaf_nodes
estimator__min_impurity_decrease
estimator__min_samples_leaf
estimator__min_samples_split
estimator__min_weight_fraction_leaf
estimator__monotonic_cst
estimator__random_state
estimator__splitter
estimator
max_features
max_samples
n_estimators
n_jobs
oob_score
random_state
verbose
warm_start

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    "n_estimators": randint(10, 30),
    "max_samples": [0.5, 0.8, 1.0],
    "max_features": [0.5, 0.8, 1.0],
    "estimator__max_depth": randint(3, 10),
}
search = RandomizedSearchCV(
    bagging, param_grid, n_iter=20, scoring="neg_mean_absolute_error"
)
_ = search.fit(data_train, target_train)

import pandas as pd

columns = [f"param_{name}" for name in param_grid.keys()]
columns += ["mean_test_error", "std_test_error"]
cv_results = pd.DataFrame(search.cv_results_)
cv_results["mean_test_error"] = -cv_results["mean_test_score"]
cv_results["std_test_error"] = cv_results["std_test_score"]
cv_results[columns].sort_values(by="mean_test_error")

	param_n_estimators	param_max_samples	param_max_features	param_estimator__max_depth	mean_test_error	std_test_error
13	15	0.8	0.8	9	39.184023	2.279323
12	21	0.5	0.8	9	39.577145	0.866012
8	13	1.0	0.8	8	40.123118	1.418177
2	17	0.5	0.8	7	42.703003	1.338469
7	10	0.8	0.8	7	42.863238	1.363612
3	24	0.8	1.0	7	42.928921	0.915240
15	22	1.0	1.0	7	43.125429	1.333366
0	25	0.8	0.5	9	44.422573	1.340609
18	22	1.0	0.8	6	44.946403	1.175365
6	23	1.0	0.8	6	44.962624	0.727947
5	22	1.0	1.0	6	45.284802	1.256078
10	10	0.8	1.0	6	46.071632	0.945145
4	25	0.8	0.5	7	46.862265	1.044763
1	27	0.8	1.0	5	48.152370	1.255169
16	16	1.0	0.5	7	49.446586	2.231809
9	14	0.8	0.5	5	51.703648	1.034449
11	21	1.0	0.8	4	52.562006	1.399126
14	28	0.5	1.0	3	56.065139	0.976541
17	21	1.0	0.5	4	56.207288	0.910390
19	15	0.8	0.8	3	58.248782	1.534342

target_predicted = search.predict(data_test)
print(
    "Mean absolute error after tuning of the bagging regressor:\n"
    f"{mean_absolute_error(target_test, target_predicted):.2f} k$"
)

Mean absolute error after tuning of the bagging regressor:
40.63 k$

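In this run, tuning did not help: the tuned model reaches a mean absolute error of 40.63 k$ on the testing set, against 36.70 k$ for the default bagging regressor. Note that the search only samples values of estimator__max_depth between 3 and 9, whereas the default trees are grown without any depth limit. We see that the bagging regressor does not need much hyperparameter tuning compared to a single decision tree.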