Introduction to scikit-learn: basic model hyper-parameters tuning

The process of learning a predictive model is driven by a set of internal parameters and a set of training data. These internal parameters are called hyper-parameters and are specific to each family of models. In addition, the optimal set of hyper-parameters is specific to each dataset, and thus these hyper-parameters always need to be tuned. In this notebook we will use the words “hyper-parameters” and “parameters” interchangeably.

This notebook shows:

  • the influence of changing model hyper-parameters;

  • how to tune these hyper-parameters;

  • how to evaluate the model performance together with hyper-parameter tuning.

import pandas as pd

df = pd.read_csv("../datasets/adult-census.csv")
target_name = "class"
target = df[target_name].to_numpy()
target
array([' <=50K', ' <=50K', ' >50K', ..., ' <=50K', ' <=50K', ' >50K'],
      dtype=object)
data = df.drop(columns=[target_name, "fnlwgt"])
data.head()
age workclass education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
0 25 Private 11th 7 Never-married Machine-op-inspct Own-child Black Male 0 0 40 United-States
1 38 Private HS-grad 9 Married-civ-spouse Farming-fishing Husband White Male 0 0 50 United-States
2 28 Local-gov Assoc-acdm 12 Married-civ-spouse Protective-serv Husband White Male 0 0 40 United-States
3 44 Private Some-college 10 Married-civ-spouse Machine-op-inspct Husband Black Male 7688 0 40 United-States
4 18 ? Some-college 10 Never-married ? Own-child White Female 0 0 30 United-States

Once the dataset is loaded, we split it into training and testing sets.

from sklearn.model_selection import train_test_split

df_train, df_test, target_train, target_test = train_test_split(
    data, target, random_state=42)

Then, we define the preprocessing pipeline, which transforms the numerical and categorical data differently.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

categorical_columns = [
    'workclass', 'education', 'marital-status', 'occupation',
    'relationship', 'race', 'native-country', 'sex']

categories = [
    data[column].unique() for column in data[categorical_columns]]

categorical_preprocessor = OrdinalEncoder(categories=categories)

preprocessor = ColumnTransformer(
    [('cat-preprocessor', categorical_preprocessor, categorical_columns)],
    remainder='passthrough', sparse_threshold=0)

Finally, we use a tree-based classifier (i.e. histogram gradient-boosting) to predict whether or not a person earns more than 50,000 dollars a year.

%%time
# for the moment this line is required to import HistGradientBoostingClassifier
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline

model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier",
     HistGradientBoostingClassifier(random_state=42))])
model.fit(df_train, target_train)

print(
    f"The test accuracy score of the gradient boosting pipeline is: "
    f"{model.score(df_test, target_test):.2f}")
The test accuracy score of the gradient boosting pipeline is: 0.88
CPU times: user 3.1 s, sys: 54.9 ms, total: 3.15 s
Wall time: 912 ms

Quiz

  1. What is the default value of the learning_rate parameter of the HistGradientBoostingClassifier class? (link to the API documentation)

  2. Try to edit the code of the previous cell to set the learning_rate parameter to 10 (you can update pipeline parameters with set_params, as sketched after this list). Does this increase the accuracy of the model?

  3. Progressively decrease the value of learning_rate: can you find a value that yields an accuracy higher than with the default learning rate?

  4. Fix learning_rate to 0.05 and try setting the value of max_leaf_nodes to the minimum value of 2. Does it improve the accuracy?

  5. Try to progressively increase the value of max_leaf_nodes to 256 by taking powers of 2. What do you observe?
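For questions 2-5, you do not need to rebuild the pipeline: you can update the hyper-parameters of the "classifier" step with set_params, re-fit, and re-score. The sketch below only illustrates the mechanism; the values are arbitrary, not recommended settings.

# update hyper-parameters of the "classifier" step of the existing pipeline
# (the values below are arbitrary examples)
model.set_params(classifier__learning_rate=0.5,
                 classifier__max_leaf_nodes=16)
model.fit(df_train, target_train)
print(f"Test accuracy: {model.score(df_test, target_test):.2f}")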

The issue of finding the best model parameters

In the previous example, we created a histogram gradient-boosting classifier using the default hyper-parameters, i.e. without explicitly setting them.

However, there is no reason these parameters are optimal for our dataset. For instance, fine-tuning the histogram gradient-boosting model can be achieved by finding the best combination of the following parameters: (i) learning_rate, (ii) min_samples_leaf, and (iii) max_leaf_nodes. Nevertheless, finding this combination manually is tedious. Indeed, there are relationships between these parameters that are difficult to discover by hand: for instance, increasing the size of the trees (increasing max_leaf_nodes) should go together with a lower learning_rate.

Scikit-learn provides tools to explore and evaluate the hyper-parameter space.
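Two such tools are GridSearchCV and RandomizedSearchCV. As a minimal sketch of the first one, applied to the pipeline defined above (the grid below is purely illustrative):

# exhaustive grid-search: every combination of the (illustrative) grid below
# is cross-validated on the training data
from sklearn.model_selection import GridSearchCV

param_grid = {
    'classifier__learning_rate': (0.05, 0.1, 0.5),
    'classifier__max_leaf_nodes': (10, 30, 100),
}
grid_search = GridSearchCV(model, param_grid=param_grid, cv=5)
grid_search.fit(df_train, target_train)
print(grid_search.best_params_)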

Exercise 1:

Using the previously defined model (called model) and two nested for loops, search for the best combination of the learning_rate and max_leaf_nodes parameters. To do so, you will need to train and evaluate the model for each combination of parameter values, using cross_val_score for the evaluation. You can use the following parameter search (one possible skeleton is sketched after this list):

  • learning_rate for the values 0.05, 0.1, 0.5, 1 and 5

  • max_leaf_nodes for the values 3, 10, 30 and 100
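One possible skeleton for this exercise, assuming the model pipeline defined above (the bookkeeping of the best score is one choice among others):

# evaluate each combination with cross-validation on the training data only
from sklearn.model_selection import cross_val_score

best_score, best_params = -1.0, None
for lr in (0.05, 0.1, 0.5, 1, 5):
    for mln in (3, 10, 30, 100):
        model.set_params(classifier__learning_rate=lr,
                         classifier__max_leaf_nodes=mln)
        scores = cross_val_score(model, df_train, target_train, cv=5)
        if scores.mean() > best_score:
            best_score = scores.mean()
            best_params = {'learning_rate': lr, 'max_leaf_nodes': mln}
print(f"Best CV accuracy: {best_score:.3f} with {best_params}")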

Exercise 2:

  • Build a machine learning pipeline:

    • preprocess the categorical columns using a OneHotEncoder and use a StandardScaler to normalize the numerical data.

    • use a LogisticRegression as a predictive model.

  • Make a hyper-parameter search using RandomizedSearchCV, tuning the following parameters:

    • C with values ranging from 0.001 to 10. You can use a reciprocal distribution (i.e. scipy.stats.reciprocal);

    • solver with possible values being "liblinear" and "lbfgs";

    • penalty with possible values being "l2" and "l1";

    • drop (a parameter of the OneHotEncoder) with possible values being None or "first".

You might get some FitFailedWarning; try to explain why. A possible starting point is sketched below.
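This sketch reuses the categorical_columns and categories lists defined earlier; the pipeline structure, step names and distributions below are illustrative choices, not the only valid ones:

# one-hot encode the categorical columns, scale the remaining numerical ones,
# then randomly sample hyper-parameter candidates for the logistic regression
from scipy.stats import reciprocal
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV

preprocessor_lr = ColumnTransformer(
    [('cat', OneHotEncoder(categories=categories), categorical_columns)],
    remainder=StandardScaler())
model_lr = Pipeline([
    ('preprocessor', preprocessor_lr),
    ('classifier', LogisticRegression(max_iter=1000))])

param_distributions = {
    'classifier__C': reciprocal(0.001, 10),
    'classifier__solver': ['liblinear', 'lbfgs'],
    'classifier__penalty': ['l2', 'l1'],
    'preprocessor__cat__drop': [None, 'first'],
}
search = RandomizedSearchCV(model_lr, param_distributions=param_distributions,
                            n_iter=20, cv=5, random_state=42)
search.fit(df_train, target_train)
print(search.best_params_)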

In this notebook, we have:

  • manually tuned the hyper-parameters of a machine-learning pipeline;

  • automatically tuned the hyper-parameters of a machine-learning pipeline by exhaustively searching the best combination from a defined grid;

  • automatically tuned the hyper-parameters of a machine-learning pipeline by drawing candidate values from predefined distributions;

  • nested a hyper-parameter tuning procedure within a cross-validation evaluation procedure.

Main take-away points

  • a grid-search is a costly exhaustive search and does not scale well when the number of parameters to tune increases;

  • a randomized-search always runs within a fixed, user-defined budget (the number of sampled parameter candidates);

  • when assessing the performance of a model, the hyper-parameter search should be performed on the training data of a predefined train-test split;

  • alternatively, it is possible to nest the hyper-parameter tuning within a cross-validation scheme, as sketched below.
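As a minimal sketch of that nesting, reusing the grid_search estimator sketched earlier: passing the search object itself to cross_val_score runs the full tuning procedure inside each outer fold (which is computationally expensive).

# outer cross-validation of the whole tuning procedure:
# each outer training fold re-runs the inner grid-search
from sklearn.model_selection import cross_val_score

outer_scores = cross_val_score(grid_search, data, target, cv=5)
print(f"Generalization accuracy: {outer_scores.mean():.3f} "
      f"+/- {outer_scores.std():.3f}")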