Exercise 02

The goal is to find the best set of hyper-parameters, i.e. the combination that maximizes the model's performance on a training set.

Here again, we limit the size of the training set to make the computation run faster. Feel free to increase the train_size value if your computer is powerful enough.

import numpy as np
import pandas as pd

df = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = df[target_name].to_numpy()
data = df.drop(columns=[target_name, "fnlwgt"])

from sklearn.model_selection import train_test_split

df_train, df_test, target_train, target_test = train_test_split(
    data, target, train_size=0.2, random_state=42)

TODO: create your machine learning pipeline

You should:

  • preprocess the categorical columns using a OneHotEncoder and use a StandardScaler to normalize the numerical data.

  • use a LogisticRegression as a predictive model.

Start by defining the columns and the preprocessing pipelines to be applied to each set of columns.

categorical_columns = [
    'workclass', 'education', 'marital-status', 'occupation',
    'relationship', 'race', 'native-country', 'sex']

# List the categories of each categorical column so that the encoder is given
# the full set of categories, independently of the train/test split
categories = [data[column].unique() for column in categorical_columns]

numerical_columns = [
    'age', 'capital-gain', 'capital-loss', 'hours-per-week']

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

categorical_processor = OneHotEncoder(categories=categories)
numerical_processor = StandardScaler()

Subsequently, create a ColumnTransformer to redirect each group of columns to its preprocessing pipeline.

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    [('cat-preprocessor', categorical_processor, categorical_columns),
     ('num-preprocessor', numerical_processor, numerical_columns)]
)
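
As a quick sanity check (a sketch, assuming the cells above have been run), we can fit and apply the preprocessor on its own. Note that columns listed in neither group are silently dropped, since the remainder parameter of ColumnTransformer defaults to "drop".

# Fit and apply the preprocessor alone; the result has one column per
# one-hot encoded category plus one per scaled numerical feature
X_preprocessed = preprocessor.fit_transform(df_train)
print(X_preprocessed.shape)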

Finally, concatenate the preprocessing pipeline with a logistic regression.

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

model = make_pipeline(preprocessor, LogisticRegression())
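
As a quick baseline sketch before any tuning, we can estimate the cross-validated accuracy of the pipeline with its default hyper-parameters; the exact value depends on the data split.

from sklearn.model_selection import cross_val_score

# Cross-validated accuracy of the untuned pipeline, as a reference point
baseline_scores = cross_val_score(model, df_train, target_train, cv=5)
print(f"{baseline_scores.mean():.3f} +/- {baseline_scores.std():.3f}")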

Use a RandomizedSearchCV to find the best set of hyper-parameters by tuning the following parameters for the LogisticRegression model:

  • C with values ranging from 0.001 to 10. You can use a reciprocal distribution (i.e. scipy.stats.reciprocal), illustrated in the sketch after this list;

  • solver with possible values being "liblinear" and "lbfgs";

  • penalty with possible values being "l2" and "l1".
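
To see why a reciprocal (log-uniform) distribution suits C, here is a small sampling sketch; the drawn values are random and only illustrative.

from scipy.stats import reciprocal

# Samples are spread uniformly on a log scale between 0.001 and 10, so small
# and large magnitudes of C are explored equally often
print(reciprocal(0.001, 10).rvs(size=5, random_state=0))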

In addition, try several preprocessing strategies with the OneHotEncoder by always (or not) dropping the first column when encoding the categorical data.

Note: some combinations of the hyper-parameters proposed above are invalid. You can make the parameter search accept such failures by setting error_score to np.nan. The warning messages give more details on which parameter combinations are invalid, but the computation will proceed.
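
For instance, the "lbfgs" solver only supports the "l2" penalty. A minimal illustration on a tiny toy dataset (the arrays below are made up for the example):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Fitting with an unsupported solver/penalty combination raises a ValueError,
# which the search records as a failure when error_score=np.nan
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
try:
    LogisticRegression(solver="lbfgs", penalty="l1").fit(X_toy, y_toy)
except ValueError as exc:
    print(exc)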

Once the computation has completed, print the best combination of parameters stored in the best_params_ attribute.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import reciprocal

param_distributions = {
    "logisticregression__C": reciprocal(0.001, 10),
    "logisticregression__solver": ["liblinear", "lbfgs"],
    "logisticregression__penalty": ["l2", "l1"],
    "columntransformer__cat-preprocessor__drop": [None, "first"]
}

model_random_search = RandomizedSearchCV(
    model, param_distributions=param_distributions,
    n_iter=20, error_score=np.nan, n_jobs=2, verbose=1)
model_random_search.fit(df_train, target_train)
model_random_search.best_params_
Fitting 5 folds for each of 20 candidates, totalling 100 fits
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:    4.7s finished
{'columntransformer__cat-preprocessor__drop': 'first',
 'logisticregression__C': 0.3868451621673283,
 'logisticregression__penalty': 'l1',
 'logisticregression__solver': 'liblinear'}
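
The mean cross-validated accuracy of this best combination is stored in the best_score_ attribute:

# Mean cross-validated score of the best candidate found by the search
print(model_random_search.best_score_)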

We could use cv_results = model_random_search.cv_results_ in the plot at the end of this notebook (you are more than welcome to try!). Instead we are going to load the results obtained from a similar search with many more iterations (200 instead of 20).

This way we can have a more detailed plot while being able to run this notebook in a reasonably short amount of time.

# Uncomment this cell if you want to regenerate the results csv file. This
# can take a long time to execute.
#
# model_random_search = RandomizedSearchCV(
#     model, param_distributions=param_distributions,
#     n_iter=200, error_score=np.nan, n_jobs=-1)
# _ = model_random_search.fit(df_train, target_train)
# cv_results = pd.DataFrame(model_random_search.cv_results_)
# cv_results.to_csv("../figures/randomized_search_results_logistic_regression.csv")
cv_results = pd.read_csv(
    "../figures/randomized_search_results_logistic_regression.csv",
    index_col=0)
column_results = [f"param_{name}" for name in param_distributions.keys()]
column_results += ["mean_test_score", "std_test_score", "rank_test_score"]

cv_results = cv_results[column_results].sort_values(
    "mean_test_score", ascending=False)
cv_results = (
    cv_results
    .rename(columns={
        "param_logisticregression__C": "C",
        "param_logisticregression__solver": "solver",
        "param_logisticregression__penalty": "penalty",
        "param_columntransformer__cat-preprocessor__drop": "drop",
        "mean_test_score": "mean test accuracy",
        "rank_test_score": "ranking"})
    .astype(dtype={'C': 'float64'})
)
# C was sampled log-uniformly, so plot it on a log scale
cv_results['log C'] = np.log(cv_results['C'])
cv_results["drop"] = cv_results["drop"].fillna("None")
# Remove rows with NaN scores (failed fits from invalid parameter
# combinations) and the solver column, which is not shown in the plot
cv_results = cv_results.dropna(axis="index").drop(columns=["solver"])
# Encode the remaining string columns as integers so they fit on the
# parallel coordinates axes, keeping the mapping to read the plot
encoding = {}
for col in cv_results:
    if cv_results[col].dtype.kind == 'O':
        labels, uniques = pd.factorize(cv_results[col])
        cv_results[col] = labels
        encoding[col] = uniques
encoding
{'penalty': Index(['l2', 'l1'], dtype='object'),
 'drop': Index(['None', 'first'], dtype='object')}
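
This dictionary lets us map the integer codes on the plot axes back to their original labels, for example:

# Code 0 on the "penalty" axis corresponds to the label "l2"
print(encoding["penalty"][0])
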
import plotly.express as px

fig = px.parallel_coordinates(
    cv_results.drop(columns=["ranking", "std_test_score"]),
    color="mean test accuracy",
    dimensions=["log C", "penalty", "drop",
                "mean test accuracy"],
    color_continuous_scale=px.colors.diverging.Tealrose,
)
fig.show()
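
The plot is interactive: selecting a range along the "mean test accuracy" axis highlights only the parameter combinations that reach those scores, which makes it easier to spot the settings that consistently lead to good models.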