Exercise 02
The goal is to find the best set of hyper-parameters, i.e. the combination that maximizes the model's cross-validated performance on the training set.
Here again we limit the size of the training set to make the computation
run faster. Feel free to increase the train_size value if your computer
is powerful enough.
import numpy as np
import pandas as pd

# Load the adult census dataset
df = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = df[target_name].to_numpy()
# Drop the target and the "fnlwgt" column (a census sampling weight,
# not a predictive feature)
data = df.drop(columns=[target_name, "fnlwgt"])
from sklearn.model_selection import train_test_split

# Keep the training set small so that the search below runs quickly;
# increase train_size if your machine can handle it.
df_train, df_test, target_train, target_test = train_test_split(
    data, target, train_size=0.2, random_state=42)
TODO: create your machine learning pipeline
You should:

- preprocess the categorical columns using a OneHotEncoder and use a
  StandardScaler to normalize the numerical data;
- use a LogisticRegression as a predictive model.

Start by defining the columns and the preprocessing pipelines to be
applied to each set of columns.
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
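A minimal sketch of this step, assuming the categorical columns are the ones with an object dtype (the column lists below are produced by make_column_selector rather than hard-coded):

from sklearn.compose import make_column_selector as selector

# Split the column names by data type: object columns are treated as
# categorical, everything else as numerical.
categorical_columns = selector(dtype_include=object)(data)
numerical_columns = selector(dtype_exclude=object)(data)

# handle_unknown="ignore" encodes categories unseen at fit time as all
# zeros instead of raising an error (combining it with the drop
# parameter tuned below requires a recent scikit-learn version).
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()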
Subsequently, create a ColumnTransformer to redirect each set of columns
to its preprocessing pipeline.
from sklearn.compose import ColumnTransformer
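One way to write it, reusing the names defined in the sketch above (the transformer names "one_hot_encoder" and "standard_scaler" are arbitrary choices):

# Route each group of columns to its dedicated preprocessor
preprocessor = ColumnTransformer(
    [
        ("one_hot_encoder", categorical_preprocessor, categorical_columns),
        ("standard_scaler", numerical_preprocessor, numerical_columns),
    ]
)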
Finally, concatenate the preprocessing pipeline with a logistic regression.
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
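A possible final pipeline; raising max_iter here is an assumption made only to avoid convergence warnings, not part of the exercise:

# Chain the preprocessing and the classifier into a single estimator
model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))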
TODO: make your random search
Use a RandomizedSearchCV to find the best set of hyper-parameters by
tuning the following parameters of the LogisticRegression model:

- C with values ranging from 0.001 to 10. You can use a reciprocal
  distribution (i.e. scipy.stats.reciprocal);
- solver with possible values being "liblinear" and "lbfgs";
- penalty with possible values being "l2" and "l1".
In addition, try several preprocessing strategies with the OneHotEncoder
by dropping (or not) the first category of each feature when encoding
the categorical data (the drop parameter of OneHotEncoder).
Notes: some combinations of the hyper-parameters proposed above are
invalid (for instance, the lbfgs solver does not support the l1
penalty). You can make the parameter search accept such failures by
setting error_score to np.nan. The warning messages give more details
on which parameter combinations are invalid, but the computation will
proceed.
Once the computation has completed, print the best combination of parameters
stored in the best_params_
attribute.
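A sketch putting the pieces together, assuming the model pipeline from the sketches above; the double-underscore parameter names follow the step names auto-generated by make_pipeline and ColumnTransformer, and n_iter, n_jobs and random_state are arbitrary choices:

from scipy.stats import reciprocal
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    # log-uniform distribution between 0.001 and 10
    "logisticregression__C": reciprocal(0.001, 10),
    "logisticregression__solver": ["liblinear", "lbfgs"],
    "logisticregression__penalty": ["l2", "l1"],
    # tune whether the encoder drops the first category of each feature
    "columntransformer__one_hot_encoder__drop": [None, "first"],
}

model_random_search = RandomizedSearchCV(
    model,
    param_distributions=param_distributions,
    n_iter=20,
    error_score=np.nan,  # invalid combinations get a NaN score
    n_jobs=2,
    random_state=1,
)
model_random_search.fit(df_train, target_train)
print(model_random_search.best_params_)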