Exercise 02
The goal is to find the best set of hyper-parameters, i.e. those which maximize the cross-validated performance on the training set.
Here again, we limit the size of the training set to make the computation run faster. Feel free to increase the train_size value if your computer is powerful enough.
import numpy as np
import pandas as pd
df = pd.read_csv("../datasets/adult-census.csv")
target_name = "class"
target = df[target_name].to_numpy()
# Drop the target and "fnlwgt" (a census sampling weight, not a feature).
data = df.drop(columns=[target_name, "fnlwgt"])
from sklearn.model_selection import train_test_split
df_train, df_test, target_train, target_test = train_test_split(
data, target, train_size=0.2, random_state=42)
TODO: create your machine learning pipeline. You should:
- preprocess the categorical columns using a OneHotEncoder;
- use a StandardScaler to normalize the numerical data;
- use a LogisticRegression as a predictive model.
Start by defining the columns and the preprocessing pipelines to be applied to each set of columns.
categorical_columns = [
'workclass', 'education', 'marital-status', 'occupation',
'relationship', 'race', 'native-country', 'sex']
# Pre-compute the full list of categories so that the encoder does not fail
# on categories that appear only in the test set.
categories = [data[column].unique()
              for column in categorical_columns]
numerical_columns = [
'age', 'capital-gain', 'capital-loss', 'hours-per-week']
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
categorical_processor = OneHotEncoder(categories=categories)
numerical_processor = StandardScaler()
Subsequently, create a ColumnTransformer to dispatch each group of columns to its preprocessing pipeline.
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
[('cat-preprocessor', categorical_processor, categorical_columns),
('num-preprocessor', numerical_processor, numerical_columns)]
)
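As a quick sanity check (an addition to the exercise, not a required step), you can fit the preprocessor alone and inspect the shape of the transformed training data; the one-hot encoding expands each categorical column into one column per category:
# Fit the preprocessor alone and check the dimensionality of the output.
# The result is a sparse matrix because of the one-hot encoding.
X_preprocessed = preprocessor.fit_transform(df_train)
print(X_preprocessed.shape)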
Finally, chain the preprocessor and a logistic regression into a single pipeline.
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
model = make_pipeline(preprocessor, LogisticRegression())
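Before tuning anything, it can be useful to record a baseline. The following cell is an addition to the exercise: it cross-validates the pipeline with the default hyper-parameters so that the tuned model can later be compared against it.
from sklearn.model_selection import cross_val_score
# Cross-validated accuracy of the pipeline with default hyper-parameters.
baseline_scores = cross_val_score(model, df_train, target_train)
print(f"Baseline accuracy: {baseline_scores.mean():.3f} "
      f"+/- {baseline_scores.std():.3f}")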
Use a RandomizedSearchCV to find the best set of hyper-parameters by tuning the following parameters for the LogisticRegression model:
- C with values ranging from 0.001 to 10. You can use a reciprocal distribution (i.e. scipy.stats.reciprocal);
- solver with possible values being "liblinear" and "lbfgs";
- penalty with possible values being "l2" and "l1".
In addition, try several preprocessing strategies with the OneHotEncoder
by always (or not) dropping the first column when encoding the categorical
data.
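To see what this option does on a minimal example (a toy column, not the census data), drop="first" removes one redundant indicator column per feature, since the first category is implied when all the other indicator columns are zero:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
toy = pd.DataFrame({"color": ["blue", "green", "red"]})
# Full encoding: one indicator column per category (3 columns).
print(OneHotEncoder().fit_transform(toy).toarray())
# drop="first": the first category becomes the implicit baseline (2 columns).
print(OneHotEncoder(drop="first").fit_transform(toy).toarray())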
Notes: some combinations of the hyper-parameters proposed above are invalid. You can make the parameter search accept such failures by setting error_score to np.nan. The warning messages give more details on which parameter combinations failed, but the computation will proceed.
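For instance, "lbfgs" only supports the "l2" penalty, so fitting the invalid combination raises an error; this is exactly the kind of failure that error_score=np.nan turns into a missing score instead of aborting the search. A minimal illustration on made-up data:
import numpy as np
from sklearn.linear_model import LogisticRegression
try:
    # "lbfgs" does not support the "l1" penalty, so this fit fails.
    LogisticRegression(solver="lbfgs", penalty="l1").fit(
        np.array([[0.0], [1.0]]), np.array([0, 1]))
except ValueError as exc:
    print(exc)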
Once the computation has completed, print the best combination of parameters
stored in the best_params_
attribute.
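The parameters of a model nested inside a pipeline are addressed by joining the (lowercased) step name and the parameter name with a double underscore. If you are unsure of the exact names to use in the search, you can list them (an extra helper cell, not part of the original exercise):
# List every hyper-parameter reachable through the pipeline, e.g.
# "logisticregression__C" or "columntransformer__cat-preprocessor__drop".
for name in sorted(model.get_params()):
    print(name)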
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import reciprocal
param_distributions = {
"logisticregression__C": reciprocal(0.001, 10),
"logisticregression__solver": ["liblinear", "lbfgs"],
"logisticregression__penalty": ["l2", "l1"],
"columntransformer__cat-preprocessor__drop": [None, "first"]
}
model_random_search = RandomizedSearchCV(
model, param_distributions=param_distributions,
n_iter=20, error_score=np.nan, n_jobs=2, verbose=1)
model_random_search.fit(df_train, target_train)
model_random_search.best_params_
Fitting 5 folds for each of 20 candidates, totalling 100 fits
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed: 4.7s finished
{'columntransformer__cat-preprocessor__drop': 'first',
'logisticregression__C': 0.3868451621673283,
'logisticregression__penalty': 'l1',
'logisticregression__solver': 'liblinear'}
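Since refit=True by default, the search object has been refitted with the best parameters on the whole training set. As an extra step (not part of the original output), we can therefore evaluate the tuned model on the held-out test set:
# `model_random_search` now wraps the best pipeline, retrained on the full
# training set, so it can be scored directly on the test data.
accuracy = model_random_search.score(df_test, target_test)
print(f"Test accuracy: {accuracy:.3f}")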
We could use cv_results = model_random_search.cv_results_ to make the plot at the end of this notebook (you are more than welcome to try!). Instead, we are going to load the results obtained from a similar search with many more iterations (200 instead of 20). This way, we get a more detailed plot while keeping this notebook reasonably fast to run.
# Uncomment this cell if you want to regenerate the results csv file. This
# can take a long time to execute.
#
# model_random_search = RandomizedSearchCV(
# model, param_distributions=param_distributions,
# n_iter=200, error_score=np.nan, n_jobs=-1)
# _ = model_random_search.fit(df_train, target_train)
# cv_results = pd.DataFrame(model_random_search.cv_results_)
# cv_results.to_csv("../figures/randomized_search_results_logistic_regression.csv")
cv_results = pd.read_csv(
"../figures/randomized_search_results_logistic_regression.csv",
index_col=0)
column_results = [f"param_{name}" for name in param_distributions.keys()]
column_results += ["mean_test_score", "std_test_score", "rank_test_score"]
cv_results = cv_results[column_results].sort_values(
"mean_test_score", ascending=False)
cv_results = (
cv_results
.rename(columns={
"param_logisticregression__C": "C",
"param_logisticregression__solver": "solver",
"param_logisticregression__penalty": "penalty",
"param_columntransformer__cat-preprocessor__drop": "drop",
"mean_test_score": "mean test accuracy",
"rank_test_score": "ranking"})
.astype(dtype={'C': 'float64'})
)
cv_results['log C'] = np.log(cv_results['C'])
cv_results["drop"] = cv_results["drop"].fillna("None")
# Drop rows with missing scores (failed fits) and the "solver" column.
cv_results = cv_results.dropna(axis="index").drop(columns=["solver"])
encoding = {}
for col in cv_results:
if cv_results[col].dtype.kind == 'O':
labels, uniques = pd.factorize(cv_results[col])
cv_results[col] = labels
encoding[col] = uniques
encoding
{'penalty': Index(['l2', 'l1'], dtype='object'),
'drop': Index(['None', 'first'], dtype='object')}
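pd.factorize simply maps each distinct label to an integer code in order of first appearance; a minimal example with made-up labels:
import pandas as pd
# factorize returns the integer codes and the corresponding unique labels.
labels, uniques = pd.factorize(pd.Series(["l2", "l1", "l2"]))
print(labels)   # [0 1 0]
print(uniques)  # Index(['l2', 'l1'], dtype='object')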
import plotly.express as px
fig = px.parallel_coordinates(
cv_results.drop(columns=["ranking", "std_test_score"]),
color="mean test accuracy",
dimensions=["log C", "penalty", "drop",
"mean test accuracy"],
color_continuous_scale=px.colors.diverging.Tealrose,
)
fig.show()