Solution for Exercise 02
The goal of this exercise is to evaluate the impact of using an arbitrary integer encoding for categorical variables along with a linear classification model such as logistic regression.
To do so, let's try to use OrdinalEncoder to preprocess the categorical variables. This preprocessor is assembled in a pipeline with LogisticRegression. The performance of the pipeline can be evaluated as usual by cross-validation and then compared to the score obtained when using OneHotEncoder or to some other baseline score.
Because OrdinalEncoder can raise errors if it sees an unknown category at prediction time, we need to pre-compute the list of all possible categories ahead of time:
categories = [data[column].unique()
              for column in data[categorical_columns]]
OrdinalEncoder(categories=categories)
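Note that scikit-learn 0.24 and later offer an alternative to pre-computing the category lists: the encoder can be told to map categories unseen at fit time to a sentinel value. A minimal sketch, assuming such a version is available:
from sklearn.preprocessing import OrdinalEncoder

# Assumes scikit-learn >= 0.24: unknown categories encountered at prediction
# time are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)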
import pandas as pd
df = pd.read_csv("../datasets/adult-census.csv")
target_name = "class"
target = df[target_name].to_numpy()
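# "fnlwgt" is a census sampling weight, not a predictive feature, so we drop
# it along with the target column.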
data = df.drop(columns=[target_name, "fnlwgt"])
from sklearn.compose import make_column_selector as selector
categorical_columns_selector = selector(dtype_exclude=["int", "float"])
categorical_columns = categorical_columns_selector(data)
data_categorical = data[categorical_columns]
categories = [
    data[column].unique() for column in data[categorical_columns]]
categories
[array([' Private', ' Local-gov', ' ?', ' Self-emp-not-inc',
        ' Federal-gov', ' State-gov', ' Self-emp-inc', ' Without-pay',
        ' Never-worked'], dtype=object),
 array([' 11th', ' HS-grad', ' Assoc-acdm', ' Some-college', ' 10th',
        ' Prof-school', ' 7th-8th', ' Bachelors', ' Masters', ' Doctorate',
        ' 5th-6th', ' Assoc-voc', ' 9th', ' 12th', ' 1st-4th',
        ' Preschool'], dtype=object),
 array([' Never-married', ' Married-civ-spouse', ' Widowed', ' Divorced',
        ' Separated', ' Married-spouse-absent', ' Married-AF-spouse'],
       dtype=object),
 array([' Machine-op-inspct', ' Farming-fishing', ' Protective-serv', ' ?',
        ' Other-service', ' Prof-specialty', ' Craft-repair',
        ' Adm-clerical', ' Exec-managerial', ' Tech-support', ' Sales',
        ' Priv-house-serv', ' Transport-moving', ' Handlers-cleaners',
        ' Armed-Forces'], dtype=object),
 array([' Own-child', ' Husband', ' Not-in-family', ' Unmarried', ' Wife',
        ' Other-relative'], dtype=object),
 array([' Black', ' White', ' Asian-Pac-Islander', ' Other',
        ' Amer-Indian-Eskimo'], dtype=object),
 array([' Male', ' Female'], dtype=object),
 array([' United-States', ' ?', ' Peru', ' Guatemala', ' Mexico',
        ' Dominican-Republic', ' Ireland', ' Germany', ' Philippines',
        ' Thailand', ' Haiti', ' El-Salvador', ' Puerto-Rico', ' Vietnam',
        ' South', ' Columbia', ' Japan', ' India', ' Cambodia', ' Poland',
        ' Laos', ' England', ' Cuba', ' Taiwan', ' Italy', ' Canada',
        ' Portugal', ' China', ' Nicaragua', ' Honduras', ' Iran',
        ' Scotland', ' Jamaica', ' Ecuador', ' Yugoslavia', ' Hungary',
        ' Hong', ' Greece', ' Trinadad&Tobago',
        ' Outlying-US(Guam-USVI-etc)', ' France', ' Holand-Netherlands'],
       dtype=object)]
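To make the arbitrariness of the mapping concrete, here is a quick sketch (the encoder and encoded variable names are only illustrative) that prints the integer codes assigned to the first values of the first categorical column:
from sklearn.preprocessing import OrdinalEncoder

# Sketch: each category is replaced by an integer index; the resulting
# ordering of the codes carries no meaning.
encoder = OrdinalEncoder(categories=categories)
encoded = encoder.fit_transform(data_categorical)
print(data_categorical.iloc[:5, 0].tolist())
print(encoded[:5, 0])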
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression
model = make_pipeline(
    OrdinalEncoder(categories=categories),
    LogisticRegression(max_iter=1000))
scores = cross_val_score(model, data_categorical, target)
print(f"The different scores obtained are: \n{scores}")
The different scores obtained are:
[0.75207288 0.75545092 0.75665438 0.75665438 0.7528665 ]
print(f"The accuracy is: {scores.mean():.3f} +- {scores.std():.3f}")
The accuracy is: 0.755 +- 0.002
Using an arbitrary mapping from string labels to integers, as done here, causes the linear model to make unwarranted assumptions about the relative ordering of the categories.
This prevents the model from learning anything predictive enough, and the cross-validated score is even lower than the baseline we would obtain by ignoring the input data and always predicting the most frequent class:
from sklearn.dummy import DummyClassifier
scores = cross_val_score(DummyClassifier(strategy="most_frequent"),
                         data_categorical, target)
print(f"The different scores obtained are: \n{scores}")
print(f"The accuracy is: {scores.mean():.3f} +- {scores.std():.3f}")
The different scores obtained are:
[0.76067151 0.76067151 0.76074939 0.76074939 0.76074939]
The accuracy is: 0.761 +- 0.000
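This baseline simply reflects the frequency of the most common class in the target; a quick check:
# The majority class accounts for about 76% of the rows, which matches the
# dummy classifier's accuracy above.
print(df[target_name].value_counts(normalize=True))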
By comparison, a categorical encoding that does not assume any ordering in the categories can lead to a significantly higher score:
from sklearn.preprocessing import OneHotEncoder
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=1000))
scores = cross_val_score(model, data_categorical, target)
print(f"The different scores obtained are: \n{scores}")
print(f"The accuracy is: {scores.mean():.3f} +- {scores.std():.3f}")
The different scores obtained are:
[0.83222438 0.83560242 0.82872645 0.83312858 0.83466421]
The accuracy is: 0.833 +- 0.002
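The gap between the two pipelines is easier to interpret by comparing the number of features each encoding produces: one-hot encoding creates one column per category, so the linear model can fit an independent coefficient for each category instead of a single coefficient per original column. A quick sketch reusing the objects defined above (the *_features names are only illustrative):
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordinal encoding keeps one column per original feature, while one-hot
# encoding expands each feature into one column per category.
ordinal_features = OrdinalEncoder(categories=categories).fit_transform(data_categorical)
one_hot_features = OneHotEncoder(handle_unknown="ignore").fit_transform(data_categorical)
print(f"Ordinal encoding: {ordinal_features.shape[1]} features")
print(f"One-hot encoding: {one_hot_features.shape[1]} features")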