
Solution for Exercise 02

The goal of this exercise is to evaluate the impact of using an arbitrary integer encoding for categorical variables along with a linear classification model such as logistic regression.

To do so, let’s try to use OrdinalEncoder to preprocess the categorical variables. This preprocessor is assembled in a pipeline with LogisticRegression. The performance of the pipeline can be evaluated by cross-validation as usual, and then compared to the score obtained when using OneHotEncoder or to some other baseline score.

Because OrdinalEncoder can raise errors if it sees an unknown category at prediction time, we need to pre-compute the list of all possible categories ahead of time:

categories = [data[column].unique()
              for column in data[categorical_columns]]
OrdinalEncoder(categories=categories)
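
As a side note, not part of the original exercise: scikit-learn 0.24 and later offer an alternative that avoids pre-computing the category lists, by mapping categories unseen during fit to a dedicated sentinel value instead of raising an error. A minimal sketch, assuming such a version is installed:

from sklearn.preprocessing import OrdinalEncoder

# Categories unseen at fit time are encoded as -1 instead of raising an error.
OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)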
import pandas as pd

# Load the adult census dataset.
df = pd.read_csv("../datasets/adult-census.csv")
target_name = "class"
target = df[target_name].to_numpy()
# Drop the target and the "fnlwgt" (census sampling weight) column.
data = df.drop(columns=[target_name, "fnlwgt"])
from sklearn.compose import make_column_selector as selector

# Select all columns that are neither integer nor float, i.e. the categorical ones.
categorical_columns_selector = selector(dtype_exclude=["int", "float"])
categorical_columns = categorical_columns_selector(data)
data_categorical = data[categorical_columns]
categories = [
    data[column].unique() for column in data[categorical_columns]]

categories
[array([' Private', ' Local-gov', ' ?', ' Self-emp-not-inc',
        ' Federal-gov', ' State-gov', ' Self-emp-inc', ' Without-pay',
        ' Never-worked'], dtype=object),
 array([' 11th', ' HS-grad', ' Assoc-acdm', ' Some-college', ' 10th',
        ' Prof-school', ' 7th-8th', ' Bachelors', ' Masters', ' Doctorate',
        ' 5th-6th', ' Assoc-voc', ' 9th', ' 12th', ' 1st-4th',
        ' Preschool'], dtype=object),
 array([' Never-married', ' Married-civ-spouse', ' Widowed', ' Divorced',
        ' Separated', ' Married-spouse-absent', ' Married-AF-spouse'],
       dtype=object),
 array([' Machine-op-inspct', ' Farming-fishing', ' Protective-serv', ' ?',
        ' Other-service', ' Prof-specialty', ' Craft-repair',
        ' Adm-clerical', ' Exec-managerial', ' Tech-support', ' Sales',
        ' Priv-house-serv', ' Transport-moving', ' Handlers-cleaners',
        ' Armed-Forces'], dtype=object),
 array([' Own-child', ' Husband', ' Not-in-family', ' Unmarried', ' Wife',
        ' Other-relative'], dtype=object),
 array([' Black', ' White', ' Asian-Pac-Islander', ' Other',
        ' Amer-Indian-Eskimo'], dtype=object),
 array([' Male', ' Female'], dtype=object),
 array([' United-States', ' ?', ' Peru', ' Guatemala', ' Mexico',
        ' Dominican-Republic', ' Ireland', ' Germany', ' Philippines',
        ' Thailand', ' Haiti', ' El-Salvador', ' Puerto-Rico', ' Vietnam',
        ' South', ' Columbia', ' Japan', ' India', ' Cambodia', ' Poland',
        ' Laos', ' England', ' Cuba', ' Taiwan', ' Italy', ' Canada',
        ' Portugal', ' China', ' Nicaragua', ' Honduras', ' Iran',
        ' Scotland', ' Jamaica', ' Ecuador', ' Yugoslavia', ' Hungary',
        ' Hong', ' Greece', ' Trinadad&Tobago',
        ' Outlying-US(Guam-USVI-etc)', ' France', ' Holand-Netherlands'],
       dtype=object)]
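
As an illustrative check, not part of the original exercise, we can look at what this encoder produces on the first few rows, reusing the data_categorical and categories objects defined above: each string is replaced by its index in the corresponding list.

from sklearn.preprocessing import OrdinalEncoder

# Each category is replaced by its (arbitrary) position in the lists above.
OrdinalEncoder(categories=categories).fit_transform(data_categorical.head())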
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression

model = make_pipeline(
    OrdinalEncoder(categories=categories),
    # Increase max_iter so the solver converges on this dataset.
    LogisticRegression(max_iter=1000))
scores = cross_val_score(model, data_categorical, target)
print(f"The different scores obtained are: \n{scores}")
The different scores obtained are: 
[0.75207288 0.75545092 0.75665438 0.75665438 0.7528665 ]
print(f"The accuracy is: {scores.mean():.3f} +- {scores.std():.3f}")
The accuracy is: 0.755 +- 0.002

Using an arbitrary mapping from string labels to integers, as done here, causes the linear model to make spurious assumptions about the relative ordering of categories.
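
For instance, we can inspect the mapping learned for the "education" column (an illustrative check, reusing the objects defined above): the codes simply follow the order of the category lists, so " Preschool" receives the largest code while " Doctorate" sits in the middle of the scale.

encoder = OrdinalEncoder(categories=categories).fit(data_categorical)
education_idx = categorical_columns.index("education")
# Map each integer code to its category: the order is arbitrary and
# unrelated to the natural ordering of education levels.
dict(enumerate(encoder.categories_[education_idx]))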

These spurious orderings prevent the model from learning anything predictive enough, and the cross-validated score is even lower than the baseline obtained by ignoring the input data and always predicting the most frequent class:

from sklearn.dummy import DummyClassifier

scores = cross_val_score(DummyClassifier(strategy="most_frequent"),
                         data_categorical, target)
print(f"The different scores obtained are: \n{scores}")
print(f"The accuracy is: {scores.mean():.3f} +- {scores.std():.3f}")
The different scores obtained are: 
[0.76067151 0.76067151 0.76074939 0.76074939 0.76074939]
The accuracy is: 0.761 +- 0.000
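
This baseline accuracy simply reflects the class imbalance of the dataset. As a quick illustrative check, not part of the original exercise, the relative class frequencies can be computed directly from the target; the majority class (" <=50K") accounts for roughly 76% of the samples, which matches the baseline score above:

# Relative frequency of each class in the target.
pd.Series(target).value_counts(normalize=True)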

By comparison, a categorical encoding that does not assume any ordering in the categories can lead to a significantly higher score:

from sklearn.preprocessing import OneHotEncoder

model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=1000))
scores = cross_val_score(model, data_categorical, target)
print(f"The different scores obtained are: \n{scores}")
print(f"The accuracy is: {scores.mean():.3f} +- {scores.std():.3f}")
The different scores obtained are: 
[0.83222438 0.83560242 0.82872645 0.83312858 0.83466421]
The accuracy is: 0.833 +- 0.002
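
One-hot encoding gives the linear model one coefficient per category instead of a single coefficient per column, which explains the improvement. As an illustrative check, not part of the original exercise, the encoded design matrix has one binary column per category listed above, i.e. 102 columns in total:

encoder = OneHotEncoder(handle_unknown="ignore").fit(data_categorical)
# One binary column per category, summed over all categorical columns.
encoder.transform(data_categorical).shape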