Working with both numerical & categorical variables

In this notebook, we will present:

  • typical ways to deal with categorical variables

  • how to train a predictive model on mixed types of data (i.e. numerical and categorical together)

Let’s first load the data as we did in the previous notebook.

import pandas as pd

df = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = df[target_name].to_numpy()

data = df.drop(columns=[target_name, "fnlwgt"])

Working with categorical variables

As we have seen in the previous section, a numerical variable is a continuous quantity represented by a real or integer number. These variables can be naturally handled by machine learning algorithms that are typically composed of a sequence of arithmetic instructions such as additions and multiplications.

In contrast, categorical variables have discrete values, typically represented by string labels taken from a finite list of possible choices. For instance, the variable native-country in our dataset is a categorical variable because it encodes the data using a finite list of possible countries (along with the ? symbol when this information is missing):

data["native-country"].value_counts()
 United-States                 43832
 Mexico                          951
 ?                               857
 Philippines                     295
 Germany                         206
 Puerto-Rico                     184
 Canada                          182
 El-Salvador                     155
 India                           151
 Cuba                            138
 England                         127
 China                           122
 South                           115
 Jamaica                         106
 Italy                           105
 Dominican-Republic              103
 Japan                            92
 Guatemala                        88
 Poland                           87
 Vietnam                          86
 Columbia                         85
 Haiti                            75
 Portugal                         67
 Taiwan                           65
 Iran                             59
 Greece                           49
 Nicaragua                        49
 Peru                             46
 Ecuador                          45
 France                           38
 Ireland                          37
 Hong                             30
 Thailand                         30
 Cambodia                         28
 Trinadad&Tobago                  27
 Yugoslavia                       23
 Outlying-US(Guam-USVI-etc)       23
 Laos                             23
 Scotland                         21
 Honduras                         20
 Hungary                          19
 Holand-Netherlands                1
Name: native-country, dtype: int64

In the remainder of this section, we will present different strategies to encode categorical data into numerical data which can be used by a machine-learning algorithm.

data.dtypes
age                int64
workclass         object
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
dtype: object
from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_exclude=["int", "float"])
categorical_columns = categorical_columns_selector(data)
categorical_columns
['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native-country']
data_categorical = data[categorical_columns]
data_categorical.head()
workclass education marital-status occupation relationship race sex native-country
0 Private 11th Never-married Machine-op-inspct Own-child Black Male United-States
1 Private HS-grad Married-civ-spouse Farming-fishing Husband White Male United-States
2 Local-gov Assoc-acdm Married-civ-spouse Protective-serv Husband White Male United-States
3 Private Some-college Married-civ-spouse Machine-op-inspct Husband Black Male United-States
4 ? Some-college Never-married ? Own-child White Female United-States
print(
    f"The dataset is composed of {data_categorical.shape[1]} features"
)
The dataset is composed of 8 features

Encoding ordinal categories

The most intuitive strategy is to encode each category with a different number. The OrdinalEncoder will transform the data in such a manner.

from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()
data_encoded = encoder.fit_transform(data_categorical)
data_encoded[:5]
array([[ 4.,  1.,  4.,  7.,  3.,  2.,  1., 39.],
       [ 4., 11.,  2.,  5.,  0.,  4.,  1., 39.],
       [ 2.,  7.,  2., 11.,  0.,  4.,  1., 39.],
       [ 4., 15.,  2.,  7.,  0.,  2.,  1., 39.],
       [ 0., 15.,  4.,  0.,  3.,  4.,  0., 39.]])
print(
    f"The dataset encoded contains {data_encoded.shape[1]} features")
The dataset encoded contains 8 features

We can see that the categories have been encoded for each feature (column) independently. We can also note that the number of features before and after the encoding is the same.

However, one has to be careful when using this encoding strategy. Using this integer representation can lead the downstream models to make the assumption that the categories are ordered: 0 is smaller than 1 which is smaller than 2, etc.

By default, OrdinalEncoder uses a lexicographical strategy to map string category labels to integers. This strategy is completely arbitrary and is often meaningless. For instance, suppose the dataset has a categorical variable named “size” with categories such as “S”, “M”, “L”, “XL”. We would like the integer representation to respect the meaning of the sizes by mapping them to increasing integers such as 0, 1, 2, 3. However, the lexicographical strategy used by default would map the labels “S”, “M”, “L”, “XL” to 2, 1, 0, 3.

The OrdinalEncoder class accepts a “categories” constructor argument to pass in the correct ordering explicitly.
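For illustration, here is a minimal sketch of passing the ordering explicitly, using a hypothetical “size” column that is not part of our dataset:

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

# Hypothetical toy column with an intrinsic order.
sizes = pd.DataFrame({"size": ["S", "XL", "M", "L"]})

# Passing the categories explicitly maps "S", "M", "L", "XL" to 0, 1, 2, 3
# instead of the default lexicographical order.
size_encoder = OrdinalEncoder(categories=[["S", "M", "L", "XL"]])
size_encoder.fit_transform(sizes)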

If a categorical variable does not carry any meaningful order information then this encoding might be misleading to downstream statistical models and you might consider using one-hot encoding instead (see below).

Note however that the impact of violating this ordering assumption is really dependent on the downstream models (for instance linear models are much more sensitive than models built from an ensemble of decision trees).

Encoding nominal categories (without assuming any order)

OneHotEncoder is an alternative encoder that prevents the downstream models from making a false assumption about the ordering of categories. For a given feature, it will create as many new columns as there are possible categories. For a given sample, the value of the column corresponding to its category will be set to 1 while all the columns of the other categories will be set to 0.
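To make this concrete, here is a small sketch on a toy column (hypothetical data, not from our dataset):

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Hypothetical toy column with three categories.
cities = pd.DataFrame({"city": ["London", "Paris", "London", "Tokyo"]})

toy_encoder = OneHotEncoder(sparse=False)
toy_encoder.fit_transform(cities)
# Expected result: one 0/1 column per category (London, Paris, Tokyo):
# array([[1., 0., 0.],
#        [0., 1., 0.],
#        [1., 0., 0.],
#        [0., 0., 1.]])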

print(
    f"The dataset is composed of {data_categorical.shape[1]} features"
)
data_categorical.head()
The dataset is composed of 8 features
workclass education marital-status occupation relationship race sex native-country
0 Private 11th Never-married Machine-op-inspct Own-child Black Male United-States
1 Private HS-grad Married-civ-spouse Farming-fishing Husband White Male United-States
2 Local-gov Assoc-acdm Married-civ-spouse Protective-serv Husband White Male United-States
3 Private Some-college Married-civ-spouse Machine-op-inspct Husband Black Male United-States
4 ? Some-college Never-married ? Own-child White Female United-States
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse=False)
data_encoded = encoder.fit_transform(data_categorical)
data_encoded[:5]
array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0.,
        1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0.]])
print(
    f"The dataset encoded contains {data_encoded.shape[1]} features")
The dataset encoded contains 102 features

Let’s wrap this numpy array in a dataframe with informative column names as provided by the encoder object:

columns_encoded = encoder.get_feature_names(data_categorical.columns)
pd.DataFrame(data_encoded, columns=columns_encoded).head()
workclass_ ? workclass_ Federal-gov workclass_ Local-gov workclass_ Never-worked workclass_ Private workclass_ Self-emp-inc workclass_ Self-emp-not-inc workclass_ State-gov workclass_ Without-pay education_ 10th ... native-country_ Portugal native-country_ Puerto-Rico native-country_ Scotland native-country_ South native-country_ Taiwan native-country_ Thailand native-country_ Trinadad&Tobago native-country_ United-States native-country_ Vietnam native-country_ Yugoslavia
0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
1 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
3 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
4 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

5 rows × 102 columns

Look at how the “workclass” variable of the first 3 records has been encoded and compare this to the original string representation.
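One way to make this comparison (a small sketch reusing the encoded array and the column names computed in the previous cell) is to keep only the columns derived from "workclass":

workclass_columns = [
    name for name in columns_encoded if name.startswith("workclass")]
pd.DataFrame(data_encoded, columns=columns_encoded)[workclass_columns].head(3)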

The number of features after the encoding is more than 10 times larger than in the original data because some variables such as occupation and native-country have many possible categories.
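One quick way to verify this is to count the number of unique categories per column; their sum corresponds to the number of one-hot encoded features:

data_categorical.nunique()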

We can now integrate this encoder inside a machine learning pipeline like we did with numerical data: let’s train a linear classifier on the encoded data and check the performance of this machine learning pipeline using cross-validation.

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

model = make_pipeline(
    OneHotEncoder(handle_unknown='ignore'),
    LogisticRegression(max_iter=1000))
scores = cross_val_score(model, data_categorical, target)
scores
array([0.83222438, 0.83560242, 0.82872645, 0.83312858, 0.83466421])
print(f"The accuracy is: {scores.mean():.3f} +/- {scores.std():.3f}")
The accuracy is: 0.833 +/- 0.002

As you can see, this representation of the categorical variables of the data is slightly more predictive of the revenue than the numerical variables that we used previously.
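If you want to reproduce a numerical-only baseline for comparison, a sketch could look like the following (assuming, as in the previous notebook, a scaled logistic regression on the numerical columns):

from sklearn.preprocessing import StandardScaler

# Select the numerical columns (the complement of the categorical ones).
numerical_columns_selector = selector(dtype_include=["int", "float"])
numerical_columns = numerical_columns_selector(data)

numerical_model = make_pipeline(
    StandardScaler(), LogisticRegression(max_iter=1000))
numerical_scores = cross_val_score(
    numerical_model, data[numerical_columns], target)
print(f"Numerical-only accuracy: {numerical_scores.mean():.3f}")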

Exercise 1:

  • Try to fit a logistic regression model on categorical data transformed by the OrdinalEncoder instead. What do you observe?

Open the dedicated notebook to do this exercise.

Using numerical and categorical variables together

In the previous sections, we saw that we need to treat data differently depending on their nature (i.e. numerical or categorical).

Scikit-learn provides a ColumnTransformer class which will send specific columns to a specific transformer, making it easy to fit a single predictive model on a dataset that combines both kinds of variables together (heterogeneously typed tabular data).

We can first define the columns depending on their data type:

  • binary encoding will be applied to categorical columns with only two possible values (e.g. sex=male or sex=female in this example). Each binary categorical column will be mapped to one numerical column with 0 or 1 values.

  • one-hot encoding will be applied to categorical columns with more than two possible categories. This encoding will create one additional column for each possible categorical value.

  • numerical scaling will be applied to the numerical columns, which will be standardized.

binary_encoding_columns = ['sex']

one_hot_encoding_columns = [
    'workclass', 'education', 'marital-status', 'occupation',
    'relationship', 'race', 'native-country']

scaling_columns = [
    'age', 'education-num', 'hours-per-week', 'capital-gain',
    'capital-loss']

We can now create our ColumnTransformer by specifying a list of triplets (preprocessor name, transformer, columns). Finally, we can define a pipeline to stack this “preprocessor” with our classifier (logistic regression).

from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ('binary-encoder', OrdinalEncoder(), binary_encoding_columns),
    ('one-hot-encoder', OneHotEncoder(handle_unknown='ignore'),
     one_hot_encoding_columns),
    ('standard-scaler', StandardScaler(), scaling_columns)])
model = make_pipeline(
    preprocessor, LogisticRegression(max_iter=1000))

Starting from scikit-learn 0.23, the notebooks can display an interactive view of the pipelines.

from sklearn import set_config
set_config(display='diagram')

model
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('binary-encoder',
                                                  OrdinalEncoder(), ['sex']),
                                                 ('one-hot-encoder',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['workclass', 'education',
                                                   'marital-status',
                                                   'occupation', 'relationship',
                                                   'race', 'native-country']),
                                                 ('standard-scaler',
                                                  StandardScaler(),
                                                  ['age', 'education-num',
                                                   'hours-per-week',
                                                   'capital-gain',
                                                   'capital-loss'])])),
                ('logisticregression', LogisticRegression(max_iter=1000))])

The final model is more complex than the previous models but still follows the same API:

  • the fit method is called to preprocess the data then train the classifier;

  • the predict method can make predictions on new data;

  • the score method is used to predict on the test data and compare the predictions to the expected test labels to compute the accuracy.

from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42)
_ = model.fit(data_train, target_train)
data_test.head()
age workclass education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
7762 56 Private HS-grad 9 Divorced Other-service Unmarried White Female 0 0 40 United-States
23881 25 Private HS-grad 9 Married-civ-spouse Transport-moving Own-child Other Male 0 0 40 United-States
30507 43 Private Bachelors 13 Divorced Prof-specialty Not-in-family White Female 14344 0 40 United-States
28911 32 Private HS-grad 9 Married-civ-spouse Transport-moving Husband White Male 0 0 40 United-States
19484 39 Private Bachelors 13 Married-civ-spouse Sales Wife White Female 0 0 30 United-States
model.predict(data_test)[:5]
array([' <=50K', ' <=50K', ' >50K', ' <=50K', ' >50K'], dtype=object)
target_test[:5]
array([' <=50K', ' <=50K', ' >50K', ' <=50K', ' <=50K'], dtype=object)
model.score(data_test, target_test)
0.8577512079272787

This model can also be cross-validated as usual (instead of using a single train-test split):

scores = cross_val_score(model, data, target, cv=5)
scores
array([0.85116184, 0.8498311 , 0.84756347, 0.85268223, 0.85513923])
print(f"The accuracy is: {scores.mean():.3f} +- {scores.std():.3f}")
The accuracy is: 0.851 +- 0.003

The compound model has a higher predictive accuracy than the two models that used numerical and categorical variables in isolation.

Fitting a more powerful model

Linear models are very nice because they are usually very cheap to train, small to deploy, fast to predict and give a good baseline.

However, it is often useful to check whether more complex models such as an ensemble of decision trees can lead to higher predictive performance.

In the following cell we try a scalable implementation of the Gradient Boosting Machine algorithm. For this class of models, we know that, contrary to linear models, scaling the numerical features is unnecessary, and that it is both safe and significantly more computationally efficient to use an arbitrary integer encoding for the categorical variables, even if the ordering is arbitrary. Therefore we adapt the preprocessing pipeline as follows:

from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier

# For each categorical column, extract the list of all possible categories
# in some arbitrary order.
categories = [
    data[column].unique()
    for column in categorical_columns
]

preprocessor = ColumnTransformer([
    ('categorical', OrdinalEncoder(categories=categories),
     categorical_columns)], remainder="passthrough")

model = make_pipeline(preprocessor, HistGradientBoostingClassifier())
%%time
_ = model.fit(data_train, target_train)
CPU times: user 3.32 s, sys: 67.4 ms, total: 3.38 s
Wall time: 956 ms
model.score(data_test, target_test)
0.8789615920072066

We can observe that we get significantly higher accuracies with the Gradient Boosting model. This is often what we observe whenever the dataset has a large number of samples and a limited number of informative features (e.g. less than 1000) with a mix of numerical and categorical variables.

This explains why Gradient Boosting Machines are very popular among data science practitioners who work with tabular data.

Exercise 2:

  • Check that scaling the numerical features does not impact the speed or accuracy of HistGradientBoostingClassifier.

  • Check that one-hot encoding the categorical variables does not improve the accuracy of HistGradientBoostingClassifier but slows down the training.

Open the dedicated notebook to do this exercise.

In this notebook we have:

  • encoded categorical features with both an ordinal encoding and a one-hot encoding

  • used a pipeline to process both numerical and categorical features before fitting a logistic regression

  • seen that gradient boosting methods can outperform the basic linear approach