Working with both numerical & categorical variables

In this notebook, we will present:

  • typical ways to deal with categorical variables

  • how to train a predictive model on mixed types of data (i.e. numerical and categorical together)

Let’s first load the data as we did in the previous notebook.

import pandas as pd

df = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = df[target_name].to_numpy()

data = df.drop(columns=[target_name, "fnlwgt"])

Working with categorical variables

As we have seen in the previous section, a numerical variable is a continuous quantity represented by a real or integer number. These variables can be naturally handled by machine learning algorithms that are typically composed of a sequence of arithmetic instructions such as additions and multiplications.

In contrast, categorical variables have discrete values, typically represented by string labels taken from a finite list of possible choices. For instance, the variable native-country in our dataset is a categorical variable because it encodes the data using a finite list of possible countries (along with the ? symbol when this information is missing):

data["native-country"].value_counts()
 United-States                 43832
 Mexico                          951
 ?                               857
 Philippines                     295
 Germany                         206
 Puerto-Rico                     184
 Canada                          182
 El-Salvador                     155
 India                           151
 Cuba                            138
 England                         127
 China                           122
 South                           115
 Jamaica                         106
 Italy                           105
 Dominican-Republic              103
 Japan                            92
 Guatemala                        88
 Poland                           87
 Vietnam                          86
 Columbia                         85
 Haiti                            75
 Portugal                         67
 Taiwan                           65
 Iran                             59
 Greece                           49
 Nicaragua                        49
 Peru                             46
 Ecuador                          45
 France                           38
 Ireland                          37
 Hong                             30
 Thailand                         30
 Cambodia                         28
 Trinadad&Tobago                  27
 Yugoslavia                       23
 Outlying-US(Guam-USVI-etc)       23
 Laos                             23
 Scotland                         21
 Honduras                         20
 Hungary                          19
 Holand-Netherlands                1
Name: native-country, dtype: int64

In the remainder of this section, we will present different strategies to encode categorical data into numerical data which can be used by a machine-learning algorithm.

data.dtypes
age                int64
workclass         object
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
dtype: object
from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_exclude=["int", "float"])
categorical_columns = categorical_columns_selector(data)
categorical_columns
['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native-country']
data_categorical = data[categorical_columns]
data_categorical.head()
workclass education marital-status occupation relationship race sex native-country
0 Private 11th Never-married Machine-op-inspct Own-child Black Male United-States
1 Private HS-grad Married-civ-spouse Farming-fishing Husband White Male United-States
2 Local-gov Assoc-acdm Married-civ-spouse Protective-serv Husband White Male United-States
3 Private Some-college Married-civ-spouse Machine-op-inspct Husband Black Male United-States
4 ? Some-college Never-married ? Own-child White Female United-States
print(
    f"The dataset is composed of {data_categorical.shape[1]} features"
)
The dataset is composed of 8 features

Encoding ordinal categories

The most intuitive strategy is to encode each category with a different number. The OrdinalEncoder will transform the data in such a manner.

from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()
data_encoded = encoder.fit_transform(data_categorical)
data_encoded[:5]
array([[ 4.,  1.,  4.,  7.,  3.,  2.,  1., 39.],
       [ 4., 11.,  2.,  5.,  0.,  4.,  1., 39.],
       [ 2.,  7.,  2., 11.,  0.,  4.,  1., 39.],
       [ 4., 15.,  2.,  7.,  0.,  2.,  1., 39.],
       [ 0., 15.,  4.,  0.,  3.,  4.,  0., 39.]])
print(
    f"The dataset encoded contains {data_encoded.shape[1]} features")
The dataset encoded contains 8 features

We can see that the categories have been encoded for each feature (column) independently. We can also note that the number of features before and after the encoding is the same.

However, one has to be careful when using this encoding strategy. Using this integer representation can lead the downstream models to make the assumption that the categories are ordered: 0 is smaller than 1 which is smaller than 2, etc.

By default, OrdinalEncoder uses a lexicographical strategy to map string category labels to integers. This strategy is completely arbitrary and is often meaningless. For instance, suppose the dataset has a categorical variable named “size” with categories such as “S”, “M”, “L”, “XL”. We would like the integer representation to respect the meaning of the sizes by mapping them to increasing integers such as 0, 1, 2, 3. However, the lexicographical strategy used by default would map the labels “S”, “M”, “L”, “XL” to 2, 1, 0, 3.

The OrdinalEncoder class accepts a “categories” constructor argument to pass in the correct ordering explicitly.
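For illustration, here is a minimal sketch of passing the ordering explicitly, using a hypothetical “size” column that is not part of our dataset:

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

# Hypothetical toy column with an intrinsic order.
sizes = pd.DataFrame({"size": ["S", "XL", "M", "L"]})

# Passing the categories explicitly maps "S", "M", "L", "XL" to 0, 1, 2, 3
# instead of the default lexicographical order.
size_encoder = OrdinalEncoder(categories=[["S", "M", "L", "XL"]])
size_encoder.fit_transform(sizes)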

If a categorical variable does not carry any meaningful order information then this encoding might be misleading to downstream statistical models and you might consider using one-hot encoding instead (see below).

Note however that the impact of violating this ordering assumption is really dependent on the downstream models (for instance linear models are much more sensitive than models built from an ensemble of decision trees).

Encoding nominal categories (without assuming any order)

OneHotEncoder is an alternative encoder that prevents the downstream models from making a false assumption about the ordering of categories. For a given feature, it will create as many new columns as there are possible categories. For a given sample, the value of the column corresponding to its category will be set to 1 while all the columns of the other categories will be set to 0.
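To make this concrete, here is a small sketch on a toy column (hypothetical data, not from our dataset):

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Hypothetical toy column with three categories.
cities = pd.DataFrame({"city": ["London", "Paris", "London", "Tokyo"]})

toy_encoder = OneHotEncoder(sparse=False)
toy_encoder.fit_transform(cities)
# Expected result: one 0/1 column per category (London, Paris, Tokyo):
# array([[1., 0., 0.],
#        [0., 1., 0.],
#        [1., 0., 0.],
#        [0., 0., 1.]])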

print(
    f"The dataset is composed of {data_categorical.shape[1]} features"
)
data_categorical.head()
The dataset is composed of 8 features
workclass education marital-status occupation relationship race sex native-country
0 Private 11th Never-married Machine-op-inspct Own-child Black Male United-States
1 Private HS-grad Married-civ-spouse Farming-fishing Husband White Male United-States
2 Local-gov Assoc-acdm Married-civ-spouse Protective-serv Husband White Male United-States
3 Private Some-college Married-civ-spouse Machine-op-inspct Husband Black Male United-States
4 ? Some-college Never-married ? Own-child White Female United-States
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse=False)
data_encoded = encoder.fit_transform(data_categorical)
data_encoded[:5]
array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0.,
        1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0.]])
print(
    f"The dataset encoded contains {data_encoded.shape[1]} features")
The dataset encoded contains 102 features

Let’s wrap this numpy array in a dataframe with informative column names as provided by the encoder object:

columns_encoded = encoder.get_feature_names(data_categorical.columns)
pd.DataFrame(data_encoded, columns=columns_encoded).head()
workclass_ ? workclass_ Federal-gov workclass_ Local-gov workclass_ Never-worked workclass_ Private workclass_ Self-emp-inc workclass_ Self-emp-not-inc workclass_ State-gov workclass_ Without-pay education_ 10th ... native-country_ Portugal native-country_ Puerto-Rico native-country_ Scotland native-country_ South native-country_ Taiwan native-country_ Thailand native-country_ Trinadad&Tobago native-country_ United-States native-country_ Vietnam native-country_ Yugoslavia
0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
1 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
3 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
4 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

5 rows × 102 columns

Look at how the “workclass” variable of the first 3 records has been encoded and compare this to the original string representation.
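One way to make this comparison (a small sketch reusing the encoded array and the column names computed in the previous cell) is to keep only the columns derived from "workclass":

workclass_columns = [
    name for name in columns_encoded if name.startswith("workclass")]
pd.DataFrame(data_encoded, columns=columns_encoded)[workclass_columns].head(3)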

The number of features after the encoding is more than 10 times larger than in the original data because some variables such as occupation and native-country have many possible categories.
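One quick way to verify this is to count the number of unique categories per column; their sum corresponds to the number of one-hot encoded features:

data_categorical.nunique()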

We can now integrate this encoder inside a machine learning pipeline like we did with numerical data: let’s train a linear classifier on the encoded data and check the performance of this machine learning pipeline using cross-validation.

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

model = make_pipeline(
    OneHotEncoder(handle_unknown='ignore'),
    LogisticRegression(max_iter=1000))
scores = cross_val_score(model, data_categorical, target)
scores
array([0.83222438, 0.83560242, 0.82872645, 0.83312858, 0.83466421])
print(f"The accuracy is: {scores.mean():.3f} +/- {scores.std():.3f}")
The accuracy is: 0.833 +/- 0.002

As you can see, this representation of the categorical variables of the data is slightly more predictive of the revenue than the numerical variables that we used previously.
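If you want to reproduce a numerical-only baseline for comparison, a sketch could look like the following (assuming, as in the previous notebook, a scaled logistic regression on the numerical columns):

from sklearn.preprocessing import StandardScaler

# Select the numerical columns (the complement of the categorical ones).
numerical_columns_selector = selector(dtype_include=["int", "float"])
numerical_columns = numerical_columns_selector(data)

numerical_model = make_pipeline(
    StandardScaler(), LogisticRegression(max_iter=1000))
numerical_scores = cross_val_score(
    numerical_model, data[numerical_columns], target)
print(f"Numerical-only accuracy: {numerical_scores.mean():.3f}")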

Exercise 1:

  • Try to fit a logistic regression model on categorical data transformed by the OrdinalEncoder instead. What do you observe?

Open the dedicated notebook to do this exercise.

Using numerical and categorical variables together

In the previous sections, we saw that we need to treat data differently depending on their nature (i.e. numerical or categorical).

Scikit-learn provides a ColumnTransformer class which will send specific columns to a specific transformer, making it easy to fit a single predictive model on a dataset that combines both kinds of variables together (heterogeneously typed tabular data).

We can first define the columns depending on their data type:

  • binary encoding will be applied to categorical columns with only two possible values (e.g. sex=male or sex=female in this example). Each binary categorical column will be mapped to one numerical column with 0 or 1 values.

  • one-hot encoding will be applied to categorical columns with more than two possible categories. This encoding will create one additional column for each possible categorical value.

  • numerical scaling will be applied to the numerical columns, which will be standardized.

binary_encoding_columns = ['sex']

one_hot_encoding_columns = [
    'workclass', 'education', 'marital-status', 'occupation',
    'relationship', 'race', 'native-country']

scaling_columns = [
    'age', 'education-num', 'hours-per-week', 'capital-gain',
    'capital-loss']

We can now create our ColumnTransformer by specifying a list of triplets (preprocessor name, transformer, columns). Finally, we can define a pipeline to stack this “preprocessor” with our classifier (logistic regression).

from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ('binary-encoder', OrdinalEncoder(), binary_encoding_columns),
    ('one-hot-encoder', OneHotEncoder(handle_unknown='ignore'),
     one_hot_encoding_columns),
    ('standard-scaler', StandardScaler(), scaling_columns)])
model = make_pipeline(
    preprocessor, LogisticRegression(max_iter=1000))

Starting from scikit-learn 0.23, the notebooks can display an interactive view of the pipelines.

from sklearn import set_config
set_config(display='diagram')

model
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('binary-encoder',
                                                  OrdinalEncoder(), ['sex']),
                                                 ('one-hot-encoder',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['workclass', 'education',
                                                   'marital-status',
                                                   'occupation', 'relationship',
                                                   'race', 'native-country']),
                                                 ('standard-scaler',
                                                  StandardScaler(),
                                                  ['age', 'education-num',
                                                   'hours-per-week',
                                                   'capital-gain',
                                                   'capital-loss'])])),
                ('logisticregression', LogisticRegression(max_iter=1000))])

The final model is more complex than the previous models but still follows the same API:

  • the fit method is called to preprocess the data then train the classifier;

  • the predict method can make predictions on new data;

  • the score method is used to predict on the test data and compare the predictions to the expected test labels to compute the accuracy.

from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42)
_ = model.fit(data_train, target_train)
data_test.head()
age workclass education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
7762 56 Private HS-grad 9 Divorced Other-service Unmarried White Female 0 0 40 United-States
23881 25 Private HS-grad 9 Married-civ-spouse Transport-moving Own-child Other Male 0 0 40 United-States
30507 43 Private Bachelors 13 Divorced Prof-specialty Not-in-family White Female 14344 0 40 United-States
28911 32 Private HS-grad 9 Married-civ-spouse Transport-moving Husband White Male 0 0 40 United-States
19484 39 Private Bachelors 13 Married-civ-spouse Sales Wife White Female 0 0 30 United-States
model.predict(data_test)[:5]
array([' <=50K', ' <=50K', ' >50K', ' <=50K', ' >50K'], dtype=object)
target_test[:5]
array([' <=50K', ' <=50K', ' >50K', ' <=50K', ' <=50K'], dtype=object)
model.score(data_test, target_test)
0.8577512079272787

This model can also be cross-validated as usual (instead of using a single train-test split):

scores = cross_val_score(model, data, target, cv=5)
scores
array([0.85116184, 0.8498311 , 0.84756347, 0.85268223, 0.85513923])
print(f"The accuracy is: {scores.mean():.3f} +- {scores.std():.3f}")
The accuracy is: 0.851 +- 0.003

The compound model has a higher predictive accuracy than the two models that used numerical and categorical variables in isolation.

Fitting a more powerful model

Linear models are very nice because they are usually very cheap to train, small to deploy, fast to predict and give a good baseline.

However, it is often useful to check whether more complex models such as an ensemble of decision trees can lead to higher predictive performance.

In the following cell we try a scalable implementation of the Gradient Boosting Machine algorithm. For this class of models, we know that, contrary to linear models, scaling the numerical features is unnecessary, and that it is both safe and significantly more computationally efficient to use an arbitrary integer encoding for the categorical variables, even if the ordering is arbitrary. Therefore we adapt the preprocessing pipeline as follows:

from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier

# For each categorical column, extract the list of all possible categories
# in some arbitrary order.
categories = [
    data[column].unique()
    for column in categorical_columns
]

preprocessor = ColumnTransformer([
    ('categorical', OrdinalEncoder(categories=categories),
     categorical_columns)], remainder="passthrough")

model = make_pipeline(preprocessor, HistGradientBoostingClassifier())
%%time
_ = model.fit(data_train, target_train)
CPU times: user 3.32 s, sys: 67.4 ms, total: 3.38 s
Wall time: 956 ms
model.score(data_test, target_test)
0.8789615920072066

We can observe that we get significantly higher accuracies with the Gradient Boosting model. This is often what we observe whenever the dataset has a large number of samples and a limited number of informative features (e.g. less than 1000) with a mix of numerical and categorical variables.

This explains why Gradient Boosting Machines are very popular among data science practitioners who work with tabular data.

Exercise 2:

  • Check that scaling the numerical features does not impact the speed or accuracy of HistGradientBoostingClassifier.

  • Check that one-hot encoding the categorical variables does not improve the accuracy of HistGradientBoostingClassifier but slows down the training.

Open the dedicated notebook to do this exercise.

In this notebook we have:

  • encoded categorical features with both an ordinal encoding and a one-hot encoding

  • used a pipeline to process both numerical and categorical features before fitting a logistic regression

  • seen that gradient boosting methods can outperform the basic linear approach