Working with both numerical & categorical variables¶
In this notebook, we will present:
typical ways to deal with categorical variables
how to train a predictive model on mixed types of data (i.e. numerical and categorical together)
Let’s first load the data as we did in the previous notebook.
import pandas as pd
df = pd.read_csv("../datasets/adult-census.csv")
target_name = "class"
target = df[target_name].to_numpy()
data = df.drop(columns=[target_name, "fnlwgt"])
Working with categorical variables¶
As we have seen in the previous section, a numerical variable is a continuous quantity represented by a real or integer number. These variables can be naturally handled by machine learning algorithms that are typically composed of a sequence of arithmetic instructions such as additions and multiplications.
In contrast, categorical variables have discrete values, typically represented by string labels taken from a finite list of possible choices. For instance, the variable native-country in our dataset is a categorical variable because it encodes the data using a finite list of possible countries (along with the ? symbol when this information is missing):
data["native-country"].value_counts()
United-States 43832
Mexico 951
? 857
Philippines 295
Germany 206
Puerto-Rico 184
Canada 182
El-Salvador 155
India 151
Cuba 138
England 127
China 122
South 115
Jamaica 106
Italy 105
Dominican-Republic 103
Japan 92
Guatemala 88
Poland 87
Vietnam 86
Columbia 85
Haiti 75
Portugal 67
Taiwan 65
Iran 59
Greece 49
Nicaragua 49
Peru 46
Ecuador 45
France 38
Ireland 37
Hong 30
Thailand 30
Cambodia 28
Trinadad&Tobago 27
Yugoslavia 23
Outlying-US(Guam-USVI-etc) 23
Laos 23
Scotland 21
Honduras 20
Hungary 19
Holand-Netherlands 1
Name: native-country, dtype: int64
In the remainder of this section, we will present different strategies to encode categorical data into numerical data which can be used by a machine-learning algorithm.
data.dtypes
age int64
workclass object
education object
education-num int64
marital-status object
occupation object
relationship object
race object
sex object
capital-gain int64
capital-loss int64
hours-per-week int64
native-country object
dtype: object
from sklearn.compose import make_column_selector as selector
categorical_columns_selector = selector(dtype_exclude=["int", "float"])
categorical_columns = categorical_columns_selector(data)
categorical_columns
['workclass',
'education',
'marital-status',
'occupation',
'relationship',
'race',
'sex',
'native-country']
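As a side note, the same selector utility also accepts a dtype_include argument, so the complementary list of numerical columns could be obtained in the same way, for instance:

# Complementary selection: keep only the integer/float columns.
numerical_columns_selector = selector(dtype_include=["int", "float"])
numerical_columns_selector(data)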
data_categorical = data[categorical_columns]
data_categorical.head()
workclass | education | marital-status | occupation | relationship | race | sex | native-country | |
---|---|---|---|---|---|---|---|---|
0 | Private | 11th | Never-married | Machine-op-inspct | Own-child | Black | Male | United-States |
1 | Private | HS-grad | Married-civ-spouse | Farming-fishing | Husband | White | Male | United-States |
2 | Local-gov | Assoc-acdm | Married-civ-spouse | Protective-serv | Husband | White | Male | United-States |
3 | Private | Some-college | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | United-States |
4 | ? | Some-college | Never-married | ? | Own-child | White | Female | United-States |
print(
f"The dataset is composed of {data_categorical.shape[1]} features"
)
The dataset is composed of 8 features
Encoding ordinal categories¶
The most intuitive strategy is to encode each category with a different number. The OrdinalEncoder will transform the data in such a manner.
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()
data_encoded = encoder.fit_transform(data_categorical)
data_encoded[:5]
array([[ 4., 1., 4., 7., 3., 2., 1., 39.],
[ 4., 11., 2., 5., 0., 4., 1., 39.],
[ 2., 7., 2., 11., 0., 4., 1., 39.],
[ 4., 15., 2., 7., 0., 2., 1., 39.],
[ 0., 15., 4., 0., 3., 4., 0., 39.]])
print(
    f"The encoded dataset contains {data_encoded.shape[1]} features")
The encoded dataset contains 8 features
We can see that the categories have been encoded for each feature (column) independently. We can also note that the number of features before and after the encoding is the same.
However, one has to be careful when using this encoding strategy. Using this integer representation can lead the downstream models to make the assumption that the categories are ordered: 0 is smaller than 1 which is smaller than 2, etc.
By default, OrdinalEncoder uses a lexicographical strategy to map string category labels to integers. This strategy is completely arbitrary and often meaningless. For instance, suppose the dataset has a categorical variable named “size” with categories such as “S”, “M”, “L”, “XL”. We would like the integer representation to respect the meaning of the sizes by mapping them to increasing integers such as 0, 1, 2, 3. However, the lexicographical strategy used by default would map the labels “S”, “M”, “L”, “XL” to 2, 1, 0, 3.
The OrdinalEncoder class accepts a “categories” constructor argument to pass in the correct ordering explicitly.
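As an illustration, here is a minimal sketch with a made-up “size” column (not a feature of our dataset) showing how to pass this ordering explicitly:

# Hypothetical "size" feature, used only to illustrate the "categories" argument.
sizes = pd.DataFrame({"size": ["S", "M", "L", "XL", "M"]})
size_encoder = OrdinalEncoder(categories=[["S", "M", "L", "XL"]])
size_encoder.fit_transform(sizes)
# "S" -> 0., "M" -> 1., "L" -> 2., "XL" -> 3., following the order we specified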
If a categorical variable does not carry any meaningful order information then this encoding might be misleading to downstream statistical models and you might consider using one-hot encoding instead (see below).
Note however that the impact of violating this ordering assumption is really dependent on the downstream models (for instance, linear models are much more sensitive than models built from an ensemble of decision trees).
Encoding nominal categories (without assuming any order)¶
OneHotEncoder is an alternative encoder that prevents the downstream models from making a false assumption about the ordering of categories. For a given feature, it will create as many new columns as there are possible categories. For a given sample, the value of the column corresponding to the category will be set to 1 while all the columns of the other categories will be set to 0.
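As a minimal illustration with a made-up toy column (not part of our dataset), a feature with three categories is expanded into three 0/1 columns, with a single 1 per row:

from sklearn.preprocessing import OneHotEncoder

# Hypothetical "city" column used only to illustrate the principle:
# one new column per category, a single 1 per sample.
city = pd.DataFrame({"city": ["London", "Paris", "London", "Rome"]})
OneHotEncoder(sparse=False).fit_transform(city)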
print(
f"The dataset is composed of {data_categorical.shape[1]} features"
)
data_categorical.head()
The dataset is composed of 8 features
workclass | education | marital-status | occupation | relationship | race | sex | native-country | |
---|---|---|---|---|---|---|---|---|
0 | Private | 11th | Never-married | Machine-op-inspct | Own-child | Black | Male | United-States |
1 | Private | HS-grad | Married-civ-spouse | Farming-fishing | Husband | White | Male | United-States |
2 | Local-gov | Assoc-acdm | Married-civ-spouse | Protective-serv | Husband | White | Male | United-States |
3 | Private | Some-college | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | United-States |
4 | ? | Some-college | Never-married | ? | Own-child | White | Female | United-States |
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse=False)
data_encoded = encoder.fit_transform(data_categorical)
data_encoded[:5]
array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 1., 0., 0.],
[0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1.,
0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 1., 0., 0.],
[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0.,
1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 1., 0., 0.]])
print(
    f"The encoded dataset contains {data_encoded.shape[1]} features")
The encoded dataset contains 102 features
Let’s wrap this numpy array in a dataframe with informative column names as provided by the encoder object:
columns_encoded = encoder.get_feature_names(data_categorical.columns)
pd.DataFrame(data_encoded, columns=columns_encoded).head()
workclass_ ? | workclass_ Federal-gov | workclass_ Local-gov | workclass_ Never-worked | workclass_ Private | workclass_ Self-emp-inc | workclass_ Self-emp-not-inc | workclass_ State-gov | workclass_ Without-pay | education_ 10th | ... | native-country_ Portugal | native-country_ Puerto-Rico | native-country_ Scotland | native-country_ South | native-country_ Taiwan | native-country_ Thailand | native-country_ Trinadad&Tobago | native-country_ United-States | native-country_ Vietnam | native-country_ Yugoslavia | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
5 rows × 102 columns
Look at how the “workclass” variable of the first 3 records has been encoded and compare this to the original string representation.
The number of features after the encoding is more than 10 times larger than in the original data because some variables such as occupation and native-country have many possible categories.
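We can double-check this by counting the number of unique categories per column; since OneHotEncoder creates one column per category by default, their sum should match the 102 encoded features reported above:

# One column per category: the sum of the unique category counts gives the
# number of one-hot encoded features.
data_categorical.nunique().sum()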
We can now integrate this encoder inside a machine learning pipeline like we did with numerical data: let’s train a linear classifier on the encoded data and check the performance of this machine learning pipeline using cross-validation.
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
model = make_pipeline(
OneHotEncoder(handle_unknown='ignore'),
LogisticRegression(max_iter=1000))
scores = cross_val_score(model, data_categorical, target)
scores
array([0.83222438, 0.83560242, 0.82872645, 0.83312858, 0.83466421])
print(f"The accuracy is: {scores.mean():.3f} +/- {scores.std():.3f}")
The accuracy is: 0.833 +/- 0.002
As you can see, this representation of the categorical variables of the data is slightly more predictive of the target (the income class) than the numerical variables that we used previously.
Exercise 1:¶
Try to fit a logistic regression model on categorical data transformed by the OrdinalEncoder instead. What do you observe?
Open the dedicated notebook to do this exercise.
Using numerical and categorical variables together¶
In the previous sections, we saw that we need to treat data differently depending on their nature (i.e. numerical or categorical).
Scikit-learn provides a ColumnTransformer class which will send specific columns to a specific transformer, making it easy to fit a single predictive model on a dataset that combines both kinds of variables together (heterogeneously typed tabular data).
We can first define the columns depending on their data type:
binary encoding will be applied to categorical columns with only two possible values (e.g. sex=male or sex=female in this example). Each binary categorical column will be mapped to one numerical column with 0 or 1 values.
one-hot encoding will be applied to categorical columns with more than two possible categories. This encoding will create one additional column for each possible categorical value.
numerical scaling will be applied to the numerical features, which will be standardized.
binary_encoding_columns = ['sex']
one_hot_encoding_columns = [
'workclass', 'education', 'marital-status', 'occupation',
'relationship', 'race', 'native-country']
scaling_columns = [
'age', 'education-num', 'hours-per-week', 'capital-gain',
'capital-loss']
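Listing column names by hand is error-prone; as an alternative sketch (assuming the dtypes and category counts inspected earlier), similar lists could be derived programmatically, which gives the same columns as above up to ordering:

# Alternative: derive the column lists from the data itself.
categorical_cols = selector(dtype_exclude=["int", "float"])(data)
binary_encoding_columns = [
    col for col in categorical_cols if data[col].nunique() == 2]
one_hot_encoding_columns = [
    col for col in categorical_cols if data[col].nunique() > 2]
scaling_columns = selector(dtype_include=["int", "float"])(data)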
We can now create our ColumnTransformer by specifying a list of triplets (preprocessor name, transformer, columns). Finally, we can define a pipeline to stack this “preprocessor” with our classifier (logistic regression).
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer([
('binary-encoder', OrdinalEncoder(), binary_encoding_columns),
('one-hot-encoder', OneHotEncoder(handle_unknown='ignore'),
one_hot_encoding_columns),
('standard-scaler', StandardScaler(), scaling_columns)])
model = make_pipeline(
preprocessor, LogisticRegression(max_iter=1000))
Starting from scikit-learn 0.23, the notebooks can display an interactive view of the pipelines.
from sklearn import set_config
set_config(display='diagram')
model
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('binary-encoder', OrdinalEncoder(), ['sex']), ('one-hot-encoder', OneHotEncoder(handle_unknown='ignore'), ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'native-country']), ('standard-scaler', StandardScaler(), ['age', 'education-num', 'hours-per-week', 'capital-gain', 'capital-loss'])])), ('logisticregression', LogisticRegression(max_iter=1000))])
The final model is more complex than the previous models but still follows the same API:
the fit method is called to preprocess the data then train the classifier;
the predict method can make predictions on new data;
the score method is used to predict on the test data and compare the predictions to the expected test labels to compute the accuracy.
from sklearn.model_selection import train_test_split
data_train, data_test, target_train, target_test = train_test_split(
data, target, random_state=42)
_ = model.fit(data_train, target_train)
data_test.head()
age | workclass | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7762 | 56 | Private | HS-grad | 9 | Divorced | Other-service | Unmarried | White | Female | 0 | 0 | 40 | United-States |
23881 | 25 | Private | HS-grad | 9 | Married-civ-spouse | Transport-moving | Own-child | Other | Male | 0 | 0 | 40 | United-States |
30507 | 43 | Private | Bachelors | 13 | Divorced | Prof-specialty | Not-in-family | White | Female | 14344 | 0 | 40 | United-States |
28911 | 32 | Private | HS-grad | 9 | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 0 | 40 | United-States |
19484 | 39 | Private | Bachelors | 13 | Married-civ-spouse | Sales | Wife | White | Female | 0 | 0 | 30 | United-States |
model.predict(data_test)[:5]
array([' <=50K', ' <=50K', ' >50K', ' <=50K', ' >50K'], dtype=object)
target_test[:5]
array([' <=50K', ' <=50K', ' >50K', ' <=50K', ' <=50K'], dtype=object)
model.score(data_test, target_test)
0.8577512079272787
This model can also be cross-validated as usual (instead of using a single train-test split):
scores = cross_val_score(model, data, target, cv=5)
scores
array([0.85116184, 0.8498311 , 0.84756347, 0.85268223, 0.85513923])
print(f"The accuracy is: {scores.mean():.3f} +- {scores.std():.3f}")
The accuracy is: 0.851 +- 0.003
The compound model has a higher predictive accuracy than the two models that used numerical and categorical variables in isolation.
Fitting a more powerful model¶
Linear models are very nice because they are usually very cheap to train, small to deploy, fast to predict and give a good baseline.
However, it is often useful to check whether more complex models such as an ensemble of decision trees can lead to higher predictive performance.
In the following cell we try a scalable implementation of the Gradient Boosting Machine algorithm. For this class of models, we know that, contrary to linear models, it is useless to scale the numerical features. Furthermore, it is both safe and significantly more computationally efficient to use an arbitrary integer encoding for the categorical variables, even if the ordering is arbitrary. Therefore we adapt the preprocessing pipeline as follows:
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
# For each categorical column, extract the list of all possible categories
# in some arbitrary order.
categories = [
data[column].unique()
for column in categorical_columns
]
preprocessor = ColumnTransformer([
('categorical', OrdinalEncoder(categories=categories),
categorical_columns)], remainder="passthrough")
model = make_pipeline(preprocessor, HistGradientBoostingClassifier())
%%time
_ = model.fit(data_train, target_train)
CPU times: user 3.32 s, sys: 67.4 ms, total: 3.38 s
Wall time: 956 ms
model.score(data_test, target_test)
0.8789615920072066
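As with the linear pipeline, this model could also be cross-validated instead of relying on a single train-test split (scores not reproduced here):

# Cross-validate the gradient boosting pipeline on the full dataset,
# mirroring what was done for the linear model above.
scores = cross_val_score(model, data, target, cv=5)
print(f"The accuracy is: {scores.mean():.3f} +- {scores.std():.3f}")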
We can observe that we get significantly higher accuracies with the Gradient Boosting model. This is often the case when the dataset has a large number of samples and a limited number of informative features (e.g. fewer than 1000) with a mix of numerical and categorical variables.
This explains why Gradient Boosting Machines are very popular among data science practitioners who work with tabular data.
Exercise 2:¶
Check that scaling the numerical features does not impact the speed or accuracy of HistGradientBoostingClassifier.
Check that one-hot encoding the categorical variables does not improve the accuracy of HistGradientBoostingClassifier but slows down the training.
Open the dedicated notebook to do this exercise.
In this notebook we have:
encoded categorical features with both an ordinal encoding and a one-hot encoding
used a pipeline to process both numerical and categorical features before fitting a logistic regression
seen that gradient boosting methods can outperform the basic linear approach