Solution for Exercise 01¶

The goal of this exercise is to compare the performance of our classifier (81% accuracy) to some baseline classifiers that would ignore the input data and instead make constant predictions.

What would be the score of a model that always predicts ' >50K'?
What would be the score of a model that always predicts ' <= 50K'?
Is 81% or 82% accuracy a good score for this problem?

Use a DummyClassifier and do a train-test split to evaluate its accuracy on the test set. This link shows a few examples of how to evaluate the performance of these baseline models.

import pandas as pd

df = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = df[target_name].to_numpy()
data = df.drop(columns=[target_name, "fnlwgt"])

from sklearn.compose import make_column_selector as selector

numerical_columns_selector = selector(dtype_include=["int", "float"])
numerical_columns = numerical_columns_selector(data)
data_numeric = data[numerical_columns]

from sklearn.model_selection import train_test_split

data_numeric_train, data_numeric_test, target_train, target_test = \
    train_test_split(data_numeric, target, random_state=0)

from sklearn.dummy import DummyClassifier

high_revenue_clf = DummyClassifier(strategy="constant",
                                   constant=" >50K")
high_revenue_clf.fit(data_numeric_train, target_train)
score = high_revenue_clf.score(data_numeric_test, target_test)
print(f"{score:.3f}")

0.241

low_revenue_clf = DummyClassifier(strategy="constant",
                                  constant=" <=50K")
low_revenue_clf.fit(data_numeric_train, target_train)
score = low_revenue_clf.score(data_numeric_test, target_test)
print(f"{score:.3f}")

0.759

most_freq_revenue_clf = DummyClassifier(strategy="most_frequent")
most_freq_revenue_clf.fit(data_numeric_train, target_train)
score = most_freq_revenue_clf.score(data_numeric_test, target_test)
print(f"{score:.3f}")

0.759

So 81% accuracy is significantly better than 76% which is the score of a baseline model that would always predict the most frequent class which is the low revenue class: " <=50K".

In this dataset, we can see that the target classes are imbalanced: almost 3/4 of the records are people with a revenue below 50K:

df["class"].value_counts()

 <=50K    37155
 >50K     11687
Name: class, dtype: int64

(target == " <=50K").mean()

0.7607182343065395

Scikit-learn tutorial

Solution for Exercise 01¶