Exercise 01¶

The goal of this exercise is to compare the performance of our classifier (81% accuracy) to some baseline classifiers that would ignore the input data and instead make constant predictions.

What would be the score of a model that always predicts ' >50K'?
What would be the score of a model that always predicts ' <= 50K'?
Is 81% or 82% accuracy a good score for this problem?

Use a DummyClassifier and do a train-test split to evaluate its accuracy on the test set. This link shows a few examples of how to evaluate the performance of these baseline models.

import pandas as pd

df = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = df[target_name].to_numpy()
data = df.drop(columns=[target_name, "fnlwgt"])

from sklearn.compose import make_column_selector as selector

numerical_columns_selector = selector(dtype_include=["int", "float"])
numerical_columns = numerical_columns_selector(data)
data_numeric = data[numerical_columns]

from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier

# TODO: write me!

Scikit-learn tutorial

Exercise 01¶