Solution for Exercise M7.02
We presented different classification metrics in the previous notebook. However, we did not use them with cross-validation. This exercise aims at practicing and implementing cross-validation.
We will reuse the blood transfusion dataset.
import pandas as pd
blood_transfusion = pd.read_csv("../datasets/blood_transfusion.csv")
data = blood_transfusion.drop(columns="Class")
target = blood_transfusion["Class"]
Note
If you want a deeper overview regarding this dataset, you can refer to the Appendix - Datasets description section at the end of this MOOC.
First, create a decision tree classifier.
# solution
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
Create a StratifiedKFold cross-validation object. Then use it inside the cross_val_score function to evaluate the decision tree. We will first use the accuracy as a score function. Explicitly use the scoring parameter of cross_val_score to compute the accuracy (even if this is the default score). Check its documentation to learn how to do that.
# solution
from sklearn.model_selection import cross_val_score, StratifiedKFold
cv = StratifiedKFold(n_splits=10)
scores = cross_val_score(tree, data, target, cv=cv, scoring="accuracy")
print(f"Accuracy score: {scores.mean():.3f} ± {scores.std():.3f}")
Accuracy score: 0.627 ± 0.139
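To make the role of StratifiedKFold more concrete, here is a minimal sketch on made-up labels (not the blood transfusion data) that checks the share of the "donated" class in each test fold. The 80/20 class split, the dummy feature matrix, and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced target: 80 "not donated" and 20 "donated" samples.
y = np.array(["not donated"] * 80 + ["donated"] * 20)
# The features play no role in how the folds are built, so zeros are enough.
X = np.zeros((len(y), 1))

cv = StratifiedKFold(n_splits=5)

fold_ratios = []
for train_idx, test_idx in cv.split(X, y):
    # Share of the "donated" class in the test part of this fold.
    fold_ratios.append(float(np.mean(y[test_idx] == "donated")))

# Stratification keeps the class proportion of the full target in each fold.
print(fold_ratios)
```

In this sketch every test fold holds the same 20% share of "donated" samples as the full target, which is the property that stratification is meant to provide.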
Repeat the experiment by computing the balanced_accuracy.
# solution
scores = cross_val_score(
    tree, data, target, cv=cv, scoring="balanced_accuracy"
)
print(f"Balanced accuracy score: {scores.mean():.3f} ± {scores.std():.3f}")
Balanced accuracy score: 0.500 ± 0.104
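One plausible reading of the gap between the two scores is class imbalance: accuracy rewards predicting the majority class, while balanced accuracy averages the recall obtained on each class. The sketch below illustrates this on a tiny made-up example (the labels and the "always predict the majority class" predictions are assumptions, not results from the dataset).

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Made-up ground truth: 8 "not donated" and 2 "donated" samples.
y_true = np.array(["not donated"] * 8 + ["donated"] * 2)
# A trivial model that always predicts the majority class.
y_pred = np.array(["not donated"] * 10)

accuracy = accuracy_score(y_true, y_pred)           # 8 correct out of 10
balanced = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recalls

# Recall computed separately for each class, in the order given by `labels`.
per_class_recall = recall_score(
    y_true, y_pred, average=None, labels=["donated", "not donated"]
)

print(f"Accuracy: {accuracy:.3f}")
print(f"Balanced accuracy: {balanced:.3f}")
print(f"Mean of per-class recalls: {per_class_recall.mean():.3f}")
```

Here the accuracy is 0.8 although the model never identifies a donor, whereas the balanced accuracy is 0.5, the same value as the mean of the two recalls (0 for "donated" and 1 for "not donated").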
We will now add a bit of complexity. We would like to compute the precision of our model. However, as we saw during the course, we need to specify the positive label, which in our case is the class donated.
We will show that scikit-learn does not support computing the precision without providing the positive label, because the choice is ambiguous.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
try:
    scores = cross_val_score(tree, data, target, cv=10, scoring="precision")
except ValueError as exc:
    print(exc)
/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:978: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 140, in __call__
score = scorer._score(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 380, in _score
y_pred = method_caller(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 90, in _cached_call
result, _ = _get_response_values(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/sklearn/utils/_response.py", line 207, in _get_response_values
raise ValueError(
ValueError: pos_label=1 is not a valid label: It should be one of ['donated' 'not donated']
warnings.warn(
Tip
We catch the exception with a try/except pattern to be able to print it.
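In the output above, the failure is reported as warnings and the score is set to nan rather than raised as an exception. As a hedged sketch, assuming a scikit-learn version where the error_score parameter of cross_val_score also applies to scoring failures, passing error_score="raise" should make the try/except pattern actually catch the error. The data below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Made-up data with string labels, mimicking the target of the exercise.
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 2))
y = np.array(["donated", "not donated"] * 30)

error_message = None
try:
    cross_val_score(
        DecisionTreeClassifier(random_state=0),
        X,
        y,
        cv=3,
        scoring="precision",
        # Assumption: ask scikit-learn to raise instead of warning and
        # storing nan when the scoring step fails.
        error_score="raise",
    )
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```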
The scoring fails because the default scorer has its positive label set to one (pos_label=1), which is not our case (our positive label is "donated").
In this case, we need to create a scorer using the scoring function and the helper function make_scorer.
So, import sklearn.metrics.make_scorer and sklearn.metrics.precision_score. Check their documentation for more information. Finally, create a scorer by calling make_scorer with the score function precision_score and pass the extra parameter pos_label="donated".
# solution
from sklearn.metrics import make_scorer, precision_score
precision = make_scorer(precision_score, pos_label="donated")
Now, instead of providing the string "precision" to the scoring parameter in the cross_val_score call, pass the scorer that you created above.
# solution
scores = cross_val_score(tree, data, target, cv=cv, scoring=precision)
print(f"Precision score: {scores.mean():.3f} ± {scores.std():.3f}")
Precision score: 0.254 ± 0.183
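To see what the pos_label parameter changes, the following sketch compares a manual computation of the precision, TP / (TP + FP) for the class "donated", with precision_score and with a scorer built by make_scorer. The labels, predictions, and the constant DummyClassifier are made-up assumptions used only for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer, precision_score

# Made-up true labels and predictions.
y_true = np.array(
    ["donated", "donated", "not donated", "not donated", "not donated", "donated"]
)
y_pred = np.array(
    ["donated", "not donated", "donated", "not donated", "not donated", "donated"]
)

# Manual precision for the positive class "donated".
tp = np.sum((y_true == "donated") & (y_pred == "donated"))
fp = np.sum((y_true == "not donated") & (y_pred == "donated"))
manual_precision = tp / (tp + fp)

# Same quantity computed by scikit-learn with an explicit positive label.
sklearn_precision = precision_score(y_true, y_pred, pos_label="donated")

# A scorer is called as scorer(estimator, X, y): it predicts, then scores.
X = np.zeros((len(y_true), 1))
always_donated = DummyClassifier(strategy="constant", constant="donated")
always_donated.fit(X, y_true)
precision_scorer = make_scorer(precision_score, pos_label="donated")
scorer_value = precision_scorer(always_donated, X, y_true)

print(f"Manual: {manual_precision:.3f}, precision_score: {sklearn_precision:.3f}")
print(f"Scorer on a model always predicting 'donated': {scorer_value:.3f}")
```

The last value equals the share of "donated" samples in y_true, since a model that always predicts the positive class is right exactly as often as that class occurs.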
cross_val_score only computes a single score, given by the scoring parameter. The function cross_validate allows the computation of multiple scores by passing a list of strings or scorers to the scoring parameter, which can be handy.
Import sklearn.model_selection.cross_validate and compute the accuracy and balanced accuracy through cross-validation. Plot the cross-validation scores for both metrics using a box plot.
# solution
from sklearn.model_selection import cross_validate
scoring = ["accuracy", "balanced_accuracy"]
scores = cross_validate(tree, data, target, cv=cv, scoring=scoring)
scores
{'fit_time': array([0.00284076, 0.00273967, 0.00280452, 0.00271297, 0.00278354,
0.00270128, 0.00266266, 0.00271749, 0.0027082 , 0.00275779]),
'score_time': array([0.00274086, 0.00279593, 0.00257397, 0.00259328, 0.00260496,
0.00259137, 0.0025537 , 0.00255203, 0.00255942, 0.00260353]),
'test_accuracy': array([0.30666667, 0.50666667, 0.77333333, 0.56 , 0.65333333,
0.64 , 0.72 , 0.77333333, 0.64864865, 0.75675676]),
'test_balanced_accuracy': array([0.42982456, 0.46637427, 0.64181287, 0.40643275, 0.48684211,
0.42105263, 0.5877193 , 0.73684211, 0.4623323 , 0.51186791])}
import pandas as pd
color = {"whiskers": "black", "medians": "black", "caps": "black"}
metrics = pd.DataFrame(
    [scores["test_accuracy"], scores["test_balanced_accuracy"]],
    index=["Accuracy", "Balanced accuracy"],
).T
import matplotlib.pyplot as plt
metrics.plot.box(vert=False, color=color)
_ = plt.title("Computation of multiple scores using cross_validate")
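As a possible extension, cross_validate also accepts a dictionary for scoring, which makes it possible to mix metric names with a custom scorer such as the precision scorer defined earlier. The sketch below runs on synthetic data generated with make_classification; the data, the dictionary keys, and the zero_division=0 setting are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data with string labels (not the real dataset).
X, y_int = make_classification(n_samples=200, weights=[0.75, 0.25], random_state=0)
y = np.where(y_int == 1, "donated", "not donated")

# Keys are arbitrary names; values are metric names or scorer objects.
scoring = {
    "accuracy": "accuracy",
    "balanced_accuracy": "balanced_accuracy",
    "precision": make_scorer(
        precision_score, pos_label="donated", zero_division=0
    ),
}

results = cross_validate(
    DecisionTreeClassifier(random_state=0),
    X,
    y,
    cv=StratifiedKFold(n_splits=5),
    scoring=scoring,
)

# Each metric is stored under the key "test_<name>".
for name in scoring:
    fold_scores = results[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.3f} ± {fold_scores.std():.3f}")
```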