Evaluation of your predictive model
Introduction
Machine-learning models rely on optimizing an objective function by seeking its minimum or maximum. It is important to understand that this objective function is usually decoupled from the evaluation metric that we want to optimize in practice. The objective function serves as a proxy for the evaluation metric. FIXME: add information about a loss function depending of the notebooks presented before the notebook about metrics.
While other notebooks will give insights regarding algorithms and their associated objective functions, in this notebook we will focus on the metrics used to evaluate the performance of a predictive model.
Selecting an evaluation metric mainly depends on the model chosen to solve our data science problem.
Classification
We can recall that in a classification setting, the target y is categorical rather than continuous. We will use the blood transfusion dataset, fetched from OpenML.
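A minimal sketch of how this dataset could be fetched; the OpenML dataset name is an assumption, and the original notebook may load the data from a local file instead.

```python
from sklearn.datasets import fetch_openml

# Fetch the blood transfusion dataset from OpenML as pandas objects.
# Note: the class names on OpenML may differ from the 'donated' /
# 'not donated' labels used in the rest of this notebook.
blood_transfusion = fetch_openml(
    name="blood-transfusion-service-center", as_frame=True
)
data, target = blood_transfusion.data, blood_transfusion.target
```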
We can see that the target y contains 2 categories corresponding to whether or not a subject gave blood. We will use a logistic regression classifier to predict this outcome.
First, we split the data into a training and a testing set.
Once our data are split, we can learn a logistic regression classifier solely on the training data, keeping the testing data for the evaluation of the model.
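A sketch of these two steps, assuming the data and target variables defined above:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Keep a held-out test set for evaluation; the random_state is arbitrary.
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42
)

# Train the logistic regression classifier on the training data only.
classifier = LogisticRegression()
classifier.fit(data_train, target_train)
```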
Now that our classifier is trained, we can provide some information about a subject and the classifier can predict whether or not the subject will donate blood.
Let’s create a synthetic sample corresponding to the following potential new donor: he/she donated blood 6 months ago and gave blood twice in the past, for a total of 1000 c.c. He/she gave blood for the first time 20 months ago.
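A sketch of such a sample; the column names used below are assumptions and must match the actual columns of data:

```python
import pandas as pd

# A hypothetical new donor: last donation 6 months ago, 2 donations in total,
# 1000 c.c. given overall, first donation 20 months ago.
new_donor = pd.DataFrame(
    {"Recency": [6], "Frequency": [2], "Monetary": [1000], "Time": [20]}
)
classifier.predict(new_donor)
```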
With this information, our classifier predicted that this synthetic subject is more likely to not donate blood. However, we have no way of knowing whether this prediction is correct. That’s why we can now use the testing set for this purpose. First, we predict whether or not a subject will give blood with the help of the trained classifier.
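For instance, keeping the variable names assumed above:

```python
# Predict the label of every subject in the held-out test set.
target_predicted = classifier.predict(data_test)
target_predicted[:5]
```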
Accuracy as a baseline
Now that we have these predictions, we can compare them with the true labels (sometimes called the ground truth), which we have not used up to now.
In the comparison above, a True value means that the label predicted by our classifier is identical to the true label, while a False value means that our classifier made a mistake. One way to get an overall statistic telling us how well our classifier performs is to count the number of times it was right and divide it by the number of samples in our set (i.e. take the mean of correct predictions).
This measure is also known as the accuracy. Here, our classifier is 78% accurate at classifying whether a subject will give blood. scikit-learn provides a function to compute this metric in the module sklearn.metrics. Scikit-learn estimators also have a built-in method named score which computes the accuracy score by default.
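A minimal sketch showing that the three equivalent computations agree, with the variable names assumed above:

```python
from sklearn.metrics import accuracy_score

# Mean of correct predictions, computed manually.
manual_accuracy = (target_predicted == target_test).mean()

# Equivalent computation with the dedicated scikit-learn function.
function_accuracy = accuracy_score(target_test, target_predicted)

# The classifier's score method computes the accuracy by default.
method_accuracy = classifier.score(data_test, target_test)

print(manual_accuracy, function_accuracy, method_accuracy)
```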
Confusion matrix and derived metrics
The comparison that we did above and the accuracy that we computed did not take into account which type of error our classifier makes. Accuracy is an aggregate of the errors. However, we might be interested in a finer granularity, to know separately the errors for the two following cases (summarized by the confusion matrix shown below):
we predicted that a person will give blood but she/he did not;
we predicted that a person will not give blood but she/he did.
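The confusion matrix gives exactly this decomposition. A minimal sketch using scikit-learn's plotting helper (available in recent versions), with the variable names assumed above:

```python
from sklearn.metrics import ConfusionMatrixDisplay

# Rows correspond to the true labels and columns to the predicted labels.
ConfusionMatrixDisplay.from_estimator(classifier, data_test, target_test)
```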
The numbers on the diagonal correspond to predictions that agree with the true labels, while off-diagonal numbers correspond to misclassifications. In addition, we now know the type of correct or erroneous predictions the classifier made:
the top left corner is called true positive (TP) and corresponds to a person who gave blood and was predicted as such by the classifier;
the bottom right corner is called true negative (TN) and corresponds to a person who did not give blood and was predicted as such by the classifier;
the top right corner is called false negative (FN) and corresponds to a person who gave blood but was predicted as not giving blood;
the bottom left corner is called false positive (FP) and corresponds to a person who did not give blood but was predicted as giving blood.
Once we have separated this information, we can compute statistics highlighting the performance of our classifier in a particular setting. For instance, one could be interested in the fraction of people who really gave blood among those the classifier predicted as donors, or in the fraction of people predicted as giving blood among those who actually did so.
The former statistic is known as the precision, defined as TP / (TP + FP), while the latter is known as the recall, defined as TP / (TP + FN). We could, similarly to the accuracy, compute these values manually, but scikit-learn provides functions to compute these statistics.
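For instance, assuming the positive class is labeled 'donated' as in the rest of this notebook:

```python
from sklearn.metrics import precision_score, recall_score

# pos_label must match the label of the positive class in the target.
precision = precision_score(target_test, target_predicted, pos_label="donated")
recall = recall_score(target_test, target_predicted, pos_label="donated")
print(f"Precision: {precision:.3f}\nRecall: {recall:.3f}")
```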
These results are in line with what we could see in the confusion matrix. In the left column, more than half of the predictions were correct, leading to a precision above 0.5. However, our classifier mislabeled a lot of people who gave blood as "not donated", leading to a very low recall of around 0.1.
Precision and recall can be combined in a single score called the F1 score, which is the harmonic mean of precision and recall.
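A quick sketch, with the same assumed positive label:

```python
from sklearn.metrics import f1_score

# Harmonic mean of precision and recall for the positive class.
f1 = f1_score(target_test, target_predicted, pos_label="donated")
print(f"F1 score: {f1:.3f}")
```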
The issue of class imbalance
At this stage, we could ask ourselves a reasonable question. While the accuracy did not look bad (i.e. 77%), the F1 score is relatively low (i.e. 21%).
As we mentioned, precision and recall only focus on the positive class, while the accuracy takes both classes into account. In addition, we omitted to look at the ratio of class occurrences. We can check this ratio in the training set.
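A sketch, assuming the training target is stored in a pandas Series named target_train:

```python
# Fraction of each class in the training target.
target_train.value_counts(normalize=True)
```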
We can observe that the positive class 'donated' represents only 24% of the total number of instances. The good accuracy of our classifier is then linked to its ability to correctly predict the negative class 'not donated', which may or may not be relevant depending on the application. We can illustrate the issue using a dummy classifier as a baseline. This dummy classifier will always predict the negative class 'not donated'.
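A minimal sketch of such a baseline:

```python
from sklearn.dummy import DummyClassifier

# Always predict the majority class, which here is 'not donated'.
dummy_classifier = DummyClassifier(strategy="most_frequent")
dummy_classifier.fit(data_train, target_train)
dummy_classifier.score(data_test, target_test)
```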
We obtain an accuracy score of 76%. This means that this classifier, without learning anything from the data X, is capable of predicting about as accurately as our logistic regression. 76% is therefore the baseline that any classifier should outperform to be considered better than merely predicting the majority class.
The problem illustrated above is known as the class imbalance problem, for which accuracy should not be used. In this case, one should either use the precision, recall, or F1 score as presented above, or the balanced accuracy score instead of the accuracy.
The balanced accuracy is equivalent to the accuracy in the context of balanced classes. It is defined as the average recall obtained on each class.
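A sketch using the corresponding scikit-learn function:

```python
from sklearn.metrics import balanced_accuracy_score

# Average of the recall obtained on each class.
balanced_accuracy_score(target_test, target_predicted)
```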
Evaluation with different probability thresholds
All statistics that we presented up to now rely on classifier.predict, which provides the most likely label. However, we did not use the probability associated with this prediction, or in other words, how confident the classifier is about this prediction. By default, the prediction of a classifier corresponds to thresholding the probability at 0.5 in a binary classification problem. We can quickly check this relationship with the classifier that we trained.
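A sketch of this check, assuming the positive class is the last entry of classifier.classes_ (worth verifying on the actual classifier):

```python
import numpy as np

# Probability of the positive class for each test sample.
proba_positive = classifier.predict_proba(data_test)[:, -1]

# Thresholding these probabilities at 0.5 should reproduce predict's output.
manual_prediction = np.where(
    proba_positive > 0.5, classifier.classes_[-1], classifier.classes_[0]
)
np.all(manual_prediction == classifier.predict(data_test))
```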
The default decision threshold (0.5) might not lead to the optimal performance of our classifier. In this case, one can vary the decision threshold, and therefore the underlying predictions, and compute the same statistics as presented earlier. Usually, two metrics are computed and reported as a curve: each metric is plotted on one axis of the graph, and a point on the graph corresponds to a specific decision threshold. Let’s start by computing the precision-recall curve.
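A sketch using scikit-learn's plotting helper (available in recent versions), with the positive label assumed to be 'donated':

```python
from sklearn.metrics import PrecisionRecallDisplay

# Each point of the curve corresponds to a different decision threshold.
PrecisionRecallDisplay.from_estimator(
    classifier, data_test, target_test, pos_label="donated"
)
```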
On this curve, each blue dot corresponds to a certain probability level which we used as a decision threshold. We can see that by varying this decision threshold, we get different precision vs. recall trade-offs.
A perfect classifier is expected to have a precision of 1 for all recall values. A metric characterizing the curve is linked to the area under the curve (AUC) and is named the average precision. With an ideal classifier, the average precision would be 1.
While the precision and recall metrics focus on the positive class, one might be interested in the trade-off between correctly discriminating the positive class and correctly discriminating the negative class. The statistics used in this case are the sensitivity and specificity. Sensitivity is just another name for recall. However, specificity measures the proportion of correctly classified samples in the negative class, defined as TN / (TN + FP). Similarly to the precision-recall curve, sensitivity and specificity are reported with a curve called the receiver operating characteristic (ROC) curve. We show such a curve below:
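A sketch, again relying on scikit-learn's plotting helpers and the assumed positive label; plot_chance_level requires a recent scikit-learn version:

```python
from sklearn.metrics import RocCurveDisplay

# The chance level is the dashed diagonal of a classifier guessing at random.
RocCurveDisplay.from_estimator(
    classifier, data_test, target_test, pos_label="donated",
    plot_chance_level=True,
)
```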
This curve is built following the same principle as the precision-recall curve: we vary the probability threshold used to compute "hard" predictions and compute the metrics. As with the precision-recall curve, we can compute the area under the ROC curve (ROC-AUC) to characterize the performance of our classifier. However, it is important to observe that the lower bound of the ROC-AUC is 0.5. Indeed, we represented the performance of a dummy classifier (the green dashed line) to show that even the worst performance obtained will always lie above this line.
Link between confusion matrix, precision-recall curve and ROC curve
TODO: ipywidgets to play with interactive curve
Regression
Unlike in classification problems, the target y is a continuous variable in regression problems. Therefore, the classification metrics cannot be used to evaluate the performance of such a model. Instead, there exists a set of metrics dedicated to regression.
Our problem can be formulated as follows: we would like to infer the number of bike rentals from data related to the current day. The number of bike rentals can vary in the interval [0, infinity) (if the number of bikes available were infinite). As in the previous section, we will train a model and evaluate its performance by introducing the different regression metrics.
First, we split the data into a training and a testing set.
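A sketch of how such data could be obtained and split; the OpenML Bike_Sharing_Demand dataset used here is an assumption (the original notebook may load its own file):

```python
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Hourly bike rental counts plus weather and calendar features.
bike_sharing = fetch_openml(name="Bike_Sharing_Demand", as_frame=True)
data, target = bike_sharing.data, bike_sharing.target

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42
)
```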
Baseline model
We will use a random forest as a model. However, we first need to check the type of data that we are dealing with:
While some features are numeric, some have been tagged as category. These features need to be encoded in a proper way such that our random forest can deal with them. The simplest solution is to use an OrdinalEncoder. Regarding the numerical features, we don’t need to do anything. Thus, we will create a preprocessing step to take care of this encoding.
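A sketch of such a preprocessing step, encoding only the columns detected as categorical and passing the others through:

```python
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OrdinalEncoder

# Encode categorical columns with integers; keep numerical columns unchanged.
categorical_columns = make_column_selector(dtype_include="category")
preprocessor = make_column_transformer(
    (OrdinalEncoder(), categorical_columns),
    remainder="passthrough",
)
```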
To get some insight into the preprocessing, we manually preprocessed the training data and we can observe that the original strings were encoded with numbers. We can now create our model.
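A sketch of these two steps, reusing the preprocessor defined above:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline

# Peek at the encoded training data: categories are now integers.
preprocessor.fit_transform(data_train)[:5]

# Chain the preprocessing and the random forest in a single model.
regressor = make_pipeline(preprocessor, RandomForestRegressor(random_state=42))
regressor.fit(data_train, target_train)
```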
As for classifiers, regressors have a score method which computes by default the R² score (also known as the coefficient of determination):
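For instance, with the pipeline and split assumed above:

```python
# The score method of a regressor returns the R² on the given data.
regressor.score(data_test, target_test)
```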
The R² score represents the proportion of variance of the target explained by the independent variables in the model. The best possible score is 1 and there is no lower bound. However, a model that always predicts the expected value (i.e. the mean) of the target would get a score of 0.
The R² score gives insights regarding the goodness of fit of the model. However, this score cannot be compared from one dataset to another, and the value obtained does not have a meaningful interpretation in the original unit of the target. If we want such an interpretable score, we will be interested in the median or mean absolute error.
By computing the mean absolute error, we can interpret that our model predicts, on average, 507 bike rentals away from the truth. The mean can be impacted by large errors, while for some applications we would like to discard them; in this case we can opt for the median absolute error.
In this case, our model makes an error of 405 bikes.
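A sketch of both computations, with the variable names assumed above:

```python
from sklearn.metrics import mean_absolute_error, median_absolute_error

target_predicted = regressor.predict(data_test)

# Average absolute deviation between predictions and true rental counts.
mean_absolute_error(target_test, target_predicted)

# Median absolute deviation, less sensitive to a few very large errors.
median_absolute_error(target_test, target_predicted)
```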
Another common metric is the mean squared error, which squares the errors before averaging them and therefore penalizes large errors more strongly than the mean absolute error. Its square root, the root mean squared error, is expressed in the original unit of the target.
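A quick sketch:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(target_test, target_predicted)

# Taking the square root brings the error back to the unit of the target
# (a number of bike rentals).
rmse = np.sqrt(mse)
print(mse, rmse)
```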
In addition to metrics, we can visually represent the results by plotting the predicted values versus the true values.
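A minimal sketch of such a plot using matplotlib:

```python
import matplotlib.pyplot as plt

# Scatter the predictions against the true values and draw the identity line.
plt.scatter(target_test, target_predicted, alpha=0.3)
lims = [target_test.min(), target_test.max()]
plt.plot(lims, lims, "--", color="black", label="Perfect prediction")
plt.xlabel("True number of rentals")
plt.ylabel("Predicted number of rentals")
plt.legend()
```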
On this plot, perfect predictions would lie on the diagonal line. This plot allows us to detect whether the model has a specific regime where it does not work as expected, or whether it has some kind of bias.
Let’s take an example using the house prices in Ames.
We will fit a ridge regressor on the data and plot the prediction versus the actual values.
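A sketch of this experiment; the OpenML house_prices dataset and the restriction to numerical features are assumptions made to keep the example short:

```python
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Ames house prices from OpenML; keep only numerical features for simplicity.
ames = fetch_openml(name="house_prices", as_frame=True)
data_numeric = ames.data.select_dtypes("number")
target = ames.target

data_train, data_test, target_train, target_test = train_test_split(
    data_numeric, target, random_state=42
)

# Impute missing values before fitting the ridge regressor.
ridge = make_pipeline(SimpleImputer(), Ridge())
ridge.fit(data_train, target_train)
target_predicted = ridge.predict(data_test)
```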
On this plot, we see that for large "True values", our model tends to under-estimate the price of the house. Typically, this issue arises when the target to predict does not follow a normal distribution; in that case the model could benefit from an intermediate target transformation.
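One way to apply such a transformation is scikit-learn's TransformedTargetRegressor; the log transform below is an assumption of a sensible choice for prices:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Fit the same ridge pipeline on a log-transformed target and map the
# predictions back to the original scale.
ridge_log = TransformedTargetRegressor(
    regressor=make_pipeline(SimpleImputer(), Ridge()),
    func=np.log,
    inverse_func=np.exp,
)
ridge_log.fit(data_train, target_train)
target_predicted = ridge_log.predict(data_test)
```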
Once we transformed the target, we see that we corrected some of the predictions for the high values.
Summary
In this notebook, we presented the metrics and plots useful to evaluate models and get insights about them, focusing on both classification and regression problems.