The intuition behind R2 and other regression evaluation metrics

Ritesh Agrawal
May 6, 2017

There are many metrics for evaluating a regression model, but they often seem cryptic. Below is an attempt to build intuition for two commonly used metrics: mean/median absolute error and R2 (also known as the coefficient of determination).

Average Accuracy of the Model (Mean/Median Absolute Error)

Let’s assume you have a model that can predict house prices. Naturally, you won’t trust it unless you evaluate it and establish some confidence in its expected error. So you feed in the features (such as the number of rooms, lot size, etc.) of a certain house and compare the predicted price (say 130K) to the actual price (say 120K). In this particular case, we can say that the model overestimated the price by 10K. But a single point is not sufficient to make a general claim about the expected error of the model. So we feed in features for another 1000 houses and compute the error for each of them, i.e. the difference between the predicted and actual price.

From descriptive statistics, we know that there are different ways to summarize these 1000 error points. For instance, we can summarize the general tendency of the dataset by mean or median or even draw a boxplot to understand the distribution of error.

Since we are interested in a numerical measure (rather than a visualization), using the mean to summarize all the observed errors makes sense. Thus we can compute the mean error.

However, there is a problem. What if the error is -10K (i.e. an under-estimate) for one house and 10K (i.e. an over-estimate) for another? Then the mean error will be 0. Intuitively this doesn’t make sense. It makes more sense to say that the expected error is 10K, i.e. we should operate on the absolute error rather than the signed (under/over-estimate) error. Thus we have all the components of our first metric, Mean Absolute Error. To summarize, it’s called mean absolute error because:

  1. Error: because we are comparing actual house price to predicted house price
  2. Absolute: because we only consider the magnitude of the error, not whether the model under-predicted or over-predicted
  3. Mean: because we are using “mean” as a way to describe the central tendency of the observed error.
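The cancellation problem described above can be seen in a couple of lines of Python (the ±10K errors are the ones from the two-house example):

```python
# Signed errors for two houses: one under-estimated, one over-estimated by 10K.
errors = [-10_000, 10_000]  # predicted price - actual price

mean_error = sum(errors) / len(errors)
mean_abs_error = sum(abs(e) for e in errors) / len(errors)

print(mean_error)      # 0.0 -- the signed errors cancel out
print(mean_abs_error)  # 10000.0 -- matches the intuitive "off by 10K"
```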

Now, we know that the mean is sensitive to outliers. So sometimes, instead of the mean, we use the median, and the metric is known as median absolute error. The advantage of mean/median absolute error is that the number is easy to interpret. For instance, if the mean absolute error of a model is 20K, then when the predicted price is 200K the actual price is most likely between 180K and 220K.
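Here is a small sketch of both metrics on five hypothetical houses (the prices are made up for illustration; scikit-learn ships equivalent `mean_absolute_error` and `median_absolute_error` functions in `sklearn.metrics`):

```python
import statistics

# Hypothetical actual vs. predicted prices for five houses.
actual    = [120_000, 250_000, 310_000,  95_000, 700_000]
predicted = [130_000, 240_000, 330_000,  90_000, 500_000]  # last one is a big miss

abs_errors = [abs(p - a) for p, a in zip(predicted, actual)]

mae   = statistics.mean(abs_errors)    # pulled up by the single 200K miss
medae = statistics.median(abs_errors)  # barely affected by it

print(mae)    # 49000
print(medae)  # 10000
```

The gap between the two numbers is itself informative: a MAE far above the MedAE hints that a few large misses dominate the average.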

Can the model be better? (R2)

Data scientists are not only concerned with quantifying the error but are also interested in determining if the model can be improved. To answer this question let’s first establish the best and the worst models.

Best Model
Theoretically, the best model is a model for which the absolute error is zero for all the test cases. As shown in the graph below, if we draw absolute error on the x-axis and cumulative percentage of houses on the y-axis then a point say (50K, 0.6) indicates that for 60% of houses the absolute error is less than or equal to 50K.

So, given this graph, what will the best model look like?
Since the absolute error is always zero, the graph will simply be a vertical line starting from 0 on the x-axis and extending to 100% on the y-axis.

Worst Model
Don’t confuse the word “worst” with the word “dumb”. Typically, for building a regression model we have a target variable (house price) and certain features or predictor variables, such as the number of rooms, lot size, etc. But what if no features are available? For instance, the only information provided is the house price for 10K randomly selected houses. We can still build a model from this limited information: we can compute the mean house price over the 10K training samples, and our model will simply return this mean value. Let’s say the mean value is 215K. If we ask this model for the price of a house with a lot size of 5000 sq ft, it will simply return 215K. Let’s call this the mean model.

Theoretically, it can be shown that when no other information is available, the mean model minimizes the squared error (the mean is the constant that minimizes the sum of squared errors, which is why it serves as the baseline for R2). Intuitively this makes sense, as we often fall back on the mean value when we have no other information. The graph below indicates what the curve for the mean model looks like.
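A minimal sketch of the mean model (the training prices are made up; scikit-learn's `DummyRegressor(strategy="mean")` implements the same idea):

```python
import statistics

# Hypothetical training prices for the "no features available" scenario.
train_prices = [150_000, 200_000, 215_000, 260_000, 300_000]
mean_price = statistics.mean(train_prices)

def mean_model(features):
    # Ignores the features entirely and always returns the training mean.
    return mean_price

print(mean_model({"lot_size": 5000}))  # 225000

# Sanity check: among constant predictions, the mean minimizes squared error.
def sse(constant):
    return sum((p - constant) ** 2 for p in train_prices)

assert all(sse(mean_price) <= sse(c) for c in train_prices)
```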

Determining scope for improvement
From the above graph, we can easily observe a few things. First, as our model becomes better, it moves towards the best model, and hence the area between the best model and our model decreases. On the other hand, the area between the worst model and our model increases. However, the total area, i.e. the area between the best and the worst model, remains the same. Let’s call this area the improvement opportunity. As our model gets better, it covers more of this improvement-opportunity area. This is exactly what the R2 metric captures. It indicates what portion of the total improvement opportunity our model covers, i.e.

R2 = (area between the mean model and our model) / (area between the mean model and the best model)

Once we understand the above intuition, it’s also easy to understand why there is often confusion about whether R2 ranges from 0 to 1 (as mentioned on Wikipedia) or can be negative (as in the sklearn library, whose r2_score has no lower bound and can even drop below -1). If we go by the area ratio above, R2 will always be positive and between 0 and 1. However, this doesn’t tell us where our model is in comparison to the mean model; it implicitly assumes that our model is always better than the mean model and hence falls between the mean model and the best model.

But in practice, it is possible that our model is worse than the mean model, so that its curve falls on the right side of the mean model’s curve. In that case, the area between the best model and our model will be bigger than the area between the best model and the mean model, and hence R2 will be negative.
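The sign behavior is easy to verify with the standard formula R2 = 1 - SS_res/SS_tot, here as a pure-Python sketch on made-up numbers. Note that a sufficiently bad model can push R2 well below -1:

```python
def r2(actual, predicted):
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # our model's error
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # mean model's error
    return 1 - ss_res / ss_tot

actual = [100, 200, 300, 400]

print(r2(actual, [110, 190, 310, 390]))  # 0.992 -- close to the best model
print(r2(actual, [250, 250, 250, 250]))  # 0.0   -- exactly the mean model
print(r2(actual, [400, 300, 200, 100]))  # -3.0  -- worse than the mean model
```

The mean model scores exactly 0 by construction, which is why it marks the boundary between useful and useless models.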

I hope now we can appreciate the beauty of R2 and understand the intuition behind it.

Originally published at http://ragrawal.wordpress.com on May 6, 2017.


Ritesh Agrawal

Senior Machine Learning Engineer, Varo Money; Contributor and Maintainer of sklearn-pandas library