Do Not Use \(R^2\) for Cross Validation

Author

Lukas Graz

Published

July 17, 2024

Using the notation from Wikipedia:

The coefficient of determination is usually evaluated on the training set. If you evaluate it on test data instead, weird things can happen (especially when a test fold is small). Let’s consider the $ R^2 $ for one fold that contains two observations whose $ y $ values, by pure chance, are VERY close to each other.

\[ R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} \]

$ SS_{\text{tot}} $ is calculated by

\[ \sum_{i=1}^{2} (y_i - \bar{y})^2 \]

where \(\bar{y}\) is the mean of those two observations (at least this is how it is done in scikit-learn). Since the two observations are VERY close to each other, $ SS_{\text{tot}} $ gets VERY small, while the test residuals (and hence $ SS_{\text{res}} $) might still be moderately high. This way, $ R^2 $ can be VERY low. By putting the two $ y $-values arbitrarily close to each other, you can make the $ R^2 $ for this test fold go to \(-\infty\). Note that those $ y $-values are not outliers, but just very close to each other.

Minimal example:

$ n=4 $, $ k=2 $ folds, estimator = mean; $ \hat{y} $ are the predictions for each fold, obtained as the mean of the other fold.

| fold | $ y $ | $ \hat{y} $ |
|------|-------|-------------|
| 1    | 4     | 5.0005      |
| 1    | 7     | 5.0005      |
| 2    | 5     | 5.5         |
| 2    | 5.001 | 5.5         |

Consider fold 2:

\[ SS_{\text{res}} = 0.5^2 + 0.499^2 = 0.499001 \]

\[ SS_{\text{tot}} = (5 - 5.0005)^2 + (5.001 - 5.0005)^2 = 5 \times 10^{-7} \]

\[ R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \frac{0.499001}{5 \times 10^{-7}} = -998001 \]
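The per-fold numbers above can be reproduced in a few lines of plain Python. This is a minimal sketch; the `r2_score` helper below mirrors the formula scikit-learn uses (with $ SS_{\text{tot}} $ computed from the mean of the test fold's own $ y $ values):

```python
def r2_score(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot, with SS_tot based on the
    # mean of the test fold's own y values (as in scikit-learn)
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Fold 1: y values spread out, predicted by the mean of fold 2
print(r2_score([4, 7], [5.0005, 5.0005]))   # moderately bad, about -0.11
# Fold 2: y values almost identical, predicted by the mean of fold 1
print(r2_score([5, 5.001], [5.5, 5.5]))     # catastrophic, about -998001
```

Fold 1 looks unremarkable; fold 2 alone would drag any averaged CV score into absurd territory, even though its residuals are no larger than fold 1's.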

Alternatives:

  • Use the (R)MSE
  • Use the MAE
  • Use a “pseudo R-squared”: $ 1 - \text{MSE} / \text{Var}(y) $, where $ \text{Var}(y) $ is computed over the whole dataset rather than per fold (cf. the R implementation of randomForest)
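The pseudo R-squared alternative can be sketched as follows. This is a minimal illustration, not randomForest's exact code: `pseudo_r2` is a hypothetical helper name, the CV predictions from all folds are pooled, and I use the population variance of the full response (whether to divide by $n$ or $n-1$ is a convention you would want to pin down):

```python
def pseudo_r2(y_true, y_pred):
    # 1 - MSE / Var(y): MSE over the pooled out-of-fold predictions,
    # Var(y) over the full response vector, NOT per fold
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mean = sum(y_true) / n
    var = sum((t - mean) ** 2 for t in y_true) / n  # population variance
    return 1 - mse / var

# Pooled out-of-fold predictions from the minimal example above
y     = [4, 7, 5, 5.001]
y_hat = [5.0005, 5.0005, 5.5, 5.5]
print(pseudo_r2(y, y_hat))   # a finite, interpretable value (about -0.17)
```

Because the denominator is the variance of all of $ y $, a single fold with nearly identical responses can no longer blow the score up to \(-\infty\).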