Do Not Use \(R^2\) for Cross Validation

Author

Lukas Graz

Published

July 17, 2024

Using the notation from Wikipedia:

The coefficient of determination is usually evaluated on the training set. If you evaluate it on test data instead, weird things can happen (especially when a test fold is small). Let’s consider the $ R^2 $ for one fold that contains two observations whose $ y $ values, by pure chance, are VERY close to each other.

\[ R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} \]

$ SS_{\text{tot}} $ is calculated by

\[ \sum_{i=1}^{2} (y_i - \bar{y})^2 \]

where \(\bar{y}\) is the mean of those two observations (at least this is how it is done in scikit-learn). Since the two observations are VERY close to each other, $ SS_{\text{tot}} $ gets VERY small, while the test residuals (and hence $ SS_{\text{res}} $) might still be moderately high. This way, $ R^2 $ can be VERY low. By putting the two $ y $-values arbitrarily close to each other, you can make the $ R^2 $ for this test fold go to \(-\infty\). Note that those $ y $-values are not outliers, but just very close to each other.

Minimal example:

$ n=4 $, $ k=2 $ folds, estimator = mean; $ \hat{y} $ are the predictions for each fold, obtained as the mean of the other fold.

| fold | $ y $ | $ \hat{y} $ |
|------|-------|-------------|
| 1    | 4     | 5.0005      |
| 1    | 7     | 5.0005      |
| 2    | 5     | 5.5         |
| 2    | 5.001 | 5.5         |

Consider fold 2:

\[ SS_{\text{res}} = 0.5^2 + 0.499^2 = 0.499001 \]

\[ SS_{\text{tot}} = (5 - 5.0005)^2 + (5.001 - 5.0005)^2 = 5 \times 10^{-7} \]

\[ R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \frac{0.499001}{5 \times 10^{-7}} = -998001 \]
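The per-fold numbers above can be reproduced in a few lines of plain Python. This is a minimal sketch; the `r2_score` helper below mirrors the formula scikit-learn uses (with $ SS_{\text{tot}} $ computed from the mean of the test fold's own $ y $ values):

```python
def r2_score(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot, with SS_tot based on the
    # mean of the test fold's own y values (as in scikit-learn)
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Fold 1: y values spread out, predicted by the mean of fold 2
print(r2_score([4, 7], [5.0005, 5.0005]))   # moderately bad, about -0.11
# Fold 2: y values almost identical, predicted by the mean of fold 1
print(r2_score([5, 5.001], [5.5, 5.5]))     # catastrophic, about -998001
```

Fold 1 looks unremarkable; fold 2 alone would drag any averaged CV score into absurd territory, even though its residuals are no larger than fold 1's.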

Alternatives:

  • Use the (R)MSE
  • Use the MAE
  • Use a “pseudo R-squared”: $ 1 - \text{MSE} / \text{Var}(y) $, where $ \text{Var}(y) $ is computed over the whole dataset rather than per fold (cf. the R implementation of randomForest)
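The pseudo R-squared alternative can be sketched as follows. This is a minimal illustration, not randomForest's exact code: `pseudo_r2` is a hypothetical helper name, the CV predictions from all folds are pooled, and I use the population variance of the full response (whether to divide by $n$ or $n-1$ is a convention you would want to pin down):

```python
def pseudo_r2(y_true, y_pred):
    # 1 - MSE / Var(y): MSE over the pooled out-of-fold predictions,
    # Var(y) over the full response vector, NOT per fold
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mean = sum(y_true) / n
    var = sum((t - mean) ** 2 for t in y_true) / n  # population variance
    return 1 - mse / var

# Pooled out-of-fold predictions from the minimal example above
y     = [4, 7, 5, 5.001]
y_hat = [5.0005, 5.0005, 5.5, 5.5]
print(pseudo_r2(y, y_hat))   # a finite, interpretable value (about -0.17)
```

Because the denominator is the variance of all of $ y $, a single fold with nearly identical responses can no longer blow the score up to \(-\infty\).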