Cross-validation: The right way and the wrong way

Elan Ding
Modified: June 14, 2018

In this post, I will briefly discuss a commonly used technique for evaluating a machine learning model: k-fold cross-validation. Tweaking the hyperparameters of a model until the test error is minimal essentially "leaks" information from the test set into the model. To reduce the risk of overfitting to the test set, a natural approach is to split the dataset into three portions: a training set, a validation set, and a test set. With this setup, the model is first trained on the training set, then validated (including the tweaking of hyperparameters) on the validation set. Once the model looks satisfactory, a final evaluation is done on the test set.
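
As a quick sketch of what such a three-way split might look like in code (my own illustration, not part of the original workflow; the 60/20/20 proportions are an arbitrary choice):

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, just for illustration
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, 100)

# Hold out 20% as the test set...
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2)
# ...then carve a validation set out of what remains (0.25 of 80% = 20% overall)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25)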

The problem, however, is that partitioning the dataset into three parts reduces the number of samples available for training the model, which increases the variance of our performance estimate. This is where k-fold cross-validation comes in. The process is as follows.

  • The data is evenly divided into $k$ folds.
  • The model is fitted $k$ times; each time, one fold is left out as the test data and the model is trained on the remaining $k-1$ folds.
  • The performance of the model is measured by the average of the scores computed during the $k$ fits (a hand-written sketch of this loop is shown below).
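
To make the procedure concrete, here is a rough hand-written version of this loop using sklearn's KFold on toy data (cross_val_score, used below, does essentially the same thing for you):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Toy data, for illustration only
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, 100)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True).split(X):
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])        # train on k-1 folds
    preds = model.predict(X[test_idx])           # predict on the held-out fold
    scores.append(f1_score(y[test_idx], preds, average='macro'))

print(np.mean(scores))   # the cross-validation estimate is the average of the k scores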

Example of k-fold cross-validation

We demonstrate k-fold cross-validation on the famous Iris dataset.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

sklearn actually has the data built in. We simply need the following code to load it.

In [2]:
iris = datasets.load_iris()

The simplest way to use cross-validation is the cross_val_score function. It returns one score per fold, each between 0 and 1, where a score closer to 1 represents a better model.

In [3]:
# Hyperparameters can be passed to LogisticRegression(), but we keep the defaults here.
logmodel = LogisticRegression()
In [4]:
scores = cross_val_score(
    logmodel, iris.data, iris.target, cv=5, scoring='f1_macro')
In [5]:
print("The 95% C.I. of score: {:.3} (+/-) {:.3}".format(
    scores.mean(), scores.std()*2))
The 95% C.I. of score: 0.96 (+/-) 0.0792

According to the sklearn.metrics documentation, the f1_macro score is the unweighted mean of the per-class F1 scores, where each class's F1 score is calculated as

$$ F_1 = \frac{2(\text{precision } \times \text{ recall})}{\text{precision } + \text{ recall}} $$

For more information, take a look at this documentation.
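
As a quick sanity check of what "macro" averaging means (this snippet is my own illustration on made-up labels), we can compute the per-class precision, recall, and F1 by hand and compare with sklearn:

import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# Macro F1: compute F1 for each class separately, then take the unweighted mean
per_class_f1 = []
for c in [0, 1, 2]:
    p = precision_score((y_true == c).astype(int), (y_pred == c).astype(int))
    r = recall_score((y_true == c).astype(int), (y_pred == c).astype(int))
    per_class_f1.append(2 * p * r / (p + r) if (p + r) > 0 else 0.0)

print(np.mean(per_class_f1))                       # manual macro F1
print(f1_score(y_true, y_pred, average='macro'))   # agrees with sklearn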

That was easy! However, as mentioned in Chapter 5 of ISLR, a common mistake, one that has appeared even in high-profile genomics journals, is to tweak the model using the entire dataset before applying cross-validation, which leads to optimistically biased performance scores.

Cross-validation: The wrong way

I am going to show you an example of the wrong way to do cross-validation. First, I will create a random dataset with a huge number of features, and I will make the response completely independent of the features, so the true test error is 50%. Think of this as a genomics study with thousands of gene expression measurements (features) and a relatively small number of data points.

First I will generate a dataset with 10,000 features and only 20 data points.

In [6]:
data = np.random.rand(20,10000)

Next, I will generate the target label as a binary outcome that is completely random, so that the true testing error is 50%.

In [7]:
target = np.random.randint(0,2,20)

We have a ginormous number of features, so it is natural to want to reduce them by selecting only the most important ones. First we fit a logistic regression on the entire dataset.

In [8]:
glm = LogisticRegression()
In [9]:
glm.fit(data, target)
Out[9]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

We select the most important features using NumPy's boolean indexing. Basically, I only keep the features whose logistic regression coefficients exceed 0.025. As you can see below, there are only 5 of them. Big reduction!

In [10]:
feature_select = glm.coef_[0]>0.025
sum(feature_select)
Out[10]:
5

We define our new, trimmed dataset after feature reduction.

In [11]:
newdata = data[:, feature_select]

Let's find the 5-fold cross-validation score.

In [12]:
scores = cross_val_score(
    glm, newdata, target, cv=5, scoring='f1_macro')
In [13]:
scores.mean()
Out[13]:
0.9333333333333333

Unbelievable! It looks like the classifier is almost perfect, even though the class labels are pure random noise. Something is definitely wrong.

Cross-validation: The right way

Instead of reducing the features first, we split the data first. To keep things simple, we hold out 20% as a test set, which mimics a single fold of 5-fold cross-validation.

In [14]:
X_train, X_test, y_train, y_test = train_test_split(data,
                                                    target,
                                                    test_size=0.2)

Now, using only the training portion (the analogue of the other 4 folds), we fit our model and determine which features are the most important.

In [15]:
glm2 = LogisticRegression()
In [16]:
glm2.fit(X_train, y_train)
Out[16]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [17]:
feature_select2 = glm2.coef_[0]>0.022
sum(feature_select2)
Out[17]:
7
In [18]:
X_train2 = X_train[:, feature_select2]
In [19]:
X_test2 = X_test[:, feature_select2]
In [20]:
glm2.fit(X_train2, y_train)
Out[20]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [21]:
glm2.score(X_test2, y_test)
Out[21]:
0.25

Wow! Here our score is only 0.25, far below the inflated 0.93 from before, which is what we should expect given that the labels are random (and with only 4 test points, the estimate is also very noisy).

So why is this the case? If we trim the features before splitting the data into folds, we are essentially using information from the test set to build the model! This is the information leak that I mentioned at the beginning. Instead, we should split the data first, and only then perform feature selection, hyperparameter tuning, or any other data-dependent step on the training data.
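
A convenient way to automate the right approach (my own addition, not from the original analysis) is to wrap the feature-selection step and the classifier in an sklearn Pipeline. When the pipeline is passed to cross_val_score, the selection step is re-fit within each fold using only that fold's training data, so nothing leaks from the held-out fold. On the random data and target generated above, the score drops back to roughly chance level:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Feature selection happens inside each fold, on the training folds only
pipe = Pipeline([
    ('select', SelectKBest(f_classif, k=5)),   # keep 5 features (arbitrary choice)
    ('clf', LogisticRegression()),
])

scores = cross_val_score(pipe, data, target, cv=5, scoring='f1_macro')
print(scores.mean())   # hovers around 0.5, consistent with purely random labels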