Never do this mistake when using Feature Selection

Esben Jannik Bjerrum/ February 19, 2016/ Blog, Cheminformatics, Machine Learning/ 3 comments

Wrong Feature Selection

Wrong way of doing feature selection.

Feature selection is a powerful way of reducing the complexity of a machine learning or statistical model. But feature selection must be done in the right way, or you may end up cheating yourself or others. Later we will also see if green jelly beans are truly associated with acne.
When working with machine learning, it may be a good idea to reduce the number of used features in the model. Maybe the features or measurements are costly to achieve and using fewer features in the model can ease interpretation. Additionally the model will use less computational resources and time of execution.
There is however an inherent danger of (un?)willingly introduce a bias into the data modelling. This example script demonstrates the effect. First some imports is done. Numpy for data arrays and numeric computing. Some scikit-learn (sklearn) modules for feature selection and model building and matplotlib for plotting.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.cross_validation import train_test_split
import matplotlib.pyplot as plt

Numerical python is used to generate some data to play with.

# Some dataset to play with
X = np.random.randn(80,10000)
y = np.random.randn(80)

For feature selection I use the sklearn utilities. I use the SelectKbest, which selects the specified number of features based on the passed test, here the f_regression test also from the sklearn package. All features are evaluated each on their own with the test and ranked according to the f statistical regression test. The 40 best features are selected and a new X data block is build. All with one line of code.

#Feature Selection
X_selected = SelectKBest(f_regression, k=40).fit_transform(X, y)

The next steps more or less follow the same approach to model building as previously described. The dataset is divided into a test and a training set, followed by a tuning of the alpha hyper parameter for the ridge regression model. To keep things simple I just use the test set for tuning, instead of cross validation.

#Divide into training and test-set
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.50, random_state=42)
#Tune Alpha
#prepare scoring lists
fitscores = []
predictscores = []
#prepare a log spaced list of alpha values to test
alphas = np.logspace(-1, 4, num=25)
#Iterate through alphas and fit with Ridge Regression
for alpha in alphas:
	estimator = Ridge(alpha = alpha),y_train)

This results in three lists, one with the alphas tests, one with the scores for the fit of the training set and one with the scores of the test set prediction. To get a visual overview a plot is useful.

#make a plot
ax = plt.gca()
ax.plot(alphas, fitscores,'g', label = 'Train')
ax.plot(alphas, predictscores,'b', label = 'Test')
#Set limits and titles
plt.ylabel('Correlation Coefficient')
plt.title('Tune regularization')

[image source=”” alt=”Tuning” width=”640″ height=”480″ class=”alignright size-full wp-image-1182″ border=”true”]
The plot nicely follows what is expected. Some degree of regularization gives a model with a not too large over-fit to the data. An alpha around 45 seems adequate.
A final model is produced from the entire training set and the prediction of the test set is compared to the actual values with an additional plot.

#Build a model on the training set and test it.
ridgeEstimator = Ridge(alpha = 45),y_train)
plt.plot(y_train,ridgeEstimator.predict(X_train),'go', label = 'Train')
plt.plot(y_test,ridgeEstimator.predict(X_test),'b*', label = 'Test')
plt.ylabel('Predicted Y')
plt.title('Prediction Plot')
plt = watermark(plt) #Can be skipped

[image source=”” alt=”Prediction” width=”640″ height=”480″ class=”alignright size-full wp-image-1181″ border=”true”]
This looks like a fairly good model. But how can this be? The data was made in a completely random way. Is there a mistake in the random module of Numerical Python? Should I have used cross validation on the training set to tune the hyper-parameter? The first is highly unlikely and the second could give a small effect, but the true problem is the way feature selection was done. Is there then an error in the way the sklearn module works? No, the true reason is related to a sampling effect as also featured in this comic strip by Randall Munroe: Jelly Beans Causes Acne (
If enough random data is sampled and tested with statistical tools that do not take the number of testings into account, a wrong understanding of the correlations in the dataset will result. The mistake in the example script above is that the feature selection is done on the entire dataset and not just using the training set. This way 40 features was selected out a large pool that by chance had a correlation to the y values. Later splitting and model building will then re find this correlation no matter how the dataset is split and tested.
This effect may not just be active when using automated feature selection as above, but also as the human modeler tries to optimize the models. The more learners, the more features, the more different ways of producing fingerprints from molecules are tested, the more likely one is to find a combination that by chance works for the dataset at hand. This is why is it always a good idea to split out a validation set at the very beginning of the modeling exercise or to obtain a brand new validation set after modeling by procuring new data or do new measurements in the lab.
Best Regards
Esben Jannik Bjerrum
I leave it as an exercise to interested readers to switch the order of feature selection and train-test splitting.
The entire script for convenience:

Share this Post


  1. Hello Esben,
    First of all, thank you for your wonderful blog. It has been very useful for my research – I am doing machine learning / QSPR on molecules.
    It is essentially a form of overfitting. I’ve seen the precise problem you discussed mentioned in the QSPR lit as “selection bias”. One way to check for it is y-scrambling or y-randomization, but the math on how to properly apply (which most people don’t follow) is non trivial.
    The proper thing is to hold out some data in a validation set and keep it away from the entire modeling process (descriptor / feature selection , model selection, model buidling). Most people who do descriptor selection and QSPR work with ML do this but I’ve seen some publications that neglect to.

    1. Hi Dan, Thank you for commenting and its nice that you find my blog useful. It is indeed important to keep out an external validation or test set that is kept separate from the entire modeling process. Including the feature selection, hyperparameter tuning, selection of model, creation of custom features and so forth. Then the rest of the data can be used in cross validation or with a fixed test/validation set, giving a three set split train/val(or CV)/test.
      However a lot of the papers I have received for review (and a lot of the published ones) fail to do this or don’t give details in methods, unintentionally or not. I find the most convincing method is to procure new data post modeling from the lab for the final testing or engage in blinded competitions/challenges such as cafasp or D3R grand challange. Do you know of any such upcoming challenges within QSPR?

      1. I am not aware of any competitions or challenges going on right now. If you find any, I would be interested to know about them.
        A quick look on reveals that Merck did a challenge there about 5 years ago:
        There is also some molecular coordinate data and forcefield energies to play around with on Kaggle, if you are interested in coordinate-based featurization techniques (like the Coulomb matrix / bag of bonds) :

Leave a Comment

Your email address will not be published.