Several text files presenting the experimental and predicted binding data in different forms are available for download. One file contains the complete experimental binding data without predictions; its "cv" column indicates the cross-validation round in which each entry was used (values from 0 to 4). A .zip archive contains the training and testing sets used in the cross-validation. The filenames indicate the data they contain: MHC_SPECIES - MHC_ALLELE - PEPTIDE_LENGTH - CV, followed by either "train" or "test". The "test" files also contain the predictions made by the SMM, ARB and ANN methods. Finally, there is a set of files containing all experimental data used and the predictions made by the ANN, ARB and SMM methods.
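As an illustration of how the "cv" column might be used to rebuild one cross-validation round from the complete data file, here is a minimal Python sketch. It rests on several assumptions that are not stated above: that the file is tab-separated with a header row, that entries whose "cv" value equals the chosen round form the test set for that round while all others form the training set, and that the column names other than "cv" are as shown. The sample rows are made-up placeholders, not real measurements, and the function names are hypothetical.

```python
import csv
import io

# Made-up placeholder rows; only the "cv" column name comes from the
# description above. "peptide" and "ic50" are hypothetical column names.
SAMPLE = (
    "peptide\tcv\tic50\n"
    "AAAAAAAAA\t0\t120.0\n"
    "CCCCCCCCC\t1\t4500.0\n"
    "DDDDDDDDD\t0\t35.5\n"
    "EEEEEEEEE\t4\t20000.0\n"
)


def read_rows(text, delimiter="\t"):
    """Parse delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def split_by_cv(rows, test_round):
    """Partition rows into (train, test) for one cross-validation round.

    Assumption: rows whose "cv" value equals test_round are the test
    set for that round, and all remaining rows are the training set.
    """
    train, test = [], []
    for row in rows:
        if int(row["cv"]) == test_round:
            test.append(row)
        else:
            train.append(row)
    return train, test


rows = read_rows(SAMPLE)
train, test = split_by_cv(rows, test_round=0)
```

To work with the downloaded file instead of the in-memory sample, the same `csv.DictReader` call can be pointed at an open file handle. The delimiter and header names should be checked against the actual file first.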
Important update: On August 1st, 2006, the datasets above were updated as follows: 1) The prediction scores given for the ANN predictions are now IC50 values instead of log(IC50) values, to be consistent with the ARB and SMM prediction scores. 2) The 'indivdidual_datasets.zip' file previously misidentified ANN prediction scores as SMM prediction scores and failed to list ANN predictions at all. The modified datasets should now accurately reflect the data used in the publication. We thank Can Kesmir for pointing out these problems.
Brett Trost from the University of Saskatchewan has written a Perl script that extracts subsets of the data with certain characteristics.