J. Cheng and P. Baldi. A Machine Learning Information Retrieval Approach to Protein Fold Recognition. Bioinformatics, vol. 22, no. 12, pp. 1456-1463, 2006.
The whole feature dataset generated from Lindahl's dataset can be downloaded here.
The 54 Selected Similarity Features Ranked by Information Gain
| Feature Name | Information Gain |
|---|---|
| HHSearch score | 0.0375 |
| COMPASS evalue | 0.0370 |
| PRC reverse score on chk profile | 0.0354 |
| PRC reverse score on HMM profile | 0.0341 |
| HMMer pfam evalue | 0.0287 |
| dot product of SS and SA vectors | 0.0266 |
| HMMer search evalue | 0.0264 |
| SS match ratio | 0.0263 |
| correlation of SS and SA vectors | 0.0263 |
| PRC simple score on HMM profile | 0.0248 |
| cosine of SS and SA vectors | 0.0246 |
| Gaussian kernel on SS and SA vectors | 0.0237 |
| COMPASS score | 0.0235 |
| PRC coemis score on HMM profile | 0.022 |
| PSI-BLAST evalue | 0.0205 |
| IMPALA evalue | 0.0181 |
| RPS-BLAST evalue | 0.0180 |
| SA match ratio | 0.0154 |
| cosine of residue contact num (8AA) | 0.0150 |
| HMMer search score | 0.0142 |
| cosine of residue contact num (12AA) | 0.0141 |
| PRC simple score on chk profile | 0.0140 |
| normalized length of Palign alignment | 0.0135 |
| normalized contact probability (8AA) | 0.0132 |
| Gaussian kernel of sequence dimer composition | 0.0121 |
| correlation of residue contact num (8AA) | 0.0120 |
| cosine of residue contact order (8AA) | 0.0117 |
| correlation of family monomer composition | 0.0116 |
| correlation of family dimer composition | 0.0116 |
| PSI-BLAST alignment score | 0.0116 |
| Gaussian kernel of family dimer composition | 0.0115 |
| cosine of family dimer composition | 0.0113 |
| RPS-BLAST alignment score | 0.0113 |
| cosine of family monomer composition | 0.0112 |
| IMPALA alignment length | 0.0111 |
| RPS-BLAST alignment length | 0.0111 |
| Gaussian kernel of family monomer composition | 0.0110 |
| correlation of vectors of residue contact order | 0.0109 |
| IMPALA alignment score | 0.0109 |
| Palign alignment score | 0.0102 |
| normalized beta residue pairing probability | 0.0100 |
| PRC coemis score on chk profile | 0.0091 |
| correlation of residue contact num (12AA) | 0.0083 |
| correlation of residue contact order (8AA) | 0.0074 |
| cosine of residue contact order (12AA) | 0.0072 |
| correlation of residue contact order (12AA) | 0.0068 |
| cosine of sequence dimer composition | 0.0065 |
| normalized contact probability (12AA) | 0.0058 |
| Clustalw profile alignment score | 0.0050 |
| cosine of sequence monomer composition | 0.0033 |
| Gaussian kernel of sequence monomer composition | 0.0033 |
| correlation of sequence monomer composition | 0.0027 |
| correlation of family dimer composition | 0.0022 |
| Clustalw sequence alignment score | 0.0010 |
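
For readers who want to reproduce a ranking like this on their own feature files, the following is a minimal sketch of how the information gain of one numeric feature could be estimated. It is not the procedure used to produce the table: the equal-width binning, the `n_bins` parameter, and the function names `entropy` and `information_gain` are all assumptions made for illustration, and the labels are assumed to be one class label per example (for instance, 1 if a query-template pair shares a fold and 0 otherwise).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(values, labels, n_bins=10):
    """Information gain of a numeric feature after equal-width binning.

    Assumption: equal-width bins are used to discretize the feature;
    other discretization schemes will give different numbers.
    """
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when the feature is constant.
    width = (hi - lo) / n_bins or 1.0
    bins = {}
    for v, y in zip(values, labels):
        # Clamp the maximum value into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        bins.setdefault(idx, []).append(y)
    total = len(labels)
    # Weighted average entropy of the labels within each bin.
    conditional = sum(len(ys) / total * entropy(ys) for ys in bins.values())
    return entropy(labels) - conditional
```

Applying `information_gain` to each feature column and sorting the results in descending order would yield a ranking of the same form as the table above, although the absolute values depend on the binning choice.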