Journal of Computational Biology
Linear Regression Models for Solvent Accessibility Prediction in Proteins
To cite this article:
Michael Wagner, Rafał Adamczak, Aleksey Porollo, Jarosław Meller.
Journal of Computational Biology.
April 2005,
12(3): 355-369.
doi:10.1089/cmb.2005.12.355.
Published in Volume: 12 Issue 3: April 21, 2005
Michael Wagner
Division of Biomedical Informatics, Cincinnati Children's Hospital Research Foundation, 3333 Burnet Avenue, Cincinnati, OH 45229.

Rafał Adamczak
Division of Biomedical Informatics, Cincinnati Children's Hospital Research Foundation, 3333 Burnet Avenue, Cincinnati, OH 45229.

Aleksey Porollo
Division of Biomedical Informatics, Cincinnati Children's Hospital Research Foundation, 3333 Burnet Avenue, Cincinnati, OH 45229.

Jarosław Meller
Division of Biomedical Informatics, Cincinnati Children's Hospital Research Foundation, 3333 Burnet Avenue, Cincinnati, OH 45229.
Department of Informatics, Nicholas Copernicus University, 87-100 Toruń, Poland.

The relative solvent accessibility (RSA) of an amino acid residue in a protein structure is a real number that represents the solvent exposed surface area of this residue in relative terms. The problem of predicting the RSA from the primary amino acid sequence can therefore be cast as a regression problem. Nevertheless, RSA prediction has so far typically been cast as a classification problem. Consequently, various machine learning techniques have been used within the classification framework to predict whether a given amino acid exceeds some (arbitrary) RSA threshold and would thus be predicted to be "exposed," as opposed to "buried." We have recently developed novel methods for RSA prediction using nonlinear regression techniques which provide accurate estimates of the real-valued RSA and outperform classification-based approaches with respect to commonly used two-class projections. However, while their performance seems to provide a significant improvement over previously published approaches, these Neural Network (NN) based methods are computationally expensive to train and involve several thousand parameters.
In this work, we develop alternative regression models for RSA prediction which are computationally much less expensive, involve orders-of-magnitude fewer parameters, and are still competitive in terms of prediction quality. In particular, we investigate several regression models for RSA prediction using linear L1-support vector regression (SVR) approaches as well as standard linear least squares (LS) regression. Using rigorously derived validation sets of protein structures and extensive cross-validation analysis, we compare the performance of the SVR with that of LS regression and NN-based methods. In particular, we show that the flexibility of the SVR (as encoded by metaparameters such as the error insensitivity and the error penalization terms) can be very beneficial to optimize the prediction accuracy for buried residues. We conclude that the simple and computationally much more efficient linear SVR performs comparably to nonlinear models and thus can be used in order to facilitate further attempts to design more accurate RSA prediction methods, with applications to fold recognition and de novo protein structure prediction methods.

This paper was cited by:

Protein function annotation from sequence: prediction of residues interacting with RNA. R. V. Spriggs, Y. Murakami, H. Nakamura, S. Jones. Bioinformatics. Jul 2009, Vol. 25, No. 12: 1492-1497.

Pathogenic or not? And if so, then how? Studying the effects of missense mutations using bioinformatics methods. Janita Thusberg, Mauno Vihinen. Human Mutation. Jun 2009, Vol. 30, No. 5: 703-714.

The Ser/Thr/Tyr phosphoproteome of Lactococcus lactis IL1403 reveals multiply phosphorylated proteins. Boumediene Soufi, Florian Gnad, Peter Ruhdal Jensen, Dina Petranovic, Matthias Mann, Ivan Mijakovic, Boris Macek. PROTEOMICS. Oct 2008, Vol. 8, No. 17: 3486-3493.

A novel computational and structural analysis of nsSNPs in CFTR gene. C. George Priya Doss, R. Rajasekaran, C. Sudandiradoss, K. Ramanathan, R. Purohit, R. Sethumadhavan. Genomic Medicine. Feb 2008, Vol. 2, No. 1-2: 23-32.

Prediction-based fingerprints of protein–protein interactions. Aleksey Porollo, Jarosław Meller. Proteins: Structure, Function, and Bioinformatics. Mar 2007, Vol. 66, No. 3: 630-645.

Two-stage support vector regression approach for predicting accessible surface areas of amino acids. Minh N. Nguyen, Jagath C. Rajapakse. Proteins: Structure, Function, and Bioinformatics. Jun 2006, Vol. 63, No. 3: 542-550.

Combining prediction of secondary structure and solvent accessibility in proteins. Rafał Adamczak, Aleksey Porollo, Jarosław Meller. Proteins: Structure, Function, and Bioinformatics. Jun 2005, Vol. 59, No. 3: 467-475.
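
Illustrative note on the abstract's terminology: the "error insensitivity" metaparameter of an SVR is commonly realized as an epsilon-insensitive loss, and a "two-class projection" thresholds a real-valued RSA into exposed versus buried. The sketch below is a minimal illustration of those two ideas next to the squared loss used by LS regression. It is not the authors' implementation: the function names, the RSA values, the epsilon of 5, and the 25% threshold are all assumptions chosen only for the example.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Errors within +/- epsilon cost nothing; larger errors grow linearly."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]


def squared_loss(y_true, y_pred):
    """Per-residue squared error, the quantity minimized by least squares."""
    return [(t - p) ** 2 for t, p in zip(y_true, y_pred)]


def two_class_projection(rsa_values, threshold):
    """Label a residue 'exposed' (True) when its RSA exceeds the threshold."""
    return [rsa > threshold for rsa in rsa_values]


# Made-up RSA values in percent (assumption, not data from the paper).
rsa_true = [5.0, 20.0, 60.0, 90.0]
rsa_pred = [8.0, 35.0, 55.0, 70.0]

eps_loss = epsilon_insensitive_loss(rsa_true, rsa_pred, epsilon=5.0)
sq_loss = squared_loss(rsa_true, rsa_pred)

# Project both the true and the predicted RSA onto two classes and
# measure how often the projected labels agree.
exposed_true = two_class_projection(rsa_true, threshold=25.0)
exposed_pred = two_class_projection(rsa_pred, threshold=25.0)
matches = [a == b for a, b in zip(exposed_true, exposed_pred)]
accuracy = sum(matches) / len(matches)

print(eps_loss, sq_loss, accuracy)
```

In this toy example the squared loss penalizes every deviation, while the epsilon-insensitive loss ignores deviations of at most 5 percentage points. That tolerance is the kind of flexibility the abstract attributes to the SVR metaparameters.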