ABSTRACT OF Professor Casadio's TALK

 

Machine-Learning Approaches, Protein Structure Prediction and Structural Genomics

Friday September 19, 2003, 14:50-15:40

As a result of large sequencing projects, data banks of protein sequences and structures are growing rapidly. The number of sequences is, however, one order of magnitude larger than the number of structures known at atomic level, in spite of efforts to accelerate the determination of protein structures. Tools have been developed to bridge the gap between sequence and protein 3D structure, based on the notion that information can be retrieved from the databases and that knowledge-based methods can help in approaching a solution of the protein folding problem. To this aim our group has implemented neural-network-based predictors that perform with some success in different tasks, including prediction of the secondary structure of globular and membrane proteins, of the topology of membrane proteins and porins, and of stable alpha-helical segments suited for protein design. Moreover, we have developed methods for predicting contact maps in proteins and the probability of finding a cysteine in a disulphide bridge, tools which can contribute to the goal of predicting the 3D structure starting from the sequence (the so-called "ab initio" prediction). All our predictors take advantage of evolutionary information derived from the structural alignments of homologous proteins and from the sequence and structure databases. A hybrid system based on neural networks and hidden Markov models seems particularly successful in predicting the bonding state of cysteines in proteins, scoring as high as 88% and correctly predicting 84% of the proteins in the test set. Neural networks also make it possible to predict protein-protein interaction patches starting from the protein 3D structure, which contributes to the study of protein-protein interaction networks.
Recently our predictors have been integrated in a package (HUNTER) capable of performing genome-wide analysis of protein sequences and annotating them on the basis of characteristic structural features. HUNTER scores as high as 96% when tested on some 2920 well or partially annotated proteins of E. coli. When the remaining 1253 non-annotated proteins are filtered, HUNTER predicts 154 new membrane proteins, 18 of which are classified as outer membrane proteins. In E. coli O157:H7 (a pathogenic strain of E. coli) we filtered 1901 non-annotated proteins. Our analysis classifies them as 1564 globular chains, 327 inner membrane proteins and 10 outer membrane proteins. With HUNTER, new membrane proteins are added to the list of putative membrane proteins of Gram-negative bacteria. The content of outer membrane proteins per genome (nine genomes were analyzed) ranges from 1.5 to 2.4% and is one order of magnitude lower than that of inner membrane proteins. The finding is particularly relevant considering that this is the first large-scale analysis based on validated tools that can predict the content of outer membrane proteins in a genome and allow cross-comparison of the same protein type between different species.
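
As a quick consistency check on the E. coli O157:H7 figures quoted above, the three class counts can be tallied as in the sketch below. This is only an illustration: the variable names are assumptions and the code is not part of HUNTER. The shares it prints are relative to the 1901 filtered, non-annotated proteins, not to the whole genome, so they should not be compared with the 1.5 to 2.4% per-genome range.

```python
# Class counts quoted in the abstract for the 1901 filtered,
# non-annotated proteins of E. coli O157:H7.
# The dictionary keys are illustrative labels, not HUNTER output names.
counts = {
    "globular": 1564,
    "inner membrane": 327,
    "outer membrane": 10,
}

# The three classes should account for every filtered protein.
total = sum(counts.values())

# Share of each class among the filtered proteins, in percent.
shares = {name: 100.0 * n / total for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {counts[name]} chains ({pct:.1f}%)")
print(f"total: {total}")
```

The total equals the 1901 proteins stated in the text, so the three classes partition the filtered set.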