How can genetics improve your learning

23.02.2007 16:14

Predictions for the protein factory

Mirjam Kaplow M.A. PR + marketing
Fraunhofer Institute for Computer Architecture and Software Technology FIRST

The gene search engine mSplicer can determine protein-coding areas on the genes of the nematode C. elegans 40% more precisely than previous methods

It is still a vision: filtering out exactly those sections of the roughly three billion letters of the human genome that are responsible for building proteins. What remains a future goal for the human genome has now been achieved by scientists from the Fraunhofer Society and the Max Planck Society for the genome of the nematode Caenorhabditis elegans: recognizing protein-coding and non-coding sections. The results of the cooperation project will be published on February 23, 2007 in the journal PLoS Computational Biology.

The one-millimeter-long Caenorhabditis elegans is one of the best-studied organisms in the world. Its genome has been fully sequenced since 1998. Nevertheless, the annotation of the genome, i.e. the localization of its genes and the determination of the corresponding proteins, is far from complete. It is continuously revised and extended. The aim of the research project is to improve the existing annotation of the nematode, which has not yet been fully verified by experiments. To do this, the researchers chose modern machine learning methods, which were used to identify exons and introns in the genetic information of the nematode. The results show that machine learning methods are 40% more precise than conventional methods and, in particular, than the annotation valid at the time of the experiments (Wormbase WS120). Machine learning methods can thus contribute significantly to improving existing annotations, not only in C. elegans but also in other organisms, and can considerably accelerate the correct decoding of genetic information.

Method and procedure

In order to substantiate their results, the scientists proceeded in several steps. First, the algorithms were trained on already decoded mRNA sequences. mRNA molecules (mRNA = messenger ribonucleic acid) carry the genetic information of the DNA and encode the corresponding proteins. During training, the algorithms learn the patterns that govern how DNA is converted into mRNA. These patterns help to distinguish the different parts of the gene sequence from one another. Recognizing the boundaries between exons and introns, the so-called splice sites, plays a crucial role.
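To make the notion of a splice site concrete, here is a minimal Python sketch. It assumes only the well-known convention that most introns begin with the dinucleotide GT and end with AG; the function names and the toy sequence are hypothetical and chosen for illustration. It is not mSplicer's learned model: it merely lists candidate boundaries, and deciding which candidates are real is exactly the hard part that the trained algorithms address.

```python
def candidate_splice_sites(dna):
    """Return (donors, acceptors): 0-based start positions of GT and AG."""
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return donors, acceptors


def splice(dna, introns):
    """Cut out introns, given as half-open (start, end) intervals, and join the exons."""
    pieces = []
    pos = 0
    for start, end in sorted(introns):
        pieces.append(dna[pos:start])  # exon preceding this intron
        pos = end                      # skip over the intron
    pieces.append(dna[pos:])           # final exon
    return "".join(pieces)


# Hypothetical toy sequence, not taken from the C. elegans genome.
toy = "ATGCCGTAAGTTTCAGGCTTAA"
donors, acceptors = candidate_splice_sites(toy)

# Assume the intron runs from the first GT through the last AG (inclusive).
intron = (donors[0], acceptors[-1] + 2)
spliced = splice(toy, [intron])
```

Even in this 22-letter toy sequence there are two GT and two AG candidates, which hints at why simple pattern matching is not enough on a whole genome.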

After the training phase, the algorithms were used to predict the finished mRNA from DNA, and the results were compared with existing databases. mSplicer was able to correctly predict all exons and introns with an accuracy of up to 95%.
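One plausible way to read such a comparison is as an exact-match score: a gene counts as correct only if every predicted exon boundary agrees with the reference. The following Python sketch illustrates that reading under this assumption; the gene names, coordinates, and the function name are hypothetical, and the release does not specify the exact evaluation protocol that was used.

```python
def exact_match_accuracy(predicted, reference):
    """Fraction of reference genes whose predicted exon list matches exactly."""
    if not reference:
        raise ValueError("reference must not be empty")
    hits = sum(
        1 for gene, exons in reference.items()
        if predicted.get(gene) == exons
    )
    return hits / len(reference)


# Hypothetical exon coordinates as (start, end) pairs per gene.
reference = {
    "geneA": [(0, 120), (300, 450)],
    "geneB": [(10, 90)],
    "geneC": [(5, 60), (200, 260), (400, 480)],
    "geneD": [(0, 75), (150, 210)],
}
predicted = {
    "geneA": [(0, 120), (300, 450)],               # exact match
    "geneB": [(10, 95)],                           # one boundary off
    "geneC": [(5, 60), (200, 260), (400, 480)],    # exact match
    "geneD": [(0, 75)],                            # one exon missing
}

accuracy = exact_match_accuracy(predicted, reference)
```

A score of this kind is strict: a single misplaced boundary makes the whole gene count as wrong.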

It was striking that the results matched the existing annotation of the C. elegans genome in at most 50% of cases. An evaluation of the Wormbase annotation version WS120 using information that became available later (based on Wormbase version WS150) confirmed that WS120 was imprecise in 18% of the examined cases, while mSplicer was inaccurate in 10-13% of the cases. In addition, biological laboratory experiments on 20 genes for which WS120 and mSplicer differed greatly from one another demonstrate the superiority of the algorithmic approach. mSplicer provided correct predictions in 75% of all cases, while the existing annotation was not correct in any of the cases examined.

Based on the results, a new C. elegans annotation was developed. It is available for download on the web.

In a further step, mSplicer was compared with two other state-of-the-art methods for predicting exons and introns: SNAP and ExonHunter. These methods are based on so-called generative models, which attempt to model the structure of the data examined. mSplicer, on the other hand, is based on discriminative methods: the algorithm learns "the difference" between correct and incorrect predictions and separates them using a separation function. Depending on the selection of the underlying sequences, SNAP and ExonHunter achieved an accuracy in predicting exons and introns of only 82.6% and 90.2%, respectively. The newly developed mSplicer method achieves an accuracy of up to 95.2%.
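The idea of a learned separation function can be sketched with a simple perceptron, one of the most basic discriminative learners. This is a swapped-in stand-in for illustration only: the release does not describe mSplicer's actual learning algorithm in detail, and the four-letter "true" and "decoy" site windows below, like all function names, are invented toy data.

```python
def features(window):
    """One-hot features: one (position, nucleotide) key per letter."""
    return [(i, nuc) for i, nuc in enumerate(window)]


def score(weights, bias, window):
    """Linear separation function: positive means 'true splice site'."""
    return sum(weights.get(f, 0) for f in features(window)) + bias


def train_perceptron(examples, max_epochs=100):
    """Adjust weights only on misclassified examples until none remain."""
    weights, bias = {}, 0
    for _ in range(max_epochs):
        mistakes = 0
        for window, label in examples:  # label is +1 or -1
            if label * score(weights, bias, window) <= 0:
                for f in features(window):
                    weights[f] = weights.get(f, 0) + label
                bias += label
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias


# Toy data: every window starts with GT, so only the context can separate them.
examples = [
    ("GTAA", +1), ("GTCC", -1),
    ("GTGA", +1), ("GTTC", -1),
    ("GTAT", +1), ("GTCT", -1),
]
weights, bias = train_perceptron(examples)
predictions = [1 if score(weights, bias, w) > 0 else -1 for w, _ in examples]
```

Note that the learner never models what a sequence looks like in general, as a generative model would; it only adjusts the weights until correct and incorrect candidates fall on opposite sides of the separation function.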

mSplicer has been under development since 2003 as part of a cooperation project between the Fraunhofer Society and the Max Planck Society. The cooperation focuses on forging closer links between basic and applied research.

The responsible project leaders are Prof. Dr. Klaus-Robert Müller at Fraunhofer FIRST, Prof. Dr. Bernhard Schölkopf at the Max Planck Institute for Biological Cybernetics, and Dr. Gunnar Rätsch at the Friedrich Miescher Laboratory.

Press contact:
Mirjam Kaplow, Head of Institute Communication, Fraunhofer FIRST
Tel.: 030 / 6392-1808; -1823
Email: [email protected]

Gunnar Rätsch, Head of the working group "Machine Learning in Biology"
Tel.: 07071 / 601-820; -801
E-mail: [email protected]

Features of this press release:
Biology, nutrition / health / care, information technology, medicine
Research results, research projects