The 4M (mixed memory Markov Model) algorithm for finding genes in prokaryotic genomes

Vidyasagar, M.; Mande, S. S.; Reddy, C. V. S. K.; Rao, V. V. R. (2008) The 4M (mixed memory Markov Model) algorithm for finding genes in prokaryotic genomes. IEEE Transactions on Automatic Control, 53. pp. 26-37. ISSN 0018-9286

Full text not available from this repository.

Official URL: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arn...

Related URL: http://dx.doi.org/10.1109/TAC.2007.911360

Abstract

In this paper, we present a new algorithm called 4M (mixed memory Markov model) for finding genes in the genomes of prokaryotes. This is achieved by modeling the known coding regions of the genome as a set of sample paths of a multistep Markov chain and the known non-coding regions as a set of sample paths of another multistep Markov chain. The new feature of the 4M algorithm is that different states are allowed to have different memory lengths, in contrast to the fixed multistep Markov model used in GeneMark in its various versions. At the same time, compared with an algorithm like Glimmer3 that uses an interpolation of Markov models of different memory lengths, the statistical significance of the conclusions drawn from the 4M algorithm is quite easy to quantify. Thus, when a whole genome annotation is carried out and several new genes are predicted, it is extremely easy to rank these predictions in terms of the confidence one has in them. The basis of the 4M algorithm is a simple rank condition satisfied by the matrix of frequencies associated with a Markov chain. The 4M algorithm is validated by applying it to 75 organisms belonging to practically all known families of bacteria and archaea. The performance of the 4M algorithm is compared with those of Glimmer3, GeneMark2.5d, and GeneMarkHMM2.6g. It is found that, in a vast majority of cases, the 4M algorithm finds many more genes than it misses, compared with any of the other three algorithms. Next, the 4M algorithm is used to carry out whole genome annotation of 13 organisms by using 50% of the known genes as the training input for the coding model and 20% of the known non-genes as the training input for the non-coding model. After this, all of the open reading frames are classified. It is found that the 4M algorithm is highly specific in that it picks out virtually all of the known genes, while predicting that only a small number of the open reading frames whose status is unknown are genes.
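
The two-model idea in the abstract (one Markov chain for coding regions, one for non-coding regions, with open reading frames ranked by how strongly they favor the coding model) can be illustrated with a minimal sketch. This is not the 4M algorithm: it assumes a single fixed memory length for every state and omits the rank condition and the variable memory lengths that the paper introduces. The function names, the `order` and `pseudocount` parameters, and the toy sequences are all hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(sequences, order=2, pseudocount=1.0):
    """Fit a fixed-order Markov chain (an assumption; 4M uses variable memory).

    Returns a function log_prob(context, symbol) giving the smoothed
    log transition probability estimated from the training sequences.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    def log_prob(context, symbol):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return math.log((ctx.get(symbol, 0.0) + pseudocount) / total)

    return log_prob


def log_likelihood(seq, log_prob, order=2):
    """Sum of log transition probabilities along the sequence."""
    return sum(log_prob(seq[i - order:i], seq[i]) for i in range(order, len(seq)))


def log_odds(seq, coding, noncoding, order=2):
    """Positive values favor the coding model, negative the non-coding one."""
    return log_likelihood(seq, coding, order) - log_likelihood(seq, noncoding, order)


def rank_candidates(candidates, coding, noncoding, order=2):
    """Sort candidate sequences from most to least coding-like."""
    return sorted(candidates, key=lambda s: log_odds(s, coding, noncoding, order),
                  reverse=True)


# Toy training data (hypothetical, not real genomic sequences).
coding_model = train_markov(["ATGATGATGATGATG", "ATGATGATGATG"])
noncoding_model = train_markov(["TTTTTTTTTTTTTTT", "TTTTTTTTTTTT"])

ranked = rank_candidates(["TTTTTTTTT", "ATGATGATG"], coding_model, noncoding_model)
```

The log-odds score here serves only as a stand-in for a confidence ranking; the paper's own method for quantifying statistical significance is not reproduced.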

Item Type: Article
Source: Copyright of this article belongs to IEEE.
ID Code: 56934
Deposited On: 25 Aug 2011 09:36
Last Modified: 25 Aug 2011 09:36
