Identification of Protein Coding Regions in Microbial Genomes Using Unsupervised Clustering
Computer Science and Engineering
The genomes of many organisms have now been sequenced, meaning that their nucleotide sequence is known, but the locations of genes and, most importantly, of the coding regions are not. Identifying coding regions is of vital importance because they code for proteins. Distinguishing between coding and non-coding regions is difficult and has been the subject of many research efforts. We describe an unsupervised clustering algorithm that identifies protein-coding regions in microbial genomic DNA sequences. The algorithm is based on a simple measure, the vector of nucleotide frequencies in a sliding window, and uses an ab initio iterative Markov modeling procedure to partition a genomic sequence into coding regions, regions coding on the opposite strand, and non-coding regions. The algorithm is very efficient and can be applied to any type of microbial genome, including those of uncharacterized microorganisms. Building on a method developed by Audic and Claverie, we improved the accuracy of finding coding regions and located the nearest transition point from one class to another with an accuracy matching or exceeding that of the best gene detection methods currently in use. The method was evaluated on 18 complete microbial genomes from GenBank, which cover four major phylogenetic lineages (Gram-negative, Gram-positive, cyanobacteria, and archaea). The results show improved performance in predicting the coding regions of microbial genomes.
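As a rough illustration of the measure named in the abstract, the sketch below computes a vector of nucleotide frequencies for each position of a sliding window. It is not the authors' implementation: the function name `window_frequency_vectors`, the window length, the step size, and the handling of ambiguous bases are all assumptions made here, since the abstract does not specify them. The clustering and iterative Markov modeling steps are not shown.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def window_frequency_vectors(sequence, window=99, step=3):
    """Return a list of (start, [fA, fC, fG, fT]) for each sliding window.

    The window and step defaults are illustrative assumptions, not values
    taken from the abstract.
    """
    sequence = sequence.upper()
    vectors = []
    for start in range(0, len(sequence) - window + 1, step):
        counts = Counter(sequence[start:start + window])
        # Assumption: ambiguous bases (e.g. N) are excluded from the total.
        total = sum(counts[n] for n in NUCLEOTIDES)
        if total == 0:
            continue  # window contains no unambiguous bases
        vectors.append((start, [counts[n] / total for n in NUCLEOTIDES]))
    return vectors


if __name__ == "__main__":
    for start, freqs in window_frequency_vectors("AACCGGTT", window=4, step=2):
        print(start, freqs)
```

Each vector could then serve as one data point for an unsupervised clustering step that separates windows into the three classes described above.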