Use of Short Amino Acid Motifs in the Computational Analysis of Protein Diversity and Function
Author: Dussaq, Alex M.
Advisor: Grzymski, Joseph J.
The explosion of whole-genome and environmental sequence data affords us the opportunity to explore protein diversity and protein function. This is particularly exciting given the nascent field of synthetic biology. A comprehensive computational analysis of extant proteins is needed in order to define the limitations on protein structure and diversity from a bioengineering perspective. This paper focuses on defining an upper limit for protein diversity using computational approaches derived from linguistic analyses. These methods are used to predict the upper limit on the number of unique proteins and the number of highly conserved motifs. Motifs deemed highly conserved will more than likely represent important structural components of basic proteins. Results were gathered from two large data sets: all microbial genome sequences currently available from NCBI and the Global Ocean Survey data set. There were 6.6 million unique proteins at 95% amino acid identity. The majority of unique motifs in these data sets were found only once. The motifs deemed highly conserved within lifestyle groupings of organisms and within individual organisms were analyzed for function based on a conserved domain search. A relationship was observed between pathogenicity and genes and proteins related to cell motility and secretion. These motifs represent potential new drug targets or areas of future experimentation.
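To make the idea of counting short amino acid motifs concrete, the following is a minimal sketch, not the method used in this work. It assumes a motif is any overlapping window of fixed length k in a protein sequence, much as word n-grams are counted in text. The motif length, the function names `count_motifs` and `singleton_fraction`, and the toy sequences are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_motifs(sequences, k=3):
    """Count every overlapping length-k window (motif) across all sequences.

    Assumption: a "motif" here is simply a k-mer; the abstract does not
    specify the motif length or the exact counting scheme.
    """
    counts = Counter()
    for seq in sequences:
        # Slide a window of width k along the sequence, one residue at a time.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def singleton_fraction(counts):
    """Return the fraction of distinct motifs that were observed exactly once."""
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


# Toy input (hypothetical sequences, not real data).
toy_proteins = ["MKVLA", "MKVLG", "AKVLA"]
counts = count_motifs(toy_proteins, k=3)

# Motifs shared by many sequences are candidates for "highly conserved".
most_common_motif, occurrences = counts.most_common(1)[0]
print(most_common_motif, occurrences)
print(singleton_fraction(counts))
```

In this toy input the motif KVL appears in all three sequences, while two of the five distinct motifs appear only once. A real analysis would read sequences from FASTA files and would likely need a more memory-efficient counting structure at the scale described here.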