Ming-Ying Leung (Management Science and Statistics, U. Texas at San Antonio, USA)

Title: Prediction of replication origins in DNA viruses

In developing effective strategies to control diseases caused by pathogenic viruses, the ability to manipulate DNA replication in their genomes provides possible ways to alter viral growth. Replication origins in DNA are considered major sites for regulating genome replication. With the accumulation of DNA sequence data in vast computer databases readily accessible via the Internet, computer-based sequence analysis tools can be developed to help locate replication origins in viral genomes.

Many studies have observed that clusters of close repeats and inversions are characteristic patterns in the nucleotide sequences at the replication origins of a number of double-stranded DNA (dsDNA) viruses. In a first attempt to predict replication origins using these sequence patterns, we investigated a special kind of inversion, the palindrome. Palindromes are symmetrical words of DNA: they read exactly the same as their reverse complementary sequences.
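As an illustration of the definition above, the following minimal Python sketch checks whether a DNA word equals its reverse complement and scans a sequence for maximal palindromes. The function names, the `min_half_length` threshold, and the restriction to uppercase A, C, G, T are assumptions made for this example; they are not taken from the work described here.

```python
# Illustrative sketch only: names and the default threshold are assumptions.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complementary sequence of an uppercase ACGT string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_palindrome(seq):
    """A DNA palindrome has even length and equals its reverse complement."""
    return len(seq) % 2 == 0 and seq == reverse_complement(seq)


def find_palindromes(seq, min_half_length=2):
    """Return (center, half_length) for each maximal palindrome in seq.

    The center is the index of the first base of the right half, so the
    palindrome occupies seq[center - half_length : center + half_length].
    """
    hits = []
    n = len(seq)
    for center in range(1, n):
        half = 0
        # Extend outward while the flanking bases are complementary.
        while (center - half - 1 >= 0 and center + half < n
               and COMPLEMENT[seq[center - half - 1]] == seq[center + half]):
            half += 1
        if half >= min_half_length:
            hits.append((center, half))
    return hits
```

For instance, `GAATTC` satisfies `is_palindrome`, since reversing it and complementing each base gives back the same word.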

Modeling DNA as a sequence of independent letters randomly sampled from the nucleotide alphabet {A, C, G, T} and using Stein's method, we derive a Poisson process limit for the occurrences of palindromes on DNA. This asymptotic result is proved by obtaining an upper bound for the Wasserstein distance between the palindrome process and a Poisson process, and demonstrating that the bound goes to zero under suitable conditions as the sequence length increases. With the Poisson process limit as a mathematical justification, one may assume that the palindrome positions on a DNA sequence, rescaled by the sequence length, behave like random points sampled uniformly from the unit interval. Scan statistics can then be employed to locate significant palindrome clusters. We apply this approach to a set of herpesvirus genomes, identify the regions harboring significant palindrome clusters, and compare them to the known locations of replication origins.
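The scan-statistic idea can be sketched as follows, under stated assumptions. Palindrome positions are taken to be already rescaled to the unit interval, the scan statistic is the largest number of points in any window of a fixed width, and significance is estimated by Monte Carlo simulation of uniform points. The simulation is a stand-in chosen for simplicity in this example, not necessarily the procedure used in the work described here, and the function names, window width, and simulation count are hypothetical.

```python
import random


def max_window_count(points, width):
    """Largest number of points falling in any window of the given width."""
    pts = sorted(points)
    best, left = 0, 0
    for right in range(len(pts)):
        # Shrink the window from the left until it spans at most `width`.
        while pts[right] - pts[left] > width:
            left += 1
        best = max(best, right - left + 1)
    return best


def scan_p_value(points, width, n_sim=2000, seed=0):
    """Monte Carlo p-value for the observed scan statistic.

    Null model (assumption): the same number of points drawn independently
    and uniformly from the unit interval.
    """
    rng = random.Random(seed)
    observed = max_window_count(points, width)
    n = len(points)
    exceed = 0
    for _ in range(n_sim):
        simulated = [rng.random() for _ in range(n)]
        if max_window_count(simulated, width) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (exceed + 1) / (n_sim + 1)
```

A window whose count yields a small estimated p-value would be flagged as a candidate cluster; in practice the window width and the significance threshold would need to be chosen for the genome at hand.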

When tested on a set of herpesviruses, the accuracy of the prediction based simply on clusters of palindromes is not entirely satisfactory. Closer examination of the sequence data reveals that the accuracy can be improved by taking palindrome length into consideration and by including other repeats and inversions in the prediction scheme. However, probabilistic models for these sequence patterns have yet to be developed.