Tools for enabling the navigation of multiple sequence alignments in search of regions satisfying a given set of constraints

Multiple sequence alignments are powerful tools for inferring the evolution and function of genomic sequences. Even pairwise comparisons are very useful, especially for locating and, sometimes, characterizing genes, but they lack the power to elucidate shorter or polymorphic functional sites. For a conservation pattern to achieve statistical significance in a pairwise comparison, one needs to observe it over many (generally dozens of) bases of DNA, which is too long for many binding or structural motifs. However, even a short interval of relatively weak conservation can become significant if it is present in many (properly) aligned sequences.

In addition to their use in the detection of subtle motifs, multiple alignments can provide a wealth of information about the general structure of genomic regions, the relative identity or divergence of sequences or groups of sequences, and other developmental information. To facilitate this kind of sequence analysis, we have developed a prototype mechanism for the specification and location of intervals of interest, briefly described here. This method needs considerable further refinement. It needs a firmer foundation in formal language theory, leading to less ad hoc descriptions of regions, and it needs intuitive visual interfaces that insulate the biologist user from the details of the language syntax. Both of these issues are the subject of ongoing projects.
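As a rough illustration of what locating intervals that satisfy a set of constraints can mean in practice, the sketch below scans the columns of an alignment for runs that meet two simple constraints: a minimum per-column identity and a minimum length. It is not the prototype described above. The function names, the constraint types and the gap handling are all assumptions made for the example.

```python
from collections import Counter


def column_identity(column):
    """Fraction of non-gap residues in a column that match the most common one.

    Gap-only columns score 0.0. Ignoring gaps is an assumption of this sketch.
    """
    residues = [c for c in column.upper() if c != "-"]
    if not residues:
        return 0.0
    return Counter(residues).most_common(1)[0][1] / len(residues)


def find_intervals(alignment, min_identity=0.9, min_length=5):
    """Return half-open (start, end) column intervals in which every column
    has identity >= min_identity and the run spans at least min_length columns.

    `alignment` is a list of equal-length strings, one per aligned sequence.
    """
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    intervals = []
    start = None  # first column of the run currently being extended
    for j in range(n_cols):
        column = "".join(row[j] for row in alignment)
        ok = column_identity(column) >= min_identity
        if ok and start is None:
            start = j
        elif not ok and start is not None:
            if j - start >= min_length:
                intervals.append((start, j))
            start = None
    # Close a run that extends to the last column.
    if start is not None and n_cols - start >= min_length:
        intervals.append((start, n_cols))
    return intervals


toy = ["ACGTACGTAA", "ACGTACCTAA", "ACGTACGTTA"]
print(find_intervals(toy, min_identity=1.0, min_length=4))
```

A constraint language of the kind envisaged above would presumably combine many such predicates (per-group identity, gap content, length bounds); this example hard-codes just two of them.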


Algorithms for studying the regulation of duplicated and co-expressed genes

Segmental and chromosomal duplications are among the major forces driving the evolution and differentiation of species. Following a duplication event, genes can be deactivated (usually recognized in DNA as "pseudogenes"), retain their function, or develop a new one. Our aim is to develop computational techniques that characterize the changes in the putative regulatory modules of these genes and help elucidate the role of regulation in the divergence or retention of gene function.


Figure: After the segmental duplication event, genes can retain their function, but most either get deactivated, or adapt to a new function.

Recent advances in laboratory technology, primarily the now widespread use of microarray chips, have greatly facilitated studies of gene co-expression. Our algorithms and software should be directly applicable to investigating the structure of the regulatory elements of co-expressed genes, too. As a complementary function, the software can be used for whole-genome scans to discover other regions with similar regulatory patterns. Given the size of an average genome, it goes without saying that the algorithms performing these scans must be extremely fast.

One specific form of gene duplication is the retro-insertion of mRNA-based templates back into the genome. In the instances where the inserted gene does not become deactivated, the insertion must have been followed by a successful recruitment of new regulatory elements (or exploitation of existing ones). We plan to use our software to help characterize the regulatory modules of these retroposed genes.


Methods for location and characterization of differential phylogenetic footprints in mammalian Hox gene clusters

In comparative genomics, DNA regions of good sequence conservation are called phylogenetic footprints. When the conservation remains good among related species, but differs between groups, these intervals are called differential phylogenetic footprints. We have developed techniques for the location of such footprints, with respect to the background sequence conservation. In addition to the location of candidate regions, our software can classify and rank them using statistical techniques.
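To make the definition concrete, the following sketch flags alignment columns that are conserved within every group of species yet carry different consensus bases in different groups, and then reports windows in which such columns are dense. It is only a toy reading of the definition above: the function names, thresholds and data layout are assumptions, and a simple fraction-of-flagged-columns rule stands in for the statistical classification and ranking that our software performs.

```python
from collections import Counter


def consensus(residues):
    """Most common non-gap residue and its frequency among non-gap residues."""
    kept = [c for c in residues if c != "-"]
    if not kept:
        return None, 0.0
    base, count = Counter(kept).most_common(1)[0]
    return base, count / len(kept)


def differential_columns(alignment, groups, min_within=1.0):
    """Flag columns conserved within each group whose group consensus differs.

    `alignment` maps sequence name -> aligned string; `groups` maps
    group name -> list of sequence names (both hypothetical layouts).
    """
    n_cols = len(next(iter(alignment.values())))
    flags = []
    for j in range(n_cols):
        bases = set()
        conserved = True
        for members in groups.values():
            base, freq = consensus([alignment[name][j] for name in members])
            if base is None or freq < min_within:
                conserved = False
                break
            bases.add(base)
        # Differential: every group is internally conserved, but the groups
        # do not all agree on the same base.
        flags.append(conserved and len(bases) > 1)
    return flags


def differential_windows(flags, window=4, min_fraction=0.5):
    """Start columns of windows in which at least min_fraction of the
    columns are flagged as differential."""
    hits = []
    for start in range(len(flags) - window + 1):
        if sum(flags[start:start + window]) / window >= min_fraction:
            hits.append(start)
    return hits


toy = {
    "human": "ACGTAAGG", "baboon": "ACGTAAGG",
    "mouse": "ACGTCCGG", "rat": "ACGTCCGG",
}
toy_groups = {"primates": ["human", "baboon"], "rodents": ["mouse", "rat"]}
print(differential_windows(differential_columns(toy, toy_groups)))
```

A fuller treatment would score each window against the background conservation of the surrounding alignment rather than using fixed cut-offs.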

One early application of these methods is our analysis of mammalian Hox (homeobox) gene clusters. Hox genes code for transcription factors (proteins involved in the regulation of genes, enabling the start of the transcription process), and they are expressed during the early embryonic development of an organism. Their role is in the determination and segmentation of the anterior-posterior axis of the developing embryo. In vertebrates, Hox clusters have gone through several rounds of segmental duplication (two in mammals, leading to four paralogous genomic segments), and these duplications are considered crucial for the vast variety of vertebrate body plans.

As reflected in the global similarities of body plans, Hox clusters exhibit considerable sequence conservation (sometimes featuring large segments of near-identity between species as far apart as humans and sharks). The variations that do exist, however, make differential analysis particularly interesting. We have obtained a large number of sequences from all four Hox clusters of several mammals (including human, baboon, cow, pig, mouse and rat) and constructed long alignments (several hundred thousand bases) of these clusters. Preliminary applications of our software have identified several high-scoring groups of differential phylogenetic footprints, and we are currently examining their significance and possible role.


Whole-genome scans for repeated motifs

Following up on some observations on the micro-repetitive structure of human genomic sequence (outside the regions recognized as repetitive by tools such as RepeatMasker), we have recently undertaken a whole-genome analysis of short repetitive motifs and their possible origins. We are currently working on software capable of performing these whole-genome scans, combining classical string matching and graph algorithms with parallel and distributed computing.
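The simplest string-matching building block for such a scan is an index of fixed-length words (k-mers) and the positions at which they occur. The sketch below shows that idea on a single in-memory sequence. It is an illustration only, not the software under development: the function name and parameters are hypothetical, and the graph-based and parallel or distributed components mentioned above are not represented.

```python
from collections import defaultdict


def repeated_kmers(sequence, k, min_count=2):
    """Map each k-mer occurring at least min_count times to its start positions.

    Overlapping occurrences are counted. Words containing 'N' (unknown
    bases) are skipped, which is an assumption of this sketch.
    """
    seq = sequence.upper()
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" in kmer:
            continue
        positions[kmer].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) >= min_count}


print(repeated_kmers("ACGTACGTTTACG", k=4))
```

Holding every position in memory does not scale to a whole genome as written; a real scan would need to partition the sequence or the k-mer space across processes, which is where the parallel and distributed aspects come in.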