Genome Rearrangements

In the mathematical theory of genome rearrangements, genomes are represented by lists of elements (genes) and the rearrangements are operations that modify these lists. Typical examples of such events are reversals and transpositions.

I have been interested in these problems since my first contact with them on a Combinatorial Pattern Matching (CPM) conference, in Asilomar, California, USA, in 1994. From this date up to around 2007, several students of mine, including Maria Emília Machado Telles Walter, Zanoni Dias, Cleber Valgas Gomes Mira, Vinicius Fortuna, and André Atanasio Maranhão Almeida worked on theoretical aspects of the problem. They helped me shape an interesting idea, which is to apply the mathematical theory of permutations to rearragement problems, leading to an embrionary form of what we called algebraic rearrangement theory.

After studying the theory of genome rearrangements for some time, I felt the urge to apply some of the things I learned to actual genomes. My Masters students Patrícia Pilisson Côgo and Karina Zupo de Oliveira applied the theory to Vibrio genomes, in an informal collaboration with Fabiano Thompson, then at the University of Campinas, now at the Federal University of Rio de Janeiro, who studies these bacteria.

Côgo's contribution included working out several practical aspects of transforming real genomes into permutations, such as identifying homologous regions, treating regions without counterpart in the other genome, and treating duplications. We used the DCJ distance to construct a phylogenetic tree of the 6 Vibrio genomes completely sequenced up to 2005. Côgo implemented her tests in a series of perl scripts. Oliveira took them a step further, rewriting the entire pipeline in Java, using a modern IDE and version control. In the process, she noticed that the way we were constructing protein families to assess homology was very time consuming. She suggested instead that we should use Protein Clusters, a protein family database and tools available from NCBI. This reduced the time to add a new species from a month to less than a day. As a result, she could pairwise compare ten Vibrio species plus an outgroup (E. coli).

Côgo and Oliveira used DCJ as a distance measure after reducing the genomes to equal content without repeated genes. Another student of mine, Pedro Feijão, a doctoral student, started to experiment with something we called SCJ, a single-cut counterpart of DCJ. It is a measure for which several hard problems turn out to have tractable solutions, including small parsimony. He also wrote lots of Java code to implement SCJ.

In 2011, Pedro Feijão discovered a way of extending the algebraic theory to linear genomes, yielding a distance very close to, but not exactly the same as, DCJ. Like DCJ, this algebraic distance is easy to compute between two genomes, but medians, which are hard for DCJ, are easy for the algebraic distance, if one is willing to dive into orthogonal matrices (see below). Algebraic rearrangement theory may be a tough subject for many, but we love it. Feijão tried hard to make it palatable to a wider audience in his PhD thesis, and I think he did a wonderful job. We also investigated this same distance in the more general settings of non-genomic permutations and of general matrices, with my students João Paulo Pereira Zanetti and Priscila Biller.

Biller also run many experiments with SCJ, to try and find out its suitability to reconstructing phylogenies, with good results. She improved on Feijão's initial Java code, and wrote a manual to help others use the code.

In 2013, Sophia Yancopoulos kindly invited to me co-author a contribution to the MAGE workshop in honor of David Sankoff. The text explores the DCJ and algebraic distances, trying to uncover the reasons for their similarities and differences.

The algebraic distance is an alternative to DCJ, since both are easy to compute and model many of the rearrangement events observed in genome evolution. Our interest turned to finding genomes that are close to three given genomes, an important step in phylogeny reconstruction. Besides the traditional medians, which minimize the sum of the algebraic distances, we are experimenting with minimax genomes, which minimize the maximum of the distances.

The algebraic distance has also an interesting relationship with matrix theory. Genomes can be seen as matchings of gene extremities. We can code such a matching as a matrix as follows. Consider the adjacency matrix of this matching, but add 1's in the diagonal for the unmatched vertices. Voilá! You have a genome matrix. It turns out that the algebraic distance between two genomes is exactly half of the rank distance between the corresponding matrices.

Looking at genomes as matrices has an added benefit: one can solve the median problem in polynomial time. However, the solution is not always "genomic".

Here is some of our work on the subject, in reverse chronological order:



Joao Meidanis
Last modified: Sun May 17 16:24:25 -03 2020 by JM