Genome Rearrangements

In the mathematical theory of genome rearrangements, genomes are represented by lists of elements (genes) and the rearrangements are operations that modify these lists. Typical examples of such events are reversals and transpositions.

I have been interested in these problems since my first contact with them on a DIMACS Workshop in 1994. From this date up to around 2007, several students of mine, including Maria Emília Machado Telles Walter, Zanoni Dias, Cleber Valgas Gomes Mira, Vinicius Fortuna, and André Atanasio Maranhão Almeida worked theoretical aspects of the problem.

After studying the theory of genome rearrangements for some time, I felt the urge to apply some of the things I learned to actual genomes. The first step in this direction was to ask my Masters student Patrícia Pilisson Côgo to use the DCJ distance to construct a phylogenetic tree of all Vibrio genomes sequenced to date. This was 2005. There were 6 completely sequenced vibios in GenBank. Why vibrios? Because over the years, I established an informal collaboration with Fabiano Thompson, now at the Federal University of Rio de Janeiro, and he studies vibrios.

Côgo's contribution included working out several practical aspects of transforming real genomes into permutations, such as identifying homologous regions, treating regions without counterpart in the other genome, and treating duplications. A poster in the X-meeting, a poster in ISMB 2006, a tech report, and her dissertation were produced.

Then came another Masters student, Karina Zupo de Oliveira. Côgo implemented her tests in a series of perl scripts. Oliveira took them a step further, rewriting the entire pipeline in Java, using a modern IDE and version control. In the process, she noticed that the way we were constructing protein families to assess homology was very time consuming. She suggested instead that we should use Protein Clusters, a protein family database and tools available from NCBI. This reduced the time to add a new species from a month to less than a day. As a result, she could pairwise compare ten Vibrio species plus an outgroup (E. coli). Her dissertation describes the work done.

Côgo and Oliveira used DCJ as a distance measure after reducing the genomes to equal content without repeated genes. Another student of mine, Pedro Feijão, a doctoral student, started to experiment with something we called SCJ, a single-cut counterpart of DCJ. It can also be viewed as a way of measuring the number of breakpoints between two genomes. It is a measure for which several hard problems turn out to have tractable solutions, including small parsimony. He also wrote lots of Java code to implement SCJ.

Recently, Pedro Feijão discovered a way of extending the algebraic theory to linear genomes, yielding a distance very close to, but not exactly the same as, DCJ. Like DCJ, this algebraic distance is easy to compute between two genomes, but medians may be hard. Algebraic rearrangement theory may be a tough subject for many, but we love it. Feijão tried hard to make it palatable to a wider audience in his PhD thesis, and I think he did a wonderful job. We are also investigating this same distance in the more general settings of non-genomic permutations and of general matrices, with my new students João Paulo Pereira Zanetti and Priscila Biller.

Biller also run many experiments with SCJ, to try and find out its suitability to reconstructing phylogenies, with good results. She improved on Feijão's initial Java code, and wrote a manual to help others use the code.

Here is some of our work on the subject. References are in reverse chronological order:


Joao Meidanis
Last modified: Tue Jan 8 06:42:04 BRST 2013 by JM