GenAlyzer: Visualizing Sequence Similarities between Entire Genomes

Jomuna V. Choudhuri
Chris Schleiermacher

August 16, 2004

GenAlyzer is a software tool designed for the interactive visualisation of sequence matches between DNA or protein sequences. It provides visualisations at different levels of granularity, from complete overviews via zoomed regions to alignments of particular matching substrings. GenAlyzer can efficiently handle very large datasets, displaying tens of thousands of matches between sequences of tens of millions of bases [2].

General Features of GenAlyzer

GenAlyzer is an improved version of the repeat-finding program REPuter. It was developed for the UNIX platform and currently runs on Linux/Intel, Solaris/SUN-Sparc and Mac OS X. The GenAlyzer manual [1] gives a detailed description of its implementation and all interactive features. This page presents the most important visualisation components of GenAlyzer, including an application example. (A printable PDF version of this HTML page is available, as is the technical report (3 Mb) containing the complete GenAlyzer manual.)

The Main Window

The GenAlyzer Main Window consists of a menu bar with the entries ``File'', ``Edit'', ``View'' and ``Help'', a row of tool buttons (the set of square buttons in Figure 1), three Quick Start buttons and the program status output at the bottom of the window.

Figure 1: The GenAlyzer Main Window....
comparison of two genomes as a matchgraph

Preprocessing the Data

The ``Preprocess Step Dialog'' (Figure 2) opens when the button Generate new index is clicked in the main GenAlyzer window. In this first step of repeat detection, GenAlyzer creates an index for the set of input sequences specified in the Database Files panel. Here, the push buttons Add, Remove and Clear are used to edit the list of input sequence files. GenAlyzer supports the following input file formats: FASTA (or multiple FASTA), EMBL/SWISSPROT, GENBANK, and raw format. The Sequence Type option menu defines the alphabet of the input sequence (DNA, protein, or a user-specific symbol map). Under Project output options, the name and location of the resulting index file must be specified. The index name gets the default extension ``.prj''.

Figure 2: The Preprocess Step Dialog
Preprocessing the data by index construction

Performing Matching Tasks

In this step, a Query sequence chosen by the user is matched against the Database sequence, using the index constructed during the previous preprocessing step. Again from GenAlyzer's main window, the button Run matching task launches the ``Matching Step Dialog'' (see Figure 3). The match files get the default extension ``.match''.

Figure 3: The Matching Tasks Dialog

Currently, GenAlyzer solves two different matching problems:

Substring Matching

Complete Matching

For both tasks, the matches can be direct (forward) or reverse complemented (palindromic). The matches can also be approximate: degenerate substrings containing up to a maximal number of errors, which may be mismatches or insertions and deletions (indels), are supported.
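The match types above can be illustrated with a minimal Python sketch. It is not GenAlyzer's internal code; the function names are hypothetical. It shows what "reverse complemented" means for a DNA string and how errors can be counted either as mismatches only (Hamming distance) or as mismatches plus indels (unit-cost edit distance).

```python
# Illustrative sketch only; function names are hypothetical and this is
# not how GenAlyzer itself computes matches.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def hamming_distance(a: str, b: str) -> int:
    """Count mismatches between two strings of equal length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance: mismatches, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # match or mismatch
        prev = curr
    return prev[-1]


print(hamming_distance("ACGTTGCA", "ACGATGCA"))  # 1 (one mismatch)
print(edit_distance("ACGTTGCA", "ACGTGCA"))      # 1 (one deletion)
print(reverse_complement("AAGCTT"))              # AAGCTT (its own reverse complement)
```

A palindromic match in this sense pairs a substring with the reverse complement of another substring, so the same distance functions can be applied after calling reverse_complement on one side.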

A third matching option, the X-Drop approach, is under development. Its -exdrop parameter selects an alternative strategy for seed extension, whose purpose is to find the highest-scoring alignment once matches, mismatches and indels have been assigned different score values.
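The general X-drop idea can be sketched as follows. This is a simplified, ungapped variant (mismatches only, no indels) written for illustration; the function name, the default scores of +1 and -1, and the argument layout are assumptions and do not describe GenAlyzer's actual implementation. The extension keeps growing from the seed and stops once the running score falls more than xdrop below the best score seen so far.

```python
def xdrop_extend_right(a, b, start_a, start_b, xdrop, match=1, mismatch=-1):
    """Extend a seed to the right without gaps (hypothetical helper).

    Starts comparing a[start_a:] with b[start_b:] and returns
    (best_score, best_length) of the highest-scoring extension found
    before the score dropped more than `xdrop` below the best score.
    """
    score = 0
    best_score = 0
    best_length = 0
    i = 0
    while start_a + i < len(a) and start_b + i < len(b):
        score += match if a[start_a + i] == b[start_b + i] else mismatch
        i += 1
        if score > best_score:
            best_score, best_length = score, i
        elif best_score - score > xdrop:
            break  # score fell too far below the best: give up
    return best_score, best_length


# The extension passes over a single mismatch ...
print(xdrop_extend_right("AAAACAAAA", "AAAAGAAAA", 0, 0, xdrop=2))        # (7, 9)
# ... but stops after a run of mismatches.
print(xdrop_extend_right("ACGTACGTTTTT", "ACGTACGAAAAA", 0, 0, xdrop=2))  # (7, 7)
```

A gapped X-drop extension, which also scores indels as the text describes, applies the same stopping rule inside a dynamic-programming alignment rather than along a single diagonal.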

Repeat x Match

Visualising the Matching Output

After the ``Matching Task Step'', in which the desired match types have been calculated, the Inspector window can be launched. This interactive visualisation component of GenAlyzer uses an easy-to-use graphical representation of repeats, or matches, and their sizes and positions.

A matching output calculated by GenAlyzer consists of the following parts:

Size  Subseq  Position 1  Type  Position 2  Subseq  Error  E-value    Score  Perc.Ident
30    0       1158853     D     1144098     0       2      1.64e-061  342    98.85
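A record of this form can be read with a few lines of Python. The sketch below assumes that the ten whitespace-separated values appear in exactly the order of the column headings above; the class and field names are hypothetical, and the meaning of the type code (for example, that D denotes a direct match) is an assumption based on the match types described earlier.

```python
from dataclasses import dataclass


@dataclass
class MatchRecord:
    """One line of match output; field names are illustrative."""
    size: int
    subseq_1: int
    position_1: int
    match_type: str
    position_2: int
    subseq_2: int
    error: int
    e_value: float
    score: int
    perc_ident: float


def parse_match_line(line: str) -> MatchRecord:
    """Split one whitespace-separated match line into typed fields."""
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}: {line!r}")
    return MatchRecord(
        size=int(fields[0]),
        subseq_1=int(fields[1]),
        position_1=int(fields[2]),
        match_type=fields[3],
        position_2=int(fields[4]),
        subseq_2=int(fields[5]),
        error=int(fields[6]),
        e_value=float(fields[7]),
        score=int(fields[8]),
        perc_ident=float(fields[9]),
    )


record = parse_match_line("30 0 1158853 D 1144098 0 2 1.64e-061 342 98.85")
print(record.position_1, record.match_type, record.error)  # 1158853 D 2
```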

The Inspector window comes up showing, in the match graph, two bold lines which represent the input sequence(s) supplied by the user (Database and Query, shown in Figure 4).

Figure 4: The Inspector window

On launching the Inspector window, the user gets an impression of the overall number of matches and their pattern of distribution. The main subparts of GenAlyzer are described below:

The Match Data Dialog

After selecting a match in the match data browser, the user can launch the ``Match Data Dialog'' by clicking the View Match button (Figure 6). Alternatively, the same dialog can display all the computed matches (View all Matches). A number of switches allow the user to customize the amount of detail shown in the output, such as sorting options, output width, description width, alignment and e-value.

Figure 6: The Match Data Dialog, showing the alignment of a chosen match.

Annotating the Visualisation

An annotation file can be created manually or by shell scripts in order to annotate any kind of sequence feature in the corresponding visualisation. An annotation file looks like this:

<=>0030443552#FF0000#1.10 Intr
=006222762454#000000#12.02 Term
<=0143274431#0000FF#1.05 Prom
*1296769678#FF00FF#1.03 Point

As shown in the example above, the annotation must be a text file of the following format:

SymbolStrandAnno_SetPos_1Pos_2#Color#Comment

Symbol:
This ASCII text symbol is translated into a graphical annotation marker. See the manual for details on the annotation symbol usage.

Pos_1 and Pos_2:
These parameters denote the starting and ending position of the annotation symbol. Note that symbols with no horizontal extension require only one position value. Nevertheless, a second position must be supplied; it can have an arbitrary value.

Color
The Color entry can be a hexadecimal value of the form #RRGGBB: a hash character followed by three two-digit hexadecimal values for red, green and blue. For example, #FF0000 means red and #00FF00 means green. This notation is widely used in HTML documents. Alternatively, Color can be one of the 752 color identifiers specified under most UNIX systems.

Strand and Anno_Set:
Each strand of the match graph can have its own set of annotations, consisting of up to 10 individual rows. The Strand parameter specifies if an annotation symbol is drawn parallel to the upper or to the lower strand. The Anno_Set parameter specifies the row the annotation symbol should be in.
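The #RRGGBB notation described under Color can be decoded as in the short Python sketch below. The function name is hypothetical and the code is not taken from GenAlyzer; it only shows how the three two-digit hexadecimal values map to red, green and blue intensities in the range 0 to 255. Named UNIX color identifiers are not handled here.

```python
import re
from typing import Tuple

# "#" followed by exactly three two-digit hexadecimal groups
_HEX_COLOR = re.compile(r"#([0-9A-Fa-f]{2})([0-9A-Fa-f]{2})([0-9A-Fa-f]{2})")


def parse_hex_color(text: str) -> Tuple[int, int, int]:
    """Convert '#RRGGBB' into a (red, green, blue) tuple of integers."""
    match = _HEX_COLOR.fullmatch(text)
    if match is None:
        raise ValueError(f"not a #RRGGBB color: {text!r}")
    red, green, blue = (int(group, 16) for group in match.groups())
    return red, green, blue


print(parse_hex_color("#FF0000"))  # (255, 0, 0), red
print(parse_hex_color("#00FF00"))  # (0, 255, 0), green
print(parse_hex_color("#FF00FF"))  # (255, 0, 255)
```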

The resulting annotation file is then loaded via the ``Edit'' menu in the Inspector window, as shown in Figure 4. Similar to the repeat information displayed in the ``Match data browser'' after clicking a repeat on the strand symbol, the data associated with an annotation symbol can be displayed in the ``Annotation data browser'' (arrow ``c'' in Figure 5).

Example Application

After the complete sequence of human chromosome 22 was published, it was analyzed with GenAlyzer for repetitive structures. A rather confusing pattern at the beginning of the sequence caught our attention. The corresponding subsequence was extracted and searched for repeats using a lower threshold (minimal length 50 bp, edit distance 2). Figure 7 shows an overview of this area, revealing an interesting net-like structure of direct and palindromic repeats.

The main sequence module is repeated four times and is composed of smaller repeated units that occur in direct and palindromic orientation to each other. This net-like pattern was compared to the Low Copy Repeats scheme described by T. Shaik et al.; these repeats are known to be responsible for large deletions that cause the genetic disorder DiGeorge/Velo-Cardio-Facial-Syndrome. In Figure 7, the four Low Copy Repeats involved in the disease are represented by the blocks A, B, C and D, overlaid on the repeat graph computed with GenAlyzer.

Figure 7: Net-like pattern of low copy repeats on human chromosome 22. The repetitive structure extends over a 3 Mb region on the chromosome, corresponding to the Typical Deleted Region responsible for the DiGeorge/Velo-Cardio-Facial-Syndrome.

Overall, analysing the repeat structure of chromosomes with GenAlyzer helps to identify such breakpoint regions from the location of Low Copy Repeats, without any experimental work. This is just one of the manifold applications of GenAlyzer. For other practical examples of biological problems, see Kurtz et al. [3] or the GenAlyzer manual.

Availability

GenAlyzer is available together with Vmatch. The license agreement can be found on the Vmatch website.

Bibliography

1
J.V. Choudhuri and C. Schleiermacher.
Genalyzer: an interactive visualisation tool for large-scale sequence matching - biological applications and user manual.
Technical report, Bielefeld University, 2003.

2
J.V. Choudhuri, C. Schleiermacher, S. Kurtz, and R. Giegerich.
Genalyzer: interactive visualization of sequence similarities between entire genomes.
Bioinformatics, 20:1964-1965, 2004.

3
S. Kurtz, J.V. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, and R. Giegerich.
REPuter: the Manifold Applications of Repeat Analysis on a Genomic Scale.
Nucleic Acids Research, 29(22):4633-4642, 2001.