Jomuna V. Choudhuri
Chris Schleiermacher
August 16, 2004
GenAlyzer is a software tool designed for the interactive visualisation of sequence matches between DNA or protein sequences. It provides visualisations at different levels of granularity, from complete overviews via zoomed regions to alignments of particular matching substrings. GenAlyzer can efficiently handle very large datasets, displaying tens of thousands of matches between sequences of tens of millions of bases [2].
GenAlyzer is an improved version of the repeat-finding program REPuter. It is developed for the UNIX platform and currently runs on Linux/Intel, Solaris/SUN-Sparc and Mac OS X. The GenAlyzer manual [1] gives a detailed description of its implementation and all interactive features. The most important visualisation components of GenAlyzer are described below, together with an application example. (A printable version of this HTML page is available in PDF, and the technical report (3 Mb) containing the complete GenAlyzer manual can be downloaded.)
The GenAlyzer Main Window consists of a menu bar with the entries ``File'', ``Edit'', ``View'' and ``Help'', a row of tool buttons (the set of square buttons in Figure 1), three Quick Start buttons and the program status output at the bottom of the window.
The ``Preprocess Step Dialog'' (Figure 2) comes up when the button Generate new index is clicked in the main GenAlyzer window. In this first step of repeat detection, GenAlyzer creates an index for the given set of input sequences, specified in the Database Files panel. Here, the push buttons Add, Remove and Clear allow the manipulation of the list of input sequence files. GenAlyzer supports the following formats for the input files: FASTA (or multiple FASTA), EMBL/SWISSPROT, GENBANK, and raw format. In the Sequence Type option menu, the alphabet of the input sequence is defined (DNA, protein, or a user-specific symbol map). Under Project output options, the name and location of the resulting index file must be specified. The index name gets the default extension ``.prj''.
In this step, a Query sequence chosen by the user is matched against the Database sequence based on its index constructed during the previous preprocessing step. Again, from GenAlyzer's main window, the button Run matching task launches the ``Matching Step Dialog'' (see Figure 3). The match files get the default extension ``.match''.
Currently, GenAlyzer solves two different matching problems:
For both tasks, the matches can be direct (forward) or reverse complemented (palindromic). The matches can also be approximate: degenerate substrings are supported, with a maximal number of errors that are counted either as mismatches only or as mismatches, insertions and deletions (indels).
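The match types and error models described above can be illustrated with a minimal Python sketch. The function names are hypothetical and this is not GenAlyzer's own code; it only shows how a reverse complemented instance is formed, and how a mismatch-only error count differs from an error count that also allows indels.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[c] for c in reversed(seq))


def hamming_distance(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("mismatch counting needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Minimal number of mismatches, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # match or mismatch
        prev = curr
    return prev[-1]
```

In this sketch, a palindromic match between two substrings u and w would correspond to a direct match between u and reverse_complement(w).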
A third matching option, the X-drop approach, is under development. The -exdrop parameter represents an alternative strategy for seed extension: its purpose is to find the highest-scoring alignment when matches, mismatches and indels are assigned different score values.
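The idea behind X-drop seed extension can be sketched as follows. This is a deliberately simplified, ungapped variant (no indels) written for illustration only; the function name, the score values and the stopping rule as coded here are assumptions and do not describe how the -exdrop option is implemented.

```python
def xdrop_extend_right(s: str, t: str, i: int, j: int, xdrop: int,
                       match: int = 1, mismatch: int = -1):
    """Extend an ungapped alignment to the right, starting at s[i] and t[j].

    Returns (best_score, best_length). The extension stops as soon as the
    running score has fallen more than `xdrop` below the best score seen.
    """
    score = best = 0
    best_len = 0
    k = 0
    while i + k < len(s) and j + k < len(t):
        score += match if s[i + k] == t[j + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k      # new highest-scoring extension
        elif best - score > xdrop:
            break                          # score dropped too far: give up
    return best, best_len
```

A larger xdrop value lets the extension run across short stretches of mismatches, while xdrop = 0 stops at the first position that lowers the score.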
After the ``Matching Task Step'', in which the desired match types have been calculated, the Inspector window can be launched. This interactive visualisation component of GenAlyzer uses an easy-to-use graphical representation of repeats, or matches, and their sizes and positions.
A matching output calculated by GenAlyzer consists of the following parts:
| Size | Subseq | Position 1 | Type | Position 2 | Subseq | Error | E-value | Score | Perc.Ident |
| 30 | 0 | 1158853 | D | 1144098 | 0 | 2 | 1.64e-061 | 342 | 98.85 |
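A record with these ten fields can be read into a small data structure, for instance when match files are post-processed by a script. The following Python sketch assumes one match per line with the fields in the column order shown above, separated by whitespace or by the ``|'' characters used in the table; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Match:
    size: int          # Size
    subseq1: int       # Subseq of the first instance
    pos1: int          # Position 1
    kind: str          # Type, e.g. "D" in the example row
    pos2: int          # Position 2
    subseq2: int       # Subseq of the second instance
    error: int         # Error
    evalue: float      # E-value
    score: int         # Score
    identity: float    # Perc.Ident


def parse_match_line(line: str) -> Match:
    """Parse one match record; '|' separators are treated like whitespace."""
    f = line.replace("|", " ").split()
    if len(f) != 10:
        raise ValueError(f"expected 10 fields, got {len(f)}")
    return Match(int(f[0]), int(f[1]), int(f[2]), f[3], int(f[4]),
                 int(f[5]), int(f[6]), float(f[7]), int(f[8]), float(f[9]))
```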
The Inspector window comes up showing, in the match graph, two bold lines which represent the input sequence(s) supplied by the user (Database and Query, shown in Figure 4).
On launching the Inspector window, the user gets an impression of the overall number of matches and of the pattern of their distribution. The main parts of the Inspector window are described here:
The three buttons at the top of the Inspector window select the kind of repeats to display:
The button label contains the number of repeats available for each kind. Our example above lists 88 direct and 123 palindromic repeats. If one of the two categories is not available, the respective button is disabled.
The Project info bar shows the project name and the parameter settings for the visualised match task.
An additional feature of GenAlyzer is the overview graph. It is a duplicate of the repeat graph at a smaller scale, with the advantage that the overview of the whole repetitive, or match, structure remains visible while the user zooms in or out of the actual match graph below. The overview graph is enclosed by a flexible red rectangle whose borders are adapted according to the zoom factor.
In the match graph below the overview, the top line corresponds to the input Database sequence, and the bottom line to the Query sequence that has been matched against the indexed Database. The lines in between represent the matches, each joining the beginning of the first match instance and the beginning of the second instance.
As Database and Query can be different sequences, GenAlyzer supports loading the corresponding annotations from a defined file. The annotations can be represented as colored symbols, in several different lines above the Database sequence and below the Query input. See Figure 4 and section Annotating the visualisation for more details.
The color key associates a color with a certain range of match sizes. In Figure 4, for example, matches of 65 to 70 bp are displayed as yellow lines. The lengths of the shortest and the longest repeat are the start and end values of the color key scale, here 30 and 82 bp.
The slider below the graph in Figure 4 defines the minimal repeat length depicted in the graph. The bounds of the size slider are the lengths of the shortest and the longest repeat of the currently selected repeat kind.
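The behaviour of the size slider can be mimicked on a plain list of repeat lengths. The following Python sketch uses hypothetical function names and only illustrates the filtering rule described above, not the code of the Inspector window.

```python
def slider_bounds(lengths):
    """Lower and upper slider bound: shortest and longest repeat length."""
    if not lengths:
        raise ValueError("no repeats of the selected kind")
    return min(lengths), max(lengths)


def visible_repeats(lengths, min_length):
    """Repeat lengths that remain displayed for a given slider position."""
    return [n for n in lengths if n >= min_length]
```

With the slider at its lower bound all repeats of the selected kind are shown; moving it towards the upper bound leaves only the longest ones.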
The corresponding sequence information and the alignment of either a single match or all computed matches can be visualised or directly submitted to database searches, like FASTA or BLAST, for further investigation of biological significance and similarity.
As mentioned above, to examine a particular match or match-rich region, the user can zoom in or out on a region by left- or right-clicking the mouse, respectively, as shown in Figure 5.
As soon as the user zooms into a specific region of the repeat graph, the red rectangle in the overview shrinks so that it borders exactly the zoomed region, as can be seen in Figure 5 (arrow ``a'').
The match data browser shows the sequence information and the positions of the match selected on the strand symbol in the match graph (arrow ``b'' in Figure 5).
After selecting a match in the match data browser, the user can launch the ``Match Data Dialog'' by clicking the View Match button (Figure 6). Alternatively, the same dialog can display all the computed matches (View all Matches). A number of switches allow the user to customize the amount of detail shown in the output, such as sorting options, output width, description width, alignment and E-value.
An annotation file can be created manually or by shell scripts to annotate any kind of sequence feature in the corresponding visualisation. An annotation file looks like this:
| <=> | 0 | 0 | 3044 | 3552 | #FF0000 | #1.10 Intr |
| = | 0 | 0 | 62227 | 62454 | #000000 | #12.02 Term |
| <= | 0 | 1 | 4327 | 4431 | #0000FF | #1.05 Prom |
| * | 1 | 2 | 9676 | 9678 | #FF00FF | #1.03 Point |
As shown in the example above, the annotation must be a text file of the following format:
| Symbol | Strand | Anno_Set | Pos_1 | Pos_2 | #Color | #Comment |
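Since annotation files may be generated by scripts, the seven fields can be written and checked programmatically. The Python sketch below assumes that the fields are separated by whitespace within a line and that the color is given as a ``#RRGGBB'' value as in the example; the exact separator expected by GenAlyzer is not stated here, and the names Annotation, format_annotation and parse_annotation are hypothetical.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    symbol: str     # Symbol, e.g. "<=>", "=", "<=" or "*"
    strand: int     # Strand
    anno_set: int   # Anno_Set
    pos1: int       # Pos_1
    pos2: int       # Pos_2
    color: str      # #Color, e.g. "#FF0000"
    comment: str    # #Comment, may contain blanks


def format_annotation(a: Annotation) -> str:
    """Write one annotation as a single whitespace-separated line."""
    return (f"{a.symbol} {a.strand} {a.anno_set} "
            f"{a.pos1} {a.pos2} {a.color} {a.comment}")


def parse_annotation(line: str) -> Annotation:
    """Read one annotation line; the comment keeps its internal blanks."""
    parts = line.split(None, 6)   # at most 7 fields, comment is the rest
    if len(parts) != 7:
        raise ValueError("expected 7 fields")
    symbol, strand, anno_set, pos1, pos2, color, comment = parts
    if not color.startswith("#"):
        raise ValueError("color must start with '#'")
    return Annotation(symbol, int(strand), int(anno_set),
                      int(pos1), int(pos2), color, comment.strip())
```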
The resulting annotation file is then loaded via the ``Edit'' menu in the Inspector window, as shown in Figure 4. Similar to the repeat information displayed in the ``Match data browser'' after clicking a repeat on the strand symbol, the data associated with an annotation symbol can be displayed in the ``Annotation data browser'' (arrow ``c'' in Figure 5).
After the complete sequence of chromosome 22 was published, it was analyzed with GenAlyzer for repetitive structures. A rather confusing pattern at the beginning of the sequence caught our attention. The corresponding subsequence was extracted and searched for repeats using a lower threshold (minimal length 50 bp, edit distance 2). Figure 7 shows an overview of this area, revealing an interesting net-like structure of direct and palindromic repeats.
The main sequence module is repeated four times and is composed of smaller repeated units which are present in direct and palindromic orientation to each other. This net-like pattern was compared to the Low Copy Repeats scheme described by T. Shaik et al.; these repeats are known to be responsible for large deletions that cause the genetic disorder DiGeorge/Velo-Cardio-Facial Syndrome. In Figure 7, the four Low Copy Repeats involved in the disease are represented by the blocks A, B, C and D, overlaid on the repeat graph computed with GenAlyzer.
Overall, an analysis of the repeat structure of different chromosomes using GenAlyzer helps to identify such breakpoint regions by localizing Low Copy Repeats, without any experimental approach. This is just one of the many applications of GenAlyzer. For other practical examples on biological problems, see Kurtz et al. [3], or the GenAlyzer manual.
GenAlyzer is available together with Vmatch. The license agreement can be found on the Vmatch website.