Unless instructed otherwise, the program makes three alignments. First it compares both strands of the spliced sequence against the forward strand of the genomic sequence, assuming the splice consensus GT/AG (i.e. in the forward gene direction). The maximum-scoring orientation is then realigned assuming the splice consensus CT/AC (i.e. in the reversed gene direction). Only the overall maximum-scoring alignment is reported.
The program outputs a list of the exons and introns it has found. The format is like that of MSPcrunch, i.e. a list of matching segments, and is easy for other software to parse. The program also indicates, based on the splice site information, the gene's predicted direction of transcription. Optionally the full sequence alignment is printed as well (see the example).
1. A first pass Smith-Waterman local alignment scan is done to find the start and end of the maximally scoring segments.
2. Subsequences corresponding to these segments are extracted.
3a. If the product of the subsequences' lengths is less than a user-defined threshold (i.e. they will fit in memory) the segments are realigned using the Needleman-Wunsch global alignment algorithm, which will give the same result as the Smith-Waterman since the subsequences are guaranteed to align end-to-end.
3b. If the product of the lengths exceeds the threshold (a full alignment will not fit in memory), the alignment is made recursively by splitting the spliced (EST) sequence in half and finding the genome sequence position which aligns with the mid-point. The process is repeated until the product of the lengths is less than the threshold. The divided sequences are aligned separately and then merged.
4. The genome sequence is searched against the forward and reverse strands of the spliced (EST) sequence, assuming a forward gene splicing direction (i.e. GT/AG consensus).
5. Then the best-scoring orientation is realigned assuming reverse splicing (CT/AC consensus). The overall best alignment is reported.
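Steps 4 and 5 above can be sketched as follows. This is only an illustration of the order in which the three alignments are made, not the program's actual code: the names `best_alignment` and `reverse_complement` are made up for this sketch, and the aligner itself is assumed to be supplied by the caller as a function taking (genome, est, splice direction) and returning a score.

```python
from typing import Callable, Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# Hypothetical aligner signature: (genome, est, splice_direction) -> score.
Aligner = Callable[[str, str, str], float]


def best_alignment(genome: str, est: str, align: Aligner) -> Tuple[float, str, str]:
    """Return (score, est_strand, splice_direction) for the best of three alignments."""
    # Step 4: both EST strands against the forward genomic strand,
    # assuming forward splicing (GT/AG).
    candidates = [
        (align(genome, est, "forward"), "forward", "forward"),
        (align(genome, reverse_complement(est), "forward"), "reverse", "forward"),
    ]
    _, strand, _ = max(candidates, key=lambda c: c[0])

    # Step 5: realign the better orientation assuming reverse splicing (CT/AC).
    oriented = est if strand == "forward" else reverse_complement(est)
    candidates.append((align(genome, oriented, "reverse"), strand, "reverse"))

    # Only the overall best-scoring alignment is reported.
    return max(candidates, key=lambda c: c[0])
```

Passing the aligner in as a parameter keeps the sketch independent of any particular scoring scheme; only three alignments are ever computed, never all four strand/splice combinations.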
The score for Exon segments is the alignment score excluding flanking intron penalties. The Span score is the total including the intron costs.
The coordinates of the genomic sequence always refer to the positive strand, but are swapped if the EST has been reversed. The splice direction of each intron is indicated as +Intron (forward, splice sites GT/AG), -Intron (reverse, splice sites CT/AC) or ?Intron (unknown direction). Segment entries give the alignment as a series of ungapped matching segments.
parameter       default  description
match           1        score for matching two bases
mismatch        1        cost for mismatching two bases
gap_penalty     2        cost for deleting a single base in either sequence, excluding introns
intron_penalty  40       cost for an intron, independent of length
splice_penalty  20       cost for an intron, independent of length and starting/ending on donor-acceptor sites
space           10       space threshold (in megabytes) for linear-space recursion

If the product of the two sequence lengths divided by 4 exceeds the space threshold then a divide-and-conquer strategy is used to control the memory requirements. In this way very long sequences can be aligned. If you have a machine with plenty of memory you can raise this parameter (but do not exceed the machine's physical RAM). However, normally you should not need to change this parameter.

There is no gap initiation cost for short gaps, just a penalty proportional to the length of the gap. Thus the cost of inserting a gap of length L in the EST is

    L*gap_penalty

and the cost in the genome is

    min { L*gap_penalty, intron_penalty }

or

    min { L*gap_penalty, splice_penalty }

if the gap starts with GT and ends with AG (or CT/AC if splice direction reversed).

Introns are not allowed in the EST. The difference between the intron_penalty and splice_penalty allows for some slack in marking the intron end-points. It is often the case that the best intron boundaries, from the point of view of minimising mismatches, will not coincide exactly with the splice consensus, so provided the difference between the intron/splice penalties outweighs the extra mismatch/indel costs the alignment will respect the proper boundaries. If the alignment still prefers boundaries which don't start and end with the splice consensus then this may indicate errors in the sequences.
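The gap costs described above can be written out as a small sketch. The function names and the `reverse_splice` flag are invented for this illustration and are not part of the program; the default values are the parameter defaults given above.

```python
def est_gap_cost(length: int, gap_penalty: int = 2) -> int:
    """Cost of a gap of the given length in the EST: purely proportional."""
    return length * gap_penalty


def genome_gap_cost(
    gap: str,
    gap_penalty: int = 2,
    intron_penalty: int = 40,
    splice_penalty: int = 20,
    reverse_splice: bool = False,
) -> int:
    """Cost of a gap in the genome, where `gap` is the skipped genomic sequence."""
    linear = len(gap) * gap_penalty
    # Splice consensus: GT/AG for forward splicing, CT/AC if reversed.
    donor, acceptor = ("CT", "AC") if reverse_splice else ("GT", "AG")
    bases = gap.upper()
    if bases.startswith(donor) and bases.endswith(acceptor):
        # Gap bounded by the splice consensus: the cheaper splice_penalty applies.
        return min(linear, splice_penalty)
    # Otherwise the gap can cost at most intron_penalty.
    return min(linear, intron_penalty)
```

With the defaults, a 100-base gap in the genome costs 20 if it starts with GT and ends with AG, and 40 otherwise, whereas the same gap in the EST costs 200. A 5-base gap costs 10 in either sequence, since the proportional cost is then the smaller term.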
The default parameters work well, except that very short exons (length less than approximately the splice_penalty) may be skipped. The intron penalties should not be set to less than the maximum expected random match between the sequences (typically 10-15 bp), in order to avoid spurious matches. The algorithm follows the numbered steps listed above.
The original program was est_genome, written by Richard Mott at the Sanger Centre. The original version is available from ftp://ftp.sanger.ac.uk/pub/pmr/est_genome.4.tar.Z