diffseq finds the region of overlap of the input sequences and then reports differences within this region, like a local alignment.
The start and end positions of the overlap are reported.
diffseq should be of value when looking for SNPs, differences between strains of an organism and anything else that requires the differences between sequences to be highlighted.
The sequences can be very long. The program does a match of all sequence words of size 10 (by default). It then reduces this to the minimum set of overlapping matches by sorting the matches in order of size (largest size first) and then for each such match it removes any smaller matches that overlap. The result is a set of the longest ungapped alignments between the two sequences that do not overlap with each other. The mismatched regions between these matches are reported.
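The matching procedure described above can be sketched in Python. This is a minimal illustration under stated assumptions, not diffseq's actual implementation: the function names (word_matches, overlaps, select_non_overlapping, mismatched_regions) are hypothetical, positions are 0-based, and the retained matches are assumed to appear in the same order in both sequences.

```python
def word_matches(a, b, wordsize=10):
    """Return maximal ungapped exact matches as (a_start, b_start, length)."""
    # Index every word of length `wordsize` in b by its start position.
    index = {}
    for j in range(len(b) - wordsize + 1):
        index.setdefault(b[j:j + wordsize], []).append(j)

    matches = []
    for i in range(len(a) - wordsize + 1):
        for j in index.get(a[i:i + wordsize], ()):
            # A hit whose preceding characters also agree is already covered
            # by the match that started one position earlier on this diagonal.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend the word match to the right for as long as it agrees.
            length = wordsize
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            matches.append((i, j, length))
    return matches


def overlaps(m, n):
    """True if two matches share positions in either sequence."""
    (a1, b1, l1), (a2, b2, l2) = m, n
    in_a = a1 < a2 + l2 and a2 < a1 + l1
    in_b = b1 < b2 + l2 and b2 < b1 + l1
    return in_a or in_b


def select_non_overlapping(matches):
    """Keep the longest matches, dropping smaller ones that overlap them."""
    kept = []
    for m in sorted(matches, key=lambda m: m[2], reverse=True):
        if not any(overlaps(m, k) for k in kept):
            kept.append(m)
    return sorted(kept)  # ordered by start position in the first sequence


def mismatched_regions(kept):
    """Return the gaps between consecutive kept matches as half-open
    intervals (a_start, a_end, b_start, b_end).

    Assumption: the kept matches are in the same order in both sequences.
    """
    regions = []
    for (a1, b1, l1), (a2, b2, _) in zip(kept, kept[1:]):
        regions.append((a1 + l1, a2, b1 + l1, b2))
    return regions
```

For example, with a word size of 5, the strings "ABCDEFGHIJKLMNOPQRSTUVWXYZ" and "ABCDEFGHIJKLxNOPQRSTUVWXYZ" give two retained matches on either side of the changed character and a single mismatched region covering position 12 in each string.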
It should be possible to find differences between sequences that are megabases long.
By default diffseq writes a 'diffseq' report file.
The first line is the title giving the names of the sequences used.
The next two non-blank lines state the positions in each sequence where the detected overlap between them starts.
There then follows a set of reports of the mismatches between the sequences.
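Each report consists of 4 or more lines. The information for the first sequence comes first: a line giving the position of the mismatch in that sequence, any 'Feature:' lines, and a 'Sequence:' line.
This is followed by the equivalent information for the second sequence, but in the reverse order, namely the 'Sequence:' line, any 'Feature:' lines and a line giving the position of the mismatch in the second sequence.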
At the end of the report are two non-blank lines giving the positions in each sequence where the detected overlap between them ends.
The last three lines of the report give the counts of SNPs (defined as a change of one nucleotide to one other nucleotide; deletions, insertions and multi-base changes are not counted).
If the input sequences are nucleic acid, the counts of transitions (pyrimidine to pyrimidine or purine to purine) and transversions (pyrimidine to purine, or purine to pyrimidine) are also given.
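The distinction between the two kinds of single-base change can be illustrated with a short Python sketch. The function name classify_snp is hypothetical and the code only handles the four unambiguous DNA bases; it is not taken from diffseq itself.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(ref, alt):
    """Classify a single-base change as 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    # A transition stays within one chemical class (purine or pyrimidine);
    # a transversion crosses from one class to the other.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"
```

Under this definition A to G and C to T are transitions, while A to C and G to T are transversions.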
It should be noted that not all features are reported.
The 'source' feature found in all EMBL/GenBank feature table entries is not reported: it covers the whole sequence, so it overlaps with any difference found in that sequence and is uninformative and irritating. It has therefore been removed from the output report.
The translation information of CDS features is often extremely long and does not add useful information to the report. It has therefore been removed from the output report.
If you run out of memory, use a larger word size.
Using a larger word size increases the length between mismatches that will be reported as one event. Thus a word size of 50 will report two single-base differences that are within 50 bases of each other as one mismatch.
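The effect of the word size on reporting can be shown with a simplified model in Python. It assumes two sequences of equal length that differ only by substitutions, and it treats any matching run shorter than the word size as undetectable, so the differences on either side of it are grouped together. The function name group_differences is hypothetical, and sequence-end effects are ignored.

```python
def group_differences(positions, wordsize):
    """Group differing positions that would be reported as one mismatch.

    Two neighbouring differences are merged when the identical stretch
    between them is shorter than `wordsize`, because no word match of
    that size fits between them.
    """
    groups = []
    for p in sorted(positions):
        if groups and p - groups[-1][-1] - 1 < wordsize:
            groups[-1].append(p)
        else:
            groups.append([p])
    return groups
```

With differences at positions 100 and 130, a word size of 10 keeps them as two separate events, whereas a word size of 50 merges them into one.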