class Bio::GFF::GFF3::Record::Gap
Bio:GFF::GFF3::Record::Gap is a class to store data of “Gap” attribute.
Constants
- Code
Code is a class to store length of single-letter code.
Attributes
Internal data. Users must not use it.
Public Class Methods
Creates a new Gap object.
Arguments:
-
str: a formatted string, or nil.
# File lib/bio/db/gff.rb, line 1275 def initialize(str = nil) if str then @data = str.split(/ +/).collect do |x| if /\A([A-Z])([0-9]+)\z/ =~ x.strip then Code.new($1.intern, $2.to_i) else warn "ignored unknown token: #{x}.inspect" if $VERBOSE nil end end @data.compact! else @data = [] end end
Creates a new Gap object from given sequence alignment.
Note that sites of which both reference and target are gaps are silently removed.
Arguments:
-
reference: reference sequence (nucleotide sequence)
-
target: target sequence (nucleotide sequence)
-
gap_regexp: regexp to identify gap
# File lib/bio/db/gff.rb, line 1391 def self.new_from_sequences_na(reference, target, gap_regexp = /[^a-zA-Z]/) gap = self.new gap.instance_eval { __initialize_from_sequences_na(reference, target, gap_regexp) } gap end
Creates a new Gap object from given sequence alignment.
Note that sites of which both reference and target are gaps are silently removed.
For incorrect alignments that break 3:1 rule, gap positions will be moved inside codons, unwanted gaps will be removed, and some forward or reverse frameshift will be inserted.
For example,
atgg-taagac-att M V K - I
is treated as:
atggt<aagacatt M V K >>I
Incorrect combination of frameshift with frameshift or gap may cause undefined behavior.
Forward frameshifts are recomended to be indicated in the target sequence. Reverse frameshifts can be indicated in the reference sequence or the target sequence.
Priority of regular expressions:
space > forward/reverse frameshift > gap
Arguments:
-
reference: reference sequence (nucleotide sequence)
-
target: target sequence (amino acid sequence)
-
gap_regexp: regexp to identify gap
-
space_regexp: regexp to identify space character which is completely ignored
-
forward_frameshift_regexp: regexp to identify forward frameshift
-
reverse_frameshift_regexp: regexp to identify reverse frameshift
# File lib/bio/db/gff.rb, line 1587 def self.new_from_sequences_na_aa(reference, target, gap_regexp = /[^a-zA-Z]/, space_regexp = /\s/, forward_frameshift_regexp = /\>/, reverse_frameshift_regexp = /\</) gap = self.new gap.instance_eval { __initialize_from_sequences_na_aa(reference, target, gap_regexp, space_regexp, forward_frameshift_regexp, reverse_frameshift_regexp) } gap end
Same as new(str).
# File lib/bio/db/gff.rb, line 1292 def self.parse(str) self.new(str) end
Public Instance Methods
If self == other, returns true. otherwise, returns false.
# File lib/bio/db/gff.rb, line 1615 def ==(other) if other.class == self.class and @data == other.data then true else false end end
Processes nucleotide sequences and returns gapped sequences as an array of sequences.
Note for forward/reverse frameshift: Forward/Reverse_frameshift is simply treated as gap insertion to the target/reference sequence.
Arguments:
-
reference: reference sequence (nucleotide sequence)
-
target: target sequence (nucleotide sequence)
-
gap_char: gap character
# File lib/bio/db/gff.rb, line 1715 def process_sequences_na(reference, target, gap_char = '-') s_ref, s_tgt = dup_seqs(reference, target) s_ref, s_tgt = __process_sequences(s_ref, s_tgt, gap_char, gap_char, 1, 1, gap_char, gap_char) if $VERBOSE and s_ref.length != s_tgt.length then warn "returned sequences not equal length" end return s_ref, s_tgt end
Processes sequences and returns gapped sequences as an array of sequences. reference must be a nucleotide sequence, and target must be an amino acid sequence.
Note for reverse frameshift: Reverse_frameshift characers are inserted in the reference sequence. For example, alignment of “Gap=M3 R1 M2” is:
atgaagat<aatgtc M K I N V
Alignment of “Gap=M3 R3 M3” is:
atgaag<<<attaatgtc M K I I N V
Arguments:
-
reference: reference sequence (nucleotide sequence)
-
target: target sequence (amino acid sequence)
-
gap_char: gap character
-
space_char: space character inserted to amino sequence for matching na-aa alignment
-
forward_frameshift: forward frameshift character
-
reverse_frameshift: reverse frameshift character
# File lib/bio/db/gff.rb, line 1752 def process_sequences_na_aa(reference, target, gap_char = '-', space_char = ' ', forward_frameshift = '>', reverse_frameshift = '<') s_ref, s_tgt = dup_seqs(reference, target) s_tgt = s_tgt.gsub(/./, "\\0#{space_char}#{space_char}") ref_increment = 3 tgt_increment = 1 + space_char.length * 2 ref_gap = gap_char * 3 tgt_gap = "#{gap_char}#{space_char}#{space_char}" return __process_sequences(s_ref, s_tgt, ref_gap, tgt_gap, ref_increment, tgt_increment, forward_frameshift, reverse_frameshift) end
string representation
# File lib/bio/db/gff.rb, line 1604 def to_s @data.collect { |x| x.to_s }.join(" ") end
Private Instance Methods
(private method) Parses given reference-target sequence alignment and initializes self. Existing data will be erased.
# File lib/bio/db/gff.rb, line 1322 def __initialize_from_sequences_na(reference, target, gap_regexp = /[^a-zA-Z]/) data_ref = __scan_gap(reference, gap_regexp, :I, :M) data_tgt = __scan_gap(target, gap_regexp, :D, :M) data = [] while !data_ref.empty? and !data_tgt.empty? ref = data_ref.shift tgt = data_tgt.shift if ref.length > tgt.length then x = Code.new(ref.code, ref.length - tgt.length) data_ref.unshift x ref.length = tgt.length elsif ref.length < tgt.length then x = Code.new(tgt.code, tgt.length - ref.length) data_tgt.unshift x tgt.length = ref.length end case ref.code when :M if tgt.code == :M then data.push ref elsif tgt.code == :D then data.push tgt else raise 'Bug: should not reach here.' end when :I if tgt.code == :M then data.push ref elsif tgt.code == :D then # This site is ignored, # because both reference and target are gap else raise 'Bug: should not reach here.' end end end #while # rest of data_ref len = 0 data_ref.each do |r| len += r.length if r.code == :M end data.push Code.new(:D, len) if len > 0 # rest of data_tgt len = 0 data_tgt.each do |t| len += t.length if t.code == :M end data.push Code.new(:I, len) if len > 0 @data = data true end
(private method) Parses given reference(nuc)-target(amino) sequence alignment and initializes self. Existing data will be erased.
# File lib/bio/db/gff.rb, line 1451 def __initialize_from_sequences_na_aa(reference, target, gap_regexp = /[^a-zA-Z]/, space_regexp = /\s/, forward_frameshift_regexp = /\>/, reverse_frameshift_regexp = /\</) data = [] sc_ref = StringScanner.new(reference) sc_tgt = StringScanner.new(target) re_one = /./mn while !sc_tgt.eos? if len = sc_tgt.skip(space_regexp) then # ignored elsif len = sc_tgt.skip(forward_frameshift_regexp) then cur = __push_code_to_data(cur, data, :F, len) len.times { sc_ref.scan(re_one) } elsif len = sc_tgt.skip(reverse_frameshift_regexp) then cur = __push_code_to_data(cur, data, :R, len) pos = sc_ref.pos pos -= len if pos < 0 then warn "Incorrect reverse frameshift" if $VERBOSE pos = 0 end sc_ref.pos = pos elsif len = sc_tgt.skip(gap_regexp) then len.times do ref_gaps, ref_fs = __scan_codon(sc_ref, gap_regexp, space_regexp, forward_frameshift_regexp, reverse_frameshift_regexp) case ref_gaps when 3 # both ref and tgt are gap. ignored the site when 2, 1 # forward frameshift inserted ref_fs += (3 - ref_gaps) when 0 cur = __push_code_to_data(cur, data, :D, 1) else raise 'Bug: should not reach here' end if ref_fs < 0 then cur = __push_code_to_data(cur, data, :R, -ref_fs) elsif ref_fs > 0 then cur = __push_code_to_data(cur, data, :F, ref_fs) end end #len.times elsif len = sc_tgt.skip(re_one) then # always 1-letter ref_gaps, ref_fs = __scan_codon(sc_ref, gap_regexp, space_regexp, forward_frameshift_regexp, reverse_frameshift_regexp) case ref_gaps when 3 cur = __push_code_to_data(cur, data, :I, 1) when 2, 1, 0 # reverse frameshift inserted when gaps exist ref_fs -= ref_gaps # normal site cur = __push_code_to_data(cur, data, :M, 1) else raise 'Bug: should not reach here' end if ref_fs < 0 then cur = __push_code_to_data(cur, data, :R, -ref_fs) elsif ref_fs > 0 then cur = __push_code_to_data(cur, data, :F, ref_fs) end else raise 'Bug: should not reach here' end end #while if sc_ref.rest_size > 0 then rest = sc_ref.scan(/.*/mn) rest.gsub!(space_regexp, '') rest.gsub!(forward_frameshift_regexp, '') rest.gsub!(reverse_frameshift_regexp, '') rest.gsub!(gap_regexp, '') len = rest.length.div(3) cur = __push_code_to_data(cur, data, :D, len) if len > 0 len = rest.length % 3 cur = __push_code_to_data(cur, data, :F, len) if len > 0 end @data = data self end
(private method) insert gaps refers to the gap rule inside the object
# File lib/bio/db/gff.rb, line 1638 def __process_sequences(s_ref, s_tgt, ref_gap, tgt_gap, ref_increment, tgt_increment, forward_frameshift, reverse_frameshift) p_ref = 0 p_tgt = 0 @data.each do |c| #$stderr.puts c.inspect #$stderr.puts "p_ref=#{p_ref} s_ref=#{s_ref.inspect}" #$stderr.puts "p_tgt=#{p_tgt} s_tgt=#{s_tgt.inspect}" case c.code when :M # match p_ref += c.length * ref_increment p_tgt += c.length * tgt_increment when :I # insert a gap into the reference sequence begin s_ref[p_ref, 0] = ref_gap * c.length rescue IndexError raise 'reference sequence too short' end p_ref += c.length * ref_increment p_tgt += c.length * tgt_increment when :D # insert a gap into the target (delete from reference) begin s_tgt[p_tgt, 0] = tgt_gap * c.length rescue IndexError raise 'target sequence too short' end p_ref += c.length * ref_increment p_tgt += c.length * tgt_increment when :F # frameshift forward in the reference sequence begin s_tgt[p_tgt, 0] = forward_frameshift * c.length rescue IndexError raise 'target sequence too short' end p_ref += c.length p_tgt += c.length when :R # frameshift reverse in the reference sequence p_rev_frm = p_ref - c.length if p_rev_frm < 0 then raise 'too short reference sequence, or too many reverse frameshifts' end begin s_ref[p_rev_frm, 0] = reverse_frameshift * c.length rescue IndexError raise 'reference sequence too short' end else warn "ignored #{c.to_s.inspect}" if $VERBOSE end end if s_ref.length < p_ref then raise 'reference sequence too short' end if s_tgt.length < p_tgt then raise 'target sequence too short' end return s_ref, s_tgt end
(private method) internal use only
# File lib/bio/db/gff.rb, line 1437 def __push_code_to_data(cur, data, code, len) if cur and cur.code == code then cur.length += len else cur = Code.new(code, len) data.push cur end return cur end
(private method) scans a codon or gap in reference sequence
# File lib/bio/db/gff.rb, line 1403 def __scan_codon(sc_ref, gap_regexp, space_regexp, forward_frameshift_regexp, reverse_frameshift_regexp) chars = [] gap_count = 0 fs_count = 0 while chars.size < 3 + fs_count and char = sc_ref.scan(/./mn) case char when space_regexp # ignored when forward_frameshift_regexp # next char is forward frameshift fs_count += 1 when reverse_frameshift_regexp # next char is reverse frameshift fs_count -= 1 when gap_regexp chars.push char gap_count += 1 else chars.push char end end #while if chars.size < (3 + fs_count) then gap_count += (3 + fs_count) - chars.size end return gap_count, fs_count end
(private method) Scans gaps and returns an array of Code objects
# File lib/bio/db/gff.rb, line 1298 def __scan_gap(str, gap_regexp = /[^a-zA-Z]/, code_i = :I, code_m = :M) sc = StringScanner.new(str) data = [] while len = sc.skip_until(gap_regexp) mlen = len - sc.matched_size data.push Code.new(code_m, mlen) if mlen > 0 g = Code.new(code_i, sc.matched_size) while glen = sc.skip(gap_regexp) g.length += glen end data.push g end if sc.rest_size > 0 then m = Code.new(code_m, sc.rest_size) data.push m end data end
duplicates sequences
# File lib/bio/db/gff.rb, line 1625 def dup_seqs(*arg) arg.collect do |s| begin s = s.seq rescue NoMethodError end s.dup end end