class FeedNormalizer::HtmlCleaner

Various methods for cleaning up HTML and preparing it for safe public consumption.

Documents used for reference:

Constants

DODGY_URI_SCHEMES
HTML_ATTRS

Allowed attributes.

HTML_ELEMENTS

Allowed HTML elements.

HTML_URI_ATTRS

Allowed attributes that can contain URIs, so extra caution is required. NOTE: this does not list all URI attributes, only the ones that are allowed.

Public Class Methods

add_entities(str)

Adds entities where possible. Works like CGI.escapeHTML, but will not escape existing entities; i.e. &#123; will NOT become &amp;#123;

This method could be improved by adding a whitelist of html entities.

# File lib/html-cleaner.rb, line 152
def add_entities(str)
  str.to_s.gsub(/\"/n, '&quot;').gsub(/>/n, '&gt;').gsub(/</n, '&lt;').gsub(/&(?!(\#\d+|\#x([0-9a-f]+)|\w{2,8});)/nmi, '&amp;')
end
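
The behaviour described above can be tried in isolation. The following is a minimal standalone sketch, not the library's code: add_entities_sketch is a hypothetical name, and the /n regexp flags from the original are dropped on the assumption that they are not needed for plain UTF-8 input.

```ruby
# Hypothetical standalone copy of the documented method, without the /n flags.
def add_entities_sketch(str)
  str.to_s
     .gsub(/"/, '&quot;')
     .gsub(/>/, '&gt;')
     .gsub(/</, '&lt;')
     # Escape '&' only when it does not already start an entity
     # (numeric, hex, or a 2-8 character named entity).
     .gsub(/&(?!(\#\d+|\#x([0-9a-f]+)|\w{2,8});)/mi, '&amp;')
end

puts add_entities_sketch('<b>Tom & Jerry</b>')  # bare '&' is escaped
puts add_entities_sketch('&#123; &amp; &lt;')   # existing entities are left alone
```

Because the quote and angle-bracket replacements run first, the entities they produce are already well-formed when the final ampersand pass runs, so they are not escaped a second time.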
clean(str)

Cleans the given HTML string in the following steps:

  • Unescape HTML

  • Parse HTML into tree

  • Find 'body' if present, and extract tree inside that tag, otherwise parse whole tree

  • Each tag:

    • remove tag if not whitelisted

    • escape HTML tag contents

    • remove all attributes not on whitelist

    • extra-scrub URI attrs; see dodgy_uri?

Extra (i.e. unmatched) ending tags and comments are removed.

# File lib/html-cleaner.rb, line 60
def clean(str)
  str = unescapeHTML(str)

  doc = Hpricot(str, :fixup_tags => true)
  doc = subtree(doc, :body)

  # get all the tags in the document
  # Somewhere near hpricot 0.4.92 "*" starting to return all elements,
  # including text nodes instead of just tagged elements.
  tags = (doc/"*").inject([]) { |m,e| m << e.name if(e.respond_to?(:name) && e.name =~ /^\w+$/) ; m }.uniq

  # Remove tags that aren't whitelisted.
  remove_tags!(doc, tags - HTML_ELEMENTS)
  remaining_tags = tags & HTML_ELEMENTS

  # Remove attributes that aren't on the whitelist, or are suspicious URLs.
  (doc/remaining_tags.join(",")).each do |element|
    next if element.raw_attributes.nil? || element.raw_attributes.empty?
    element.raw_attributes.reject! do |attr,val|
      !HTML_ATTRS.include?(attr) || (HTML_URI_ATTRS.include?(attr) && dodgy_uri?(val))
    end

    element.raw_attributes = element.raw_attributes.build_hash {|a,v| [a, add_entities(v)]}
  end unless remaining_tags.empty?
  
  doc.traverse_text do |t|
    t.swap(add_entities(t.to_html))
  end

  # Return the tree, without comments. Ugly way of removing comments,
  # but can't see a way to do this in Hpricot yet.
  doc.to_s.gsub(/<\!--.*?-->/mi, '')
end
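
The attribute-filtering step in clean can be illustrated without Hpricot by treating an element's attributes as a plain Hash. This is only a sketch under stated assumptions: ALLOWED_ATTRS and URI_ATTRS are hypothetical stand-ins for HTML_ATTRS and HTML_URI_ATTRS (their real contents are not shown here), and suspicious_uri? is a much simpler stand-in for dodgy_uri?.

```ruby
# Hypothetical whitelists for illustration only; the real HTML_ATTRS and
# HTML_URI_ATTRS constants live in lib/html-cleaner.rb.
ALLOWED_ATTRS = %w[href src title alt].freeze
URI_ATTRS     = %w[href src].freeze

# Simplified stand-in for dodgy_uri?: only checks a literal scheme prefix.
def suspicious_uri?(value)
  value.to_s.match?(/\A\s*(javascript|vbscript|data):/i)
end

# Same rejection rule as in clean: drop attributes that are not whitelisted,
# and drop whitelisted URI attributes whose value looks suspicious.
def filter_attributes(attrs)
  attrs.reject do |name, value|
    !ALLOWED_ATTRS.include?(name) || (URI_ATTRS.include?(name) && suspicious_uri?(value))
  end
end

attrs = { "href" => "javascript:alert(1)", "title" => "hi", "onclick" => "evil()" }
p filter_attributes(attrs)  # => {"title"=>"hi"}
```

Here href is whitelisted but carries a suspicious scheme, and onclick is not whitelisted at all, so only title survives.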
dodgy_uri?(uri)

Returns true if the given string contains a suspicious URL, e.g. a javascript link.

This method rejects javascript, vbscript, livescript, mocha and data URLs. It could be refined to only deny dangerous data URLs, however.

# File lib/html-cleaner.rb, line 117
def dodgy_uri?(uri)
  uri = uri.to_s

  # special case for poorly-formed entities (missing ';')
  # if these occur *anywhere* within the string, then throw it out.
  return true if (uri =~ /&\#(\d+|x[0-9a-f]+)[^;\d]/mi)

  # Try unescaping as both HTML and URI encodings, then test
  # each scheme regexp against each result
  [unescapeHTML(uri), CGI.unescape(uri)].each do |unesc_uri|
    DODGY_URI_SCHEMES.each do |scheme|

      regexp = "#{scheme}:".gsub(/./) do |char|
        "([\000-\037\177\s]*)#{char}"
      end

      # regexp looks something like
      # /\A([\000-\037\177\s]*)j([\000-\037\177\s]*)a([\000-\037\177\s]*)v([\000-\037\177\s]*)a([\000-\037\177\s]*)s([\000-\037\177\s]*)c([\000-\037\177\s]*)r([\000-\037\177\s]*)i([\000-\037\177\s]*)p([\000-\037\177\s]*)t([\000-\037\177\s]*):/mi
      return true if (unesc_uri =~ %r{\A#{regexp}}mi)
    end
  end

  nil
end
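
To see why the generated regexp catches obfuscated schemes, the pattern construction can be reproduced on its own. This is a sketch, not the library's code: scheme_pattern and FILLER are hypothetical names, and the filler class is written with escaped hex ranges rather than the literal control characters used above.

```ruby
# Optional run of control characters or whitespace allowed before each character.
FILLER = '[\x00-\x1f\x7f\s]*'

# Builds a pattern that matches "<scheme>:" at the start of a string even when
# control characters or whitespace are interleaved between its characters.
def scheme_pattern(scheme)
  body = "#{scheme}:".each_char.map { |char| FILLER + Regexp.escape(char) }.join
  Regexp.new('\A' + body, Regexp::IGNORECASE | Regexp::MULTILINE)
end

pattern = scheme_pattern("javascript")

p pattern.match?("javascript:alert(1)")             # => true
p pattern.match?(" Java\tScript :alert(1)")         # => true
p pattern.match?("http://example.com/javascript:")  # => false
```

The anchor at the start of the string is what keeps the last case from matching: a scheme name appearing later in an otherwise ordinary URL is not treated as suspicious.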
flatten(str)

For all other feed elements:

  • Unescape HTML.

  • Parse HTML into tree (taking 'body' as root, if present)

  • Takes text out of each tag, and escapes HTML.

  • Returns all text concatenated.

# File lib/html-cleaner.rb, line 99
def flatten(str)
  str.gsub!("\n", " ")
  str = unescapeHTML(str)

  doc = Hpricot(str, :xhtml_strict => true)
  doc = subtree(doc, :body)

  out = []
  doc.traverse_text {|t| out << add_entities(t.to_html)}

  return out.join
end
unescapeHTML(str, xml = true)

Unescapes HTML. If xml is true, also converts XML-only named entities to HTML.

# File lib/html-cleaner.rb, line 143
def unescapeHTML(str, xml = true)
  CGI.unescapeHTML(xml ? str.gsub("&apos;", "&#39;") : str)
end
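
A brief standalone illustration of the same idea, assuming only Ruby's standard cgi library is available. unescape_html_sketch is a hypothetical name. Recent Ruby versions may already decode &apos; inside CGI.unescapeHTML, in which case the substitution is a harmless no-op.

```ruby
require 'cgi'

# Rewrites the XML-only &apos; entity to a numeric entity before unescaping,
# mirroring the documented method.
def unescape_html_sketch(str, xml = true)
  CGI.unescapeHTML(xml ? str.gsub("&apos;", "&#39;") : str)
end

puts unescape_html_sketch("&lt;b&gt;it&apos;s&lt;/b&gt;")
```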

Private Class Methods

remove_tags!(doc, tags)
# File lib/html-cleaner.rb, line 163
def remove_tags!(doc, tags)
  (doc/tags.join(",")).remove unless tags.empty?
end
subtree(doc, element)

Returns everything below element, or just the doc itself if element is not present.

# File lib/html-cleaner.rb, line 159
def subtree(doc, element)
  doc.at("//#{element}/*") || doc
end