Dec 9, 2007
Rails General, Rails Techniques

HTML-Aware Truncate Text

When building a large custom PHP CMS for DigitalPeach, I ran into a surprisingly difficult problem: truncating text while keeping nested HTML tags intact. Specifically, we wanted to break up large articles composed in FCKEditor into separate pages after a certain character threshold. One can easily see the problem:

<p>This is a test <strong>with some bold in here</strong>.</p>
Now imagine having to truncate this text to 30 characters. You end up with this:

<p>This is a test <strong>with

This example isn't too severe: at worst, the rest of the text within the block-level element would turn bold. Still, truncation that is blind to HTML can clearly cause serious problems. Furthermore, even where it has little practical effect, it produces invalid XHTML.
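To make the problem concrete, here is a minimal sketch in plain Ruby that slices the raw string and then lists the tags left open. The variable names (html, naive, open_tags) are mine, and the regular expression is only a rough tag matcher for this example, not a real HTML parser.

```ruby
html = "<p>This is a test <strong>with some bold in here</strong>.</p>"

# Naive truncation: slice the raw string, markup and all.
naive = html[0, 30]

# Track which tags were opened but never closed inside the slice.
open_tags = []
naive.scan(%r{<(/?)(\w+)[^>]*>}) do |closing, name|
  closing.empty? ? open_tags.push(name) : open_tags.pop
end

puts naive              # <p>This is a test <strong>with
puts open_tags.inspect  # ["p", "strong"]
```

Both the paragraph and the strong element are left dangling, which is exactly what an HTML-aware truncation has to repair.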

I was lucky to stumble across an excellent article by Mike Burns, who describes a Ruby method using REXML's pull parser that can accomplish this. His example extended the String class, so I modified it to work as a Rails helper, all in one method:

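require 'rexml/parsers/pullparser'

# Truncates input to at most len characters of text, closing any tags left open.
# Expects well-formed XHTML; String#first comes from ActiveSupport (available in Rails).
def truncate_html(input, len = 30, extension = "...")
  # Serializes an attribute hash as ' name="value" ...' with a leading space,
  # or as '' when there are no attributes, so bare tags render as <p>, not <p >.
  def attrs_to_s(attrs)
    return '' if attrs.empty?
    ' ' + attrs.to_a.map { |attr| %{#{attr[0]}="#{attr[1]}"} }.join(' ')
  end

  p = REXML::Parsers::PullParser.new(input)
  tags = []       # stack of currently open elements
  new_len = len   # remaining text budget
  results = ''
  while p.has_next? && new_len > 0
    p_e = p.pull
    case p_e.event_type
    when :start_element
      tags.push p_e[0]
      results << "<#{tags.last}#{attrs_to_s(p_e[1])}>"
    when :end_element
      results << "</#{tags.pop}>"
    when :text
      results << p_e[0].first(new_len)
      new_len -= p_e[0].length
    else
      # Any other event (comment, CDATA, ...) is dumped as an HTML comment.
      results << "<!-- #{p_e.inspect} -->"
    end
  end

  # Close whatever is still open, innermost first.
  tags.reverse.each do |tag|
    results << "</#{tag}>"
  end

  # Append the extension only if text was cut or unread input remains.
  results + ((new_len < 0 || p.has_next?) ? extension : '')
end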

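Note that the nested method above is completely valid Ruby, though not widely used. Be aware that a nested def is not a local function: it (re)defines attrs_to_s on the enclosing class or module every time truncate_html runs.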
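If nested methods are new to you, the following small sketch shows what Ruby actually does with one. The Demo class and its outer and inner methods are hypothetical names used only for illustration; they are not part of the helper.

```ruby
class Demo
  def outer
    # This def runs when outer is called and defines Demo#inner.
    def inner
      "defined by outer"
    end
    inner
  end
end

before = Demo.method_defined?(:inner)  # inner does not exist yet
result = Demo.new.outer                # calling outer defines inner
after  = Demo.method_defined?(:inner)  # inner is now a regular instance method

puts before  # false
puts result  # defined by outer
puts after   # true
```

So the inner method is not scoped to the outer one. If that bothers you, moving it out as a private helper method is a reasonable alternative.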

And now look at what it can do:

truncate_html("<p>Test <strong>bold</strong> done.</p>", 9)
# => "<p>Test <strong>bold</strong></p>..."