html2text function in Ruby
Whilst adding text-email feature in RssFwd (Ruby on Rails app), I had some problems looking for a html2text ruby library that'll do what I need - simple, clear text output, so I wrote my own.
So, in case somebody else is looking for this little function (but irritating to be missing) here's my shot at getting a similar-to-lynx-dump kind of output:
The code:
require 'cgi'
def html2text html
text = html.
gsub(/( |\n|\s)+/im, ' ').squeeze(' ').strip.
gsub(/<([^\s]+)[^>]*(src|href)=\s*(.?)([^>\s]*)\3[^>]*>\4<\/\1>/i, '\4')
links = []
linkregex = /<[^>]*(src|href)=\s*(.?)([^>\s]*)\2[^>]*>\s*/i
while linkregex.match(text)
links << $~[3]
text.sub!(linkregex, "[#{links.size}]")
end
text = CGI.unescapeHTML(
text.
gsub(/<(script|style)[^>]*>.*<\/\1>/im, '').
gsub(/<!--.*-->/m, '').
gsub(/<hr(| [^>]*)>/i, "___\n").
gsub(/<li(| [^>]*)>/i, "\n* ").
gsub(/<blockquote(| [^>]*)>/i, '> ').
gsub(/<(br)(| [^>]*)>/i, "\n").
gsub(/<(\/h[\d]+|p)(| [^>]*)>/i, "\n\n").
gsub(/<[^>]*>/, '')
).lstrip.gsub(/\n[ ]+/, "\n") + "\n"
for i in (0...links.size).to_a
text = text + "\n [#{i+1}] <#{CGI.unescapeHTML(links[i])}>" unless links[i].nil?
end
links = nil
text
end
Sample html input string:
<h1>Title</h1>
This is the body. Testing <a href="http://www.google.com/">link to Google</a>.<p />
Testing image <img src="/noimage.png">.<br />
The End.
The generated output string:
Title
This is the body. Testing [1]link to Google.
Testing image [2].
The End.
[1] <http://www.google.com/>
[2] </noimage.png>
Comments are welcome.
Note:
- The <> around the list of links at the bottom? they supposedly helps more arcanic e-mail programs understand that they should not be broken up into multiple lines.
- The HTML entities aren't fully converted by CGI.unescapeHTML(e.g. —) if there's a better method to use, lemme know