Processing real-world HTML: a quick introduction to html5lib
Theresa O’Connor
So you’ve got some HTML.
You found it out in the wild.
Or some user typed it into a form in your webapp.
<B>Hi, <I>Joe</b>!
<p/>
So good to </i><BLINK>finally
meet you & stuff.
…
You have got to be kidding me.

Tag Soup
Browsers handle such markup well and mostly uniformly.
The browser vendors have spent countless developer-hours reverse-engineering each others’ error recovery methods.
But tool developers have, historically, been screwed.
We usually resort to running text through Tidy, Beautiful Soup, or something similar.
These tools have their own tag soup error recovery, which often doesn’t match what browsers do.
When your error recovery doesn’t match browsers’ error recovery, users get screwed. Your app is buggy.
This was the state of the art in 2004.
HTML 5
Standardizing an HTML parsing algorithm that matches browser behavior.
html5lib
An implementation of the HTML 5 parsing algorithm in Ruby and Python (including Python 3).
Really easy to use.
import html5lib
# Open in binary mode so html5lib can work out the character encoding itself.
with open("mydocument.html", "rb") as f:
    parser = html5lib.HTMLParser()
    document = parser.parse(f)
Just as easy in Ruby.
require 'html5lib/html5parser'
include HTML5
f = File.open("mydocument.html")
document = HTMLParser.parse(f)
Thousands of tests
Python
html5lib implements the spec so well, it even implements an infinite loop.
— @gsnedders
fixed in html5lib 8 days ago: revision 21ce65db1e
fixed in HTML5 spec yesterday: r3538
Tree building
Plugs into your favorite DOM or DOM-like API
- Python: minidom, ElementTree, lxml, Beautiful Soup
- Ruby: REXML, Hpricot
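A minimal sketch of choosing a tree builder in Python, assuming a recent html5lib release where `html5lib.parse` accepts a `treebuilder` name. The input is the first line of the tag soup shown earlier; the variable names are illustrative only.

```python
import html5lib

# First line of the tag soup from earlier in the talk.
soup = "<B>Hi, <I>Joe</b>!"

# Same input, two different tree representations.
# "etree" builds xml.etree.ElementTree elements; "dom" builds xml.dom.minidom nodes.
# namespaceHTMLElements=False keeps tag names free of the XHTML namespace prefix.
etree_root = html5lib.parse(soup, treebuilder="etree", namespaceHTMLElements=False)
dom_doc = html5lib.parse(soup, treebuilder="dom")

# ElementTree: parse() hands back the root <html> element.
etree_tags = [el.tag for el in etree_root.iter()]

# minidom: parse() hands back a Document node with the usual DOM methods.
dom_bold = dom_doc.getElementsByTagName("b")

print(etree_tags)
print(len(dom_bold))
```

Either way the parser supplies the html, head, and body elements the input left out, and tag names come back lowercased.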
Tree walking
- Python: dom, ElementTree, genshi, lxml, pulldom, Beautiful Soup
- Ruby: REXML, Hpricot
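A tree walker turns a parsed tree back into a stream of tokens, which is what the serializer and the filters consume. A rough sketch, assuming a recent Python html5lib with `getTreeWalker` and `html5lib.serializer.HTMLSerializer`:

```python
import html5lib
from html5lib.serializer import HTMLSerializer

soup = "<B>Hi, <I>Joe</b>!"
tree = html5lib.parse(soup, treebuilder="etree")

# Pick the walker that matches the tree type we built.
walker = html5lib.getTreeWalker("etree")

# Each token is a dict with a "type" key such as "StartTag" or "Characters".
token_types = [token["type"] for token in walker(tree)]

# Feed a fresh walk of the same tree to the serializer to get markup back out.
# omit_optional_tags=False keeps <html>, <head>, and <body> in the output.
serializer = HTMLSerializer(omit_optional_tags=False)
html = serializer.render(walker(tree))

print(token_types)
print(html)
```

The output is well-formed markup with the mis-nested b and i elements repaired the way the parsing algorithm prescribes.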
Filters
- Sanitizer (whitelists)
- Conformance checker (validator)
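A sketch of the whitelist sanitizer applied to user-submitted markup, assuming a Python html5lib release where `html5lib.serialize` accepts `sanitize=True`. Recent releases mark this sanitizer as deprecated and may emit a warning, so treat it as an illustration of the idea. The `untrusted` string is made up for the example.

```python
import html5lib

# Hypothetical form input: an event-handler attribute and a script element,
# neither of which is on the sanitizer's whitelist.
untrusted = '<p onclick="steal()">Hi <script>alert(1)</script><b>Joe</b></p>'

# Parse as a fragment, since form input is not a whole document.
fragment = html5lib.parseFragment(untrusted, treebuilder="etree")

# sanitize=True runs the token stream through the whitelist filter
# before it is serialized back to markup.
clean = html5lib.serialize(fragment, tree="etree", sanitize=True)

print(clean)
```

Whitelisted elements such as p and b pass through, while anything not on the list is dropped or escaped so that it cannot execute.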
Liberal character set detection (chardet)
My skeertuig is vol palings • حَوّامتي مُمْتِلئة بِأَنْقَلَيْسون • Իմ օդաթիռը լի է օձաձկերով • 我的氣墊船裝滿了鱔魚 • Mia kusenveturilo estas plena je angiloj • הרחפת שלי מלאה בצלופחים
Infoset coercion (ihatexml.py)
Can happily take in real-world HTML as input into an XML toolchain
Liberal XML parser
Think the Universal Feed Parser, but for any XML.