Processing real-world HTML: a quick introduction to `html5lib`


So you’ve got some HTML.

You found it out in the wild.

Or some user typed it into a form in your webapp.


 <B>Hi, <I>Joe</b>!
 <p/>
 So good to </i><BLINK>finally
 meet you & stuff.

…

You have got to be kidding me.

Tag Soup

Browsers handle such markup well and mostly uniformly.

The browser vendors have spent countless developer-hours reverse-engineering each other’s error recovery methods.

But tool developers have, historically, been screwed.

We usually resort to running text through Tidy, Beautiful Soup, or something similar.

These tools have their own tag soup error recovery, which often doesn’t match what browsers do.

When your error recovery doesn’t match browsers’ error recovery, users get screwed. Your app is buggy.
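To see why recovery is the hard part, here is a minimal sketch that uses only the Python standard library's `html.parser` (not `html5lib`). It reports tags in source order and leaves all repair to the caller. The `NestingChecker` class and the `SOUP` constant are made up for this illustration.

```python
from html.parser import HTMLParser

# The markup from the opening slide.
SOUP = "<B>Hi, <I>Joe</b>!\n<p/>\nSo good to </i><BLINK>finally\nmeet you & stuff."


class NestingChecker(HTMLParser):
    """Track open tags and record end tags that do not match the innermost one."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased.
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            # A naive tool has to invent its own recovery rule here.
            self.problems.append(tag)


checker = NestingChecker()
checker.feed(SOUP)
checker.close()

print("mis-nested end tags:", checker.problems)
print("still open at end of input:", checker.open_tags)
```

Whatever rule a tool picks at the marked spot decides whether its tree matches the one a browser builds.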

This was the state of the art in 2004.

HTML 5

Standardizing an HTML parsing algorithm that matches browser behavior.

`html5lib`

An implementation of the HTML 5 parsing algorithm in Ruby and Python (including Python 3).

Really easy to use.


 import html5lib
 # Open in binary mode so html5lib can detect the character encoding itself
 with open("mydocument.html", "rb") as f:
     parser = html5lib.HTMLParser()
     document = parser.parse(f)

Just as easy in Ruby.


 require 'html5lib/html5parser'
 include HTML5
 f = File.open("mydocument.html")
 document = HTMLParser.parse(f)

Thousands of tests

Python html5lib implements the spec so well, it even implements an infinite loop. — @gsnedders

fixed in html5lib 8 days ago: revision 21ce65db1e
fixed in HTML5 spec yesterday: r3538

Tree building

Plugs into your favorite DOM or DOM-like API

Python: minidom, ElementTree, lxml, Beautiful Soup
Ruby: REXML, Hpricot
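A short sketch of choosing a tree type in Python, assuming a recent `html5lib` release where `html5lib.parse` accepts a `treebuilder` argument (older releases may differ). The `SOUP` constant is the markup from the opening slide; the variable names are illustrative.

```python
import html5lib

SOUP = "<B>Hi, <I>Joe</b>!\n<p/>\nSo good to </i><BLINK>finally\nmeet you & stuff."

# Default tree type: xml.etree.ElementTree, with HTML elements
# placed in the XHTML namespace.
etree_root = html5lib.parse(SOUP)
print(etree_root.tag)

# Same input, built as an xml.dom.minidom Document instead.
dom_doc = html5lib.parse(SOUP, treebuilder="dom")

# The parser supplies the <html>, <head>, and <body> elements
# that the source never wrote.
body = dom_doc.getElementsByTagName("body")[0]
print(body.toxml())
```

The parsing step is identical in both calls; only the objects you get back change.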

Tree walking

Python: dom, ElementTree, genshi, lxml, pulldom, Beautiful Soup
Ruby: REXML, Hpricot
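A tree walker turns a tree back into a stream of tokens, which is what the filters and the serializer consume. A minimal sketch, assuming a recent Python `html5lib` that provides `getTreeWalker` and `HTMLSerializer`; the input string and variable names are illustrative.

```python
import html5lib
from html5lib.serializer import HTMLSerializer

# Unclosed <b> and <p>; the parser closes them in the tree.
document = html5lib.parse("<p>Hello <b>world")

# Pick the walker that matches the tree type (ElementTree here).
walker = html5lib.getTreeWalker("etree")

# Each token is a dict with a "type" key; start tags also carry a "name".
start_tags = [tok["name"] for tok in walker(document) if tok["type"] == "StartTag"]
print(start_tags)

# Feed a fresh token stream to the serializer to get markup back out.
serializer = HTMLSerializer(omit_optional_tags=False)
print(serializer.render(walker(document)))
```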

Filters

Sanitizer (whitelists)
Conformance checker (validator)

Liberal character set detection (`chardet`)

My skeertuig is vol palings • حَوّامتي مُمْتِلئة بِأَنْقَلَيْسون • Իմ օդաթիռը լի է օձաձկերով • 我的氣墊船裝滿了鱔魚 • Mia kusenveturilo estas plena je angiloj • הרחפת שלי מלאה בצלופחים

Infoset coercion (`ihatexml.py`)

Can happily take in real-world HTML as input into an XML toolchain
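XML is stricter about element and attribute names than real-world HTML, so some names have to be rewritten before the tree can go into an XML tool. The sketch below shows the general idea only. It is not the code in `ihatexml.py`: `coerce_xml_name`, the ASCII-only rules, and the `U` + hex escape are all assumptions made for this example.

```python
import re

# Simplified, ASCII-only approximation of XML name rules. Real XML also
# allows many non-ASCII characters, and colons, which this sketch escapes.
_INVALID_START = re.compile(r"[^A-Za-z_]")
_INVALID_REST = re.compile(r"[^A-Za-z0-9_.\-]")


def _escape(match):
    # Replace the offending character with "U" plus its code point in hex.
    return "U%05X" % ord(match.group(0))


def coerce_xml_name(name):
    """Return a name that is safe to use as an XML element or attribute name."""
    if not name:
        return name
    first = _INVALID_START.sub(_escape, name[0])
    rest = _INVALID_REST.sub(_escape, name[1:])
    return first + rest


print(coerce_xml_name("data-id"))   # already valid, returned unchanged
print(coerce_xml_name("1st"))       # a digit cannot start an XML name
print(coerce_xml_name("on@click"))  # "@" is not allowed anywhere in a name
```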

Liberal XML parser

Think the Universal Feed Parser, but for any XML.

Questions?

http://tess.oconnor.cx/2009/08/djangosd-html5lib

CC BY-SA 3.0