Thoughts on converting HTML to Atom

For better or worse, HTML is The Format. So why not use HTML as your underlying data format, producing other formats from it as needed? I’m convinced that this is how we should be thinking about webarch long-term. Other formats come and go, while HTML remains an unbroken line.

Ironically, my blog software uses Atom for its backing store.

With Microformats, of course, this is already being done. Authors link to services like H2VX to provide people with downloadable, subscribable vCard and iCalendar feeds. You can use hAtom2Atom to generate an Atom feed from your blog. Software like Falcon uses HTML+hAtom for backend storage.

HTML5 contains several new semantic elements, derived in part from the ways Web authors have used class="" to add semantics to their markup. <article> and <time pubdate> provide, in native HTML, similar functionality to what the hAtom microformat provides in HTML 4. (When HTML5’s new elements are more widely implemented, you can expect hAtom to be updated to encourage the use of these new elements where possible.)
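To make that concrete, here is a rough Python sketch of pulling feed-like data out of <article> and <time pubdate> using only the standard library. It is not the conversion algorithm HTML5 defines, just an illustration of how much the markup already tells you. The names ArticleCollector and collect_articles are made up for this example, and taking the first heading as the title is my own simplifying assumption.

```python
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class ArticleCollector(HTMLParser):
    """Collect one dict per top-level <article> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.articles = []
        self._depth = 0           # nesting depth of <article> elements
        self._in_heading = False  # True while inside the title heading

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "article":
            self._depth += 1
            if self._depth == 1:
                self.articles.append(
                    {"id": attrs.get("id"), "title": "", "published": None}
                )
        elif self._depth == 1:
            current = self.articles[-1]
            # A boolean attribute such as pubdate is reported with value None.
            if tag == "time" and "pubdate" in attrs and current["published"] is None:
                current["published"] = attrs.get("datetime")
            elif tag in HEADINGS and not current["title"]:
                # Assumption: the first heading in the article is its title.
                self._in_heading = True

    def handle_endtag(self, tag):
        if tag == "article" and self._depth > 0:
            self._depth -= 1
        elif tag in HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading and self._depth == 1:
            self.articles[-1]["title"] += data


def collect_articles(html):
    """Return a list of dicts with 'id', 'title' and 'published' keys."""
    parser = ArticleCollector()
    parser.feed(html)
    parser.close()
    for article in parser.articles:
        article["title"] = article["title"].strip()
    return parser.articles


sample = """
<article id="post-1">
  <h1>Hello</h1>
  <time pubdate datetime="2009-10-26T12:00:00Z">26 October</time>
  <p>Body text.</p>
</article>
<article><h2>No date here</h2></article>
"""
for entry in collect_articles(sample):
    print(entry)
```

The second article in the sample has neither an id nor a pubdate, which is exactly the sort of less-than-ideal input a real converter has to cope with.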

Hixie’s not the sort of guy to leave things underspecified, so HTML5 defines (in excruciating detail) how to convert an HTML document to Atom, even when the HTML document in question is, shall we say, less than ideal. Because documents can vary wildly in quality, it is not always possible to extract an appropriate <atom:id> for the articles present. We prefer to handle errors, so in such cases the algorithm allows the HTML→Atom software to make one up. Because different software may make up different IDs, and even the same software may do so when invoked in different circumstances, this doesn’t quite live up to the requirements of the Atom spec, which expects an entry’s ID to be permanent. This is unfortunate, but is basically unavoidable in any such algorithm (hAtom, for instance, punts on this issue entirely: it only defines the conversion for valid input).
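Here is a small Python sketch of why made-up IDs are a problem. It does not follow the rules HTML5 gives for choosing an ID. The functions entry_id and random_entry_id are hypothetical, and deriving the fallback from the page URL and title with uuid5 is just one strategy a converter might pick.

```python
import uuid


def entry_id(page_url, fragment_id, title):
    """Pick an atom:id for an article (illustrative strategy only)."""
    if fragment_id:
        # Best case: the article carries an id attribute we can link to.
        return f"{page_url}#{fragment_id}"
    # Fallback: make one up. uuid5 is deterministic, so the same URL and
    # title always produce the same ID, but editing the title changes it.
    return uuid.uuid5(uuid.NAMESPACE_URL, f"{page_url}\n{title}").urn


def random_entry_id():
    """What a different converter might do: a fresh ID on every run."""
    return uuid.uuid4().urn


url = "http://example.com/blog"
print(entry_id(url, "post-1", "Hello"))   # stable: based on the id attribute
print(entry_id(url, None, "Hello"))       # made up, but repeatable
print(entry_id(url, None, "Hello!"))      # a title edit yields a new ID
print(random_entry_id())                  # differs on every invocation
```

Two converters using these two fallbacks would never agree on an ID for the same article, and even the deterministic one drifts as soon as the content is edited. A feed reader would then see what looks like a brand new entry.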