Thoughts on converting HTML to Atom
For better or worse, HTML is The Format. So why not use HTML as your underlying data format, producing other formats from it as needed? I’m convinced that this is how we should be thinking about webarch long-term. Other formats come and go, while HTML remains an unbroken line.
Ironically, my blog software uses Atom for its backing store.
With Microformats, of course, this is already being done. Authors link to services like H2VX to provide people with downloadable, subscribable vCard and iCalendar feeds. You can use hAtom2Atom to generate an Atom feed from your blog. Software like Falcon uses HTML+hAtom for backend storage.
HTML5 contains several new semantic elements, derived in part from the ways Web authors have used class="" to add semantics to their markup. <article> and <time pubdate> provide, in native HTML, similar functionality to what the hAtom microformat provides in HTML 4. (When HTML5’s new elements are more widely implemented, you can expect hAtom to be updated to encourage the use of these new elements where possible.)
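To make the parallel with hAtom concrete, here is a minimal sketch in Python of pulling publication dates out of <article> and <time pubdate> markup. It is not the conversion algorithm from the HTML5 spec; the class name ArticleDates and the helper article_dates are made up for illustration, and it assumes each date is given in a datetime="" attribute.

```python
from html.parser import HTMLParser


class ArticleDates(HTMLParser):
    """Collect the first <time pubdate> datetime of each top-level <article>.

    Illustrative only: this is a simplification, not the HTML5 algorithm.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <article> elements are currently open
        self.dates = []   # one slot per top-level <article>, None if undated

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "article":
            self.depth += 1
            if self.depth == 1:
                self.dates.append(None)
        elif tag == "time" and self.depth == 1 and "pubdate" in attrs:
            # Only dates directly inside a top-level article count here;
            # dates inside nested articles (e.g. comments) are ignored.
            if self.dates[-1] is None:
                self.dates[-1] = attrs.get("datetime")

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1


def article_dates(html):
    """Return a list with one publication date (or None) per article."""
    parser = ArticleDates()
    parser.feed(html)
    parser.close()
    return parser.dates
```

Calling article_dates() on a page with one dated and one undated article returns a two-item list whose second item is None, which is exactly the kind of incomplete input a converter has to cope with.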
Hixie’s not the sort of guy to leave things underspecified, so HTML5 defines (in excruciating detail) how to convert an HTML document to Atom, even when the HTML document in question is, shall we say, less than ideal. Because documents can vary wildly in quality, it is not always possible to extract an appropriate <atom:id> for the articles present. We prefer to handle errors rather than reject the input, so in such cases the algorithm allows the HTML→Atom software to make one up. Because different software, or even the same software invoked in different circumstances, may make up different IDs, this doesn’t quite live up to the requirements of the Atom spec, which expects an entry’s ID to be permanent and universally unique. This is unfortunate, but is basically unavoidable in any such algorithm (hAtom, for instance, punts on this issue entirely: it only defines the conversion for valid input).
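The ID problem is easy to see in a small sketch. The Python below is not what the HTML5 algorithm specifies; the function names entry_id, made_up_id, and derived_id are hypothetical, and the two fallback strategies are just plausible choices a converter might make.

```python
import uuid


def made_up_id():
    """Invent a fresh ID. Every call yields a different value."""
    return uuid.uuid4().urn


def derived_id(document_url, content):
    """Derive an ID from the page URL and article text.

    Stable across runs for identical input, but it changes as soon as the
    article is edited, and another converter may derive it differently.
    """
    return uuid.uuid5(uuid.NAMESPACE_URL, document_url + "\n" + content).urn


def entry_id(bookmark_url=None):
    """Return (id, was_invented) for one article.

    Use the article's own permalink when the markup supplies one;
    otherwise fall back to an invented ID.
    """
    if bookmark_url:
        return bookmark_url, False
    return made_up_id(), True
```

Neither fallback gives you an ID that stays the same across time and across tools: the random one differs on every run, and the derived one is only as stable as the content and the choice of hashing scheme.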