Data longevity

by Theresa O’Connor on 22 February 2007
One of two posts in which I muse over things tangentially related to something danah boyd wrote the other day.

Perhaps the obsessive personal ownership of one’s content is nothing more than a fantasy of the techno-elite… I mean, if you’re producing content into a context, do you really want to transfer it wholesale?

danah boyd (emphasis mine)

Like Mark Pilgrim, I’m creating things now that I want to be able to read, hear, watch, search, and filter 50 years from now. Such a desire isn’t unique to the digerati — I’m sure my mother would love to be able to keep her data (family recipes, genealogical information, etc.) for the rest of her life.

Unfortunately, the means to effectively act on this desire are more or less only available to us, the hackers. Consider Mark’s extensive examples of how hard it is to keep digital data from bit-rotting — and he’s an expert if ever there were one. I can’t reasonably expect my mother to be able to keep her data long-term, and that’s just not good enough.

To me, being able to completely migrate my data — with minimal bit-rot — from system to system is the key in the never-ending and easily-lost fight to keep my data accessible over the entirety of my life.

The moment people begin thinking on the lines of data preservation, they will suddenly realize the lack of easy portability without degradation of quality of their content. This is a major, major pain in the ass. And this is — or ought to be — the topic of discussion.

Chétan Kunte (emphasis his)


There are lots of reasons to take control of your personal content management system: knowledge acquisition, staying on top of the game¹, comfort hacking, recreation. But data longevity is the key issue for me. For the same reason Bill de hÓra believes it would be a very good thing if future versions of weblog tools exported content as Atom/RSS and not some custom file format, I’m trying hard to ensure that my blog backend uses only the finest in standard document formats (Atom, HTML, microformats) for storing my content in the first place, thus helping to stave off bit-rot for as long as I can.
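
To make that concrete (this is only a sketch, with a made-up entry and URL, not my actual backend), storing a post as microformat-infused HTML amounts to little more than layering hAtom class names onto ordinary markup:

    <!-- hAtom 0.1 class names: hentry, entry-title, published, author (hCard), entry-content -->
    <div class="hentry">
      <h2 class="entry-title">Data longevity</h2>
      <abbr class="published" title="2007-02-22">22 February 2007</abbr>
      <address class="author vcard">
        <a class="fn url" href="http://example.org/">Theresa O’Connor</a>
      </address>
      <div class="entry-content">
        <p>Plain old HTML, readable by humans and machines alike.</p>
      </div>
    </div>

Any future tool that understands hAtom, or just HTML, can recover the title, date, and author of such an entry without knowing anything about the software that produced it.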

Microformat-infused HTML is a particularly good choice of format when data longevity is your goal. Many of the microformats principles, and the underlying principles of semantic markup, directly impact the potential longevity of your documents. As Tantek said (emphasis mine), “the goals of any format are data fidelity and interoperability over time and space […] formats easier for humans lead to better data fidelity.” I think formats that are easier for humans to read also lead to better interoperability over time. Long-term data fidelity and interoperability are at the core of Ian Hickson’s reasons for starting the WHATWG effort:

The original reason I got involved in this work is that I realised that the human race has written literally billions of electronic documents, but without ever actually saying how they should be processed. If, in a thousand years, someone found a trove of HTML documents and decided they would write an HTML browser to view them, they couldn’t do it! Even with the existing HTML specs — HTML4, SGML, DOM2 HTML, etc. — a perfect implementation couldn’t render the vast majority of documents as they were originally intended.

Every major Web browser vendor today spends at least half the resources allocated to developing the browser on reverse-engineering the other browsers to work out how things should work. For example, if you have:

<p> <b> Hello <i> Cruel </b> World! </i> </p>

…and then you run some JavaScript on that to show what elements there are and what their parents are, what should happen? It turns out that, before HTML5, there were no specs that defined this!

I decided that for the sake of our future generations we should document exactly how to process today’s documents, so that when they look back, they can still reimplement HTML browsers and get our data back, even if they no longer have access to Microsoft Internet Explorer’s source code. (Since IE is proprietary, it is unlikely that its source code will survive far into the future. We may have more luck with some of the other browsers, whose sources are more or less freely available.)

Once I got into actually documenting HTML for the future, I came to see that the effort could also have more immediate benefits: for example, today’s browser vendors would be able to use one spec instead of reverse-engineering each other, and we could add new features for authors.
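
For what it’s worth, the answer HTML5 now gives for that snippet (a sketch based on my reading of the parsing algorithm, so check the spec before relying on the details) is that the misnested <i> gets split into two elements, producing a tree that serializes back out roughly as:

    <p> <b> Hello <i> Cruel </i></b><i> World! </i> </p>

That is, one <i> ends up as a child of the <b>, holding “Cruel”, and a second <i> becomes a sibling of the <b>, holding “World!”. Before HTML5, as Hixie says, no spec defined even that much.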


One thing that falls out of this is my focus on making sure that my data is as nice as possible, even when the quality of the code running my site suffers for it. Building for humans first and machines second means feeding the content-gardening side of me, sometimes at the expense of the code-writing side. Unlike, say, Joe Gregorio’s and Sam Ruby’s², the code driving my blog these days is in no shape for public consumption. Eventually I hope it will be, but my priority has been on the data, not the code.

October 2008 update: Jeremy Keith gave a talk on similar matters, The Long Web, at this year’s <head> conference.

Notes

  1. Joe Clark: “When teenagers’ hobbyist blogs have better code than brand-new Web sites, somebody’s doing something wrong. And that somebody is you, the developer.”
  2. Their blog systems, 1812 and Mombo respectively, are definitely worth checking out.