Theresa O’Connor

Representing tags in Atom

In which I try to figure out how to best represent tagging in the Atom syndication format.

Update (March 2009): WordPress.com has adopted this approach!

Tim Bray asks (links mine):

In Atom, categories have schemes. What scheme should we use for tags?

Since atom:category’s @scheme identifies the means of interpreting @term[1], using @scheme to indicate “this atom:category is a tag” seems perfectly reasonable to me. But maybe we should back up a bit. The more general question is: how should we represent tags in Atom? Tim makes the same assumption I’ve been making: that atom:category is the natural and correct element for tagging in Atom. While this seems obviously true (I think of tagging as a particular form of categorization), perhaps some other representation would work better. Aristotle Pagaltzis, for instance, proposed the use of atom:link instead. So what’s to be done?

What do we want in a tag representation?

Here are some properties that I think a great representation of tags in Atom would have. Note that I doubt any solution could manage to have all of them.

  1. No elements or attributes outside of those present in RFC 4287 — i.e., no extensions. Attribute values requiring registration or standardization suffer somewhat on this point.

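  2. DRY (Don’t Repeat Yourself): no piece of information should appear more than once in the representation.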

  3. Would (or at least could) provide both the human-readable and normalized version of the tag. Flickr (and many other sites) normalize tags like “San Diego” to “sandiego” — for example, see this photo of mine.

  4. Would provide a dereferenceable URI to something about the tag. In the typical blog context, a blog post tagged “cat” should have a link to a list of other posts on the same blog tagged “cat.” It would be especially awesome if this link were available in Atom processors unaware of this tagging technique.

  5. The URL structure of tags in the relevant system would be a tag space (a URL prefix to which a tag can be appended to reach a page about that tag), and the Atom representation of a tag would provide a dereferenceable URI to the tag space.

    Using a tag space for your tags strikes me as nice in several ways. For one, it follows existing practice: Flickr and del.icio.us both do so, as do several others. Secondly, tag space URIs are nicely hackable. I often pull up photos of mine by just typing in http://flickr.com/photos/hober/tags/foo, where foo is some tag I vaguely remember placing on the photo. Operator takes advantage of the ubiquity of tag spaces by offering to look up tags that it finds on pages on a variety of services:

    [Figure: Operator’s handling of a blog post’s “bzr” tag.]

    Basically, tag spaces are what make tags truly portable across the Web.

  6. It should be possible for an Atom processor to know that this is a tag and not some other thing, without local knowledge of the site in question.

    That is, it should be possible to distinguish between an atom:category used as a tag and an atom:category used for some other purpose. (The same goes for any other element used to carry a tag.)

  7. It should be possible for an Atom processor to extract the (normalized) tag from the element in which it’s stored without parsing some attribute value or element content into pieces.

  8. All things being equal, atom:category is the preferred element to use, as tagging is a form of categorization — it’s a semantically-appropriate element for tagging.
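
To make property 3 concrete, here is a minimal sketch of producing both forms of a tag. The normalization rule (lowercase, then drop everything that isn’t a letter or digit) is only an assumption that happens to reproduce the “San Diego” to “sandiego” example; it is not Flickr’s documented algorithm, and the function names are hypothetical.

```python
import unicodedata


def normalize_tag(label: str) -> str:
    """Assumed normalization: lowercase, then keep only letters and digits."""
    folded = unicodedata.normalize("NFKC", label).lower()
    return "".join(ch for ch in folded if ch.isalnum())


def category_attrs(label: str) -> dict:
    """Attributes for an atom:category carrying this tag.

    @term holds the normalized tag; @label is added only when the
    human-readable form differs from it.
    """
    term = normalize_tag(label)
    attrs = {"term": term}
    if label != term:
        attrs["label"] = label
    return attrs


print(category_attrs("San Diego"))
```

A tag that is already in normalized form, such as “foo”, gets only a @term and no @label.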

Tim’s proposal seems to be primarily motivated by #6, and the way his question is phrased strongly implies the importance of #8 as well.

Possible representations

Here are various possible ways to represent tags in Atom, and how they fare against the above list:

  1. <category scheme="http://tess.oconnor.cx/tags/"
              term="foo" label="Foo" />

    This is how I store tags in my blog backend — as atom:category elements with the @scheme http://tess.oconnor.cx/tags/. The @term is the normalized tag, while the @label is the human-readable version (if different than @term).

    I treat this specific @scheme as a tag space: concatenating @scheme with @term produces a dereferenceable URI to a page listing posts with the tag. On this model, the mapping from atom:category to the rel-tag microformat seems quite natural:

    <category scheme="http://tess.oconnor.cx/tags/"
              term="foo" label="Foo" />

    becomes

    <a href="http://tess.oconnor.cx/tags/foo"
       rel="tag">Foo</a>

    This technique scores well on points 1, 2, 3, 5, 7, and 8.

    In order to produce a dereferenceable URI to posts with this tag (the sort of URI point 4 wants), an Atom processor would have to somehow know that concatenating @scheme with @term is something it might want to do — there’s no explicit indication of that here. That being said, such practice is already common enough in the Atom world that several people have assumed this to be the standard way to use atom:category — for examples, see these posts on atom-syntax.

    Without specific knowledge of my scheme, an Atom processor has no way of knowing that I’d like it to treat this atom:category as a tag, so this technique fails 6. A global @scheme seems to be required for tag-in-atom:category to pass 6.

  2. <category scheme="urn:tag" term="foo" label="Foo" />

    This, Tim’s proposed technique, scores well on points 2, 3, 6, 7, and 8.

    While this doesn’t introduce any extension elements or attributes, the urn:tag namespace would require standardization (which, admittedly, is underway). This makes point 1 somewhat arguable.

    This technique just doesn’t have any interesting, dereferenceable URIs (points 4 and 5), so it relies completely on the Atom processor to come up with some, and it doesn’t give the entry’s author the opportunity to signal which tag space (if any) he’d prefer.

  3. <category term="foo" label="Foo" />

    This technique is from Henry Story’s comment on Tim’s post: a category is a tag with a namespace. So don’t put a namespace (@scheme) in if you want a tag… (emphasis mine).

    This is basically technique 1, minus the tag space in @scheme. This strategy scores well on points 1, 2, 3, 6 (arguable), 7, and 8.

    As with the previous technique, this loses on 4 and 5 by not containing any interesting, dereferenceable URIs.

    Point 6 is contentious as Atom processors are under no obligation to treat categories without schemes as being tags — AFAICT, an Atom processor would be perfectly conformant to assume categories without schemes to be within some default scheme of theirs. This is especially troublesome in the APP case — I’d expect APP servers to do all sorts of crazy things with such atom:category elements.

    The rest of Henry’s comment implies that he doesn’t think this technique has much to offer itself over technique 1:

    As I see it a category is a tag with a namespace. So don’t put a namespace (scheme) in if you want a tag, but you may as well put the scheme in, since people can always treat your category as a tag (by not querying on the scheme).

  4. <link href="http://tess.oconnor.cx/tags/foo"
          rel="tag" title="Foo" />

    This technique, proposed by Aristotle, has the nice property of being directly analogous to how the rel-tag microformat is marked up in HTML.

    It scores well on points 1 (arguable), 2, 4, and 6.

    While it doesn’t introduce any extension elements or attributes, it requires IANA registration of the “tag” link relation per §7.1 of RFC 4287. So point 1 is debatable.

    It only half-loses on points 3 and 5 — on the one hand, an Atom processor that doesn’t know about the “tag” link relation wouldn’t know what this thing is, so it wouldn’t know how to find the tag space and tag in @href. On the other hand, I imagine the IANA registration for “tag” could specify the same @href parsing rules as the rel-tag microformat, thus providing rel-tag-aware Atom processors the ability to extract the tag space URI and the tag from @href. Atom processors unaware of this link relation could and presumably would display this link to the user, so all is not lost in the fallback case.

    It loses on point 7 for the same reasons outlined in the previous paragraph — extracting the tag space and tag requires knowledge of rel-tag’s @href parsing rules.

    This loses on point 8 — atom:link doesn’t pack the semantic punch of atom:category for representing tags. Though given the rel-tag microformat I don’t think this is that big of a deal.

  5. <category scheme="urn:tag" label="Foo"
              term="http://tess.oconnor.cx/tags/foo" />

    This scores well on points 2, 3, 4, 6, and 8.

    This suffers on point 1 for the same reason as technique 2.

    This suffers on points 5 and 7 for the same reason as technique 4.

  6. <category scheme="http://tess.oconnor.cx/tags/"
              term="http://tess.oconnor.cx/tags/foo"
              label="Foo" />

    This scores well on points 1, 3, 4, 5, and 8.

    This isn’t DRY — the tag space is repeated in two attribute values.

    This suffers on point 6 for the same reason as technique 1, and on point 7 for the same reason as techniques 4 and 5.
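
The difference that point 7 is getting at can be sketched in a few lines. With technique 1 the tag is simply @term, and the tag page URI comes from concatenation; with technique 4 the processor has to take @href apart. The function names are hypothetical, and the parsing rule (the last path segment is the tag) is my reading of rel-tag, simplified to ignore query strings and fragments.

```python
from typing import Tuple
from urllib.parse import quote, unquote, urlsplit


def tag_uri(scheme: str, term: str) -> str:
    """Technique 1: the tag page is @scheme + @term.

    Assumes @scheme is a tag space ending in '/'.
    """
    return scheme + quote(term, safe="")


def split_rel_tag_href(href: str) -> Tuple[str, str]:
    """Technique 4: recover (tag space, tag) from a rel-tag style @href.

    Assumes the last path segment is the tag; query and fragment are ignored.
    """
    parts = urlsplit(href)
    path = parts.path.rstrip("/")
    head, _, last_segment = path.rpartition("/")
    tag_space = f"{parts.scheme}://{parts.netloc}{head}/"
    return tag_space, unquote(last_segment)


print(tag_uri("http://tess.oconnor.cx/tags/", "foo"))
print(split_rel_tag_href("http://tess.oconnor.cx/tags/foo"))
```

Both directions are short, but only the second requires the processor to know a parsing convention that isn’t stated anywhere in the element itself.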

The techniques which fare best on point 6 — which appears to be the itch Tim’s trying to scratch — are 2, 3, 4, and 5. I’m guessing he’d eliminate technique 4 as it doesn’t use atom:category. That leaves 2, 3, and 5 for Tim. Now, to me, principles 4 and 5 are more important than 6, so I’m more inclined to support techniques 1, 4, or 6. Err. These are completely different sets of solutions. I think it’ll help to see how actual behavior in the wild matches up with these possible techniques.

Observed behavior in the wild

LiveJournal uses technique 3, without @label.

Vox uses something like technique 1, though with a per-tag specific, non-tag-space scheme. For example, consider the two tags on this post of mine: “meta” and “placeholder.” This is how Vox’s feeds represent them:

<category scheme="http://hober.vox.com/tags/meta/"
          term="meta" label="meta" />
<category scheme="http://hober.vox.com/tags/placeholder/"
          term="placeholder" label="placeholder" />

This seems suboptimal to me.

Blogger doesn’t use tags or categories at all.

WordPress.com appears to be using an early version of WordPress’ Atom 1.0 support — as of this writing, its atom:ids are empty. Its tags look like this:

<category scheme='http://hober.wordpress.com'
          term='Uncategorized' />

So its use of atom:category is similar to technique 1, except that @scheme isn’t a tag space. Simply adding /tag/ to the end of @scheme would fix that, though.

Update (March 2009): as noted in the comments, WordPress.com has adopted this approach, and there's a patch pending for WordPress.org.

Thanks, Andy!

TypePad uses atom:category elements with the @scheme http://www.sixapart.com/ns/types#category, which is not a tag space, and it 404s. Ick.

Let’s see how various atompub members do things:

Granted, the plural of anecdote is not data, but it does look like deployed usage favors technique 1, or something resembling it.


So how do we deal with technique 1’s failure to adhere to principle 6? Maybe we shouldn’t care. Lenny’s comment on Tim’s post struck a chord with me:

Besides, tags hardly ever mean the same thing to two people, so why should they have the same scheme? If some application really thinks that <category scheme="http://example.org/farmer/tag/" term="apple"/> means the same thing as <category scheme="http://example.org/geek/tag/" term="apple"/>, it can just drop the scheme.

The frustrating bit of representing tags in Atom boils down to the difference between the intentional type “tag” and the representation type atom:category.[1] Lenny’s comment reveals a way out: if you want to treat an atom:category as a tag, just go ahead and do so, and ignore @scheme. Essentially, TAG-EQUAL-P should only compare @term, whereas CATEGORY-EQUAL-P should compare @scheme and @term. Which one you call depends on your purposes, and insofar as tagging goes, actual world usage implies that it’s only the @term that’s important.
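
The two predicates are small enough to sketch directly. This is an illustrative Python rendering, not code from any Atom library; the Category type and both function names are made up for the example, and the two categories are the ones from Lenny’s comment.

```python
from typing import NamedTuple, Optional


class Category(NamedTuple):
    """Hypothetical in-memory form of an atom:category element."""
    term: str
    scheme: Optional[str] = None
    label: Optional[str] = None


def category_equal(a: Category, b: Category) -> bool:
    """CATEGORY-EQUAL-P: same @scheme and same @term."""
    return (a.scheme, a.term) == (b.scheme, b.term)


def tag_equal(a: Category, b: Category) -> bool:
    """TAG-EQUAL-P: same @term; @scheme is ignored."""
    return a.term == b.term


farmer = Category("apple", "http://example.org/farmer/tag/")
geek = Category("apple", "http://example.org/geek/tag/")

print(category_equal(farmer, geek))  # the schemes differ
print(tag_equal(farmer, geek))       # only the terms are compared
```

Neither predicate looks at @label, on the assumption that it is purely for display.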

Of course, atom:category elements are useful for many more things than tagging. But when representing tags in atom:category elements, using @scheme as a tag space and @term as the tag seems like the best compromise to me.
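
As a rough sketch of what a consumer of this convention might do, the following reads the atom:category elements of an entry and emits rel-tag links. It assumes every @scheme present is a tag space ending in “/” and that @term is already URL-safe; categories without a @scheme are skipped because there is nothing to link to. The function name and the sample entry are illustrative only.

```python
import xml.etree.ElementTree as ET
from html import escape

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def rel_tag_links(entry_xml: str) -> list:
    """Turn each scheme-bearing atom:category into an HTML rel-tag anchor."""
    entry = ET.fromstring(entry_xml)
    links = []
    for cat in entry.iter(ATOM_NS + "category"):
        scheme, term = cat.get("scheme"), cat.get("term")
        if not scheme or not term:
            continue  # no tag space to link to
        label = cat.get("label", term)  # fall back to @term for display
        href = scheme + term  # assumption: @scheme is a tag space
        links.append(f'<a href="{escape(href)}" rel="tag">{escape(label)}</a>')
    return links


ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Example</title>
  <category scheme="http://tess.oconnor.cx/tags/" term="foo" label="Foo" />
</entry>"""

for link in rel_tag_links(ENTRY):
    print(link)
```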

Notes

  1. See Kent Pitman’s “The Best of Intentions: EQUAL Rights—and Wrongs—in Lisp” for more on the bugs and confusions [that] can be traced to improper attempts to recover intentional type information from representation types. Even if you’re not a Lisper, this is a great article on programming.