Auto-discovery of consensus tags for events

Edward O'Connor

EVDB Inc.

Identity tags v. Descriptive tags

  • When you tag something "etech,"
    • you don't mean "emerging technology,"
    • you mean "Emerging Technology."
  • Tech events (SxSW, ETech, etc.) seem to develop these identity tags
    • via community consensus which emerges organically

Why is this interesting?

  • easily find remote resources related to the event
    • event photos at flickr
    • transcripts / liveblogging
    • podcasts
    • etc.
  • Auto-suggest meaningful, human-generated tags to other humans
  • So let's try to find these tags!

How?

  • Get tags related to the event somehow
  • ...
  • Profit!

1. Get tags related to the event

(Figure: http://tagcamp.org/ tags at del.icio.us)
  • EVDB API gets us the official URL for a tech event
    • e.g. Tag Camp's site is http://tagcamp.org/
  • Get tags of this URL via del.icio.us
  • Technorati API points to blog posts which link to the URL
  • Look in those blog posts for rel="tag"
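A minimal sketch of the last bullet, pulling rel="tag" links out of a blog post's HTML. It assumes the usual rel="tag" convention that the tag name is the last path segment of the link's href. The names RelTagParser and extract_rel_tags are made up for illustration, and fetching the posts via the EVDB and Technorati APIs is left out.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class RelTagParser(HTMLParser):
    """Collect tag names from <a rel="tag" href=".../tagname"> links."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # rel may hold several space-separated values, e.g. rel="tag nofollow"
        rel_values = (attrs.get("rel") or "").lower().split()
        href = attrs.get("href")
        if "tag" in rel_values and href:
            # Assumption: the tag name is the last path segment of the href.
            path = urlparse(href).path.rstrip("/")
            segment = path.rsplit("/", 1)[-1]
            name = unquote(segment.replace("+", " "))
            if name:
                self.tags.append(name.lower())


def extract_rel_tags(html):
    """Return the lowercased rel="tag" names found in an HTML string."""
    parser = RelTagParser()
    parser.feed(html)
    return parser.tags
```

Lowercasing is a design choice here so that "TagCamp" and "tagcamp" count as the same tag; the original talk does not say how case was handled.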

2. Filter out the dirty data

  • avoid idiosyncratic data
    • drop tags only appearing once in relation to this URL
    • drop tags only appearing once at all
  • drop tags matching known-to-be-generic "stop words"
    • this is where domain knowledge comes in
    • e.g. "technology," "conference"
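The filtering rules above could be sketched roughly as follows. The function name filter_tags and the two count dictionaries (per-URL counts and global counts) are assumptions for illustration; the stop-word set contains only the two examples from the slide and would need extending for real use.

```python
# Stop words taken from the slide's two examples; a real list would be longer.
STOP_WORDS = {"technology", "conference"}


def filter_tags(url_counts, global_counts, stop_words=STOP_WORDS):
    """Drop idiosyncratic and generic tags.

    url_counts:    {tag: times the tag was applied to this event's URL}
    global_counts: {tag: times the tag occurs anywhere}
    """
    kept = {}
    for tag, count in url_counts.items():
        if count <= 1:
            continue  # appears only once in relation to this URL
        if global_counts.get(tag, 0) <= 1:
            continue  # appears only once at all
        if tag in stop_words:
            continue  # known-to-be-generic stop word
        kept[tag] = count
    return kept
```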

3. Do some surprisingly easy math

  • We want the tags which often appear with this event, but which don't often appear without the event.
  • fitness = tagshare(tag) / popularity(tag)
    • tagshare: the ratio of (uses of this tag for this URL) to (uses of any tag for this URL)
    • tag popularity: the ratio of (this tag's occurrences anywhere) to (the total number of all tag occurrences)
  • To calculate tag popularity, ask Technorati for global occurrence counts for each tag
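As a rough sketch, the fitness formula above translates almost directly into code. The function name fitness_scores and its parameters are hypothetical; in particular, global_total (the total number of all tag occurrences) is taken as a given input here, and the per-URL total is simply the sum of the observed per-tag counts.

```python
def fitness_scores(url_counts, global_counts, global_total):
    """Rank tags by fitness = tagshare(tag) / popularity(tag).

    url_counts:    {tag: times the tag was applied to this event's URL}
    global_counts: {tag: times the tag occurs anywhere}
    global_total:  total number of all tag occurrences (assumed known)
    """
    url_total = sum(url_counts.values())
    scores = {}
    for tag, count in url_counts.items():
        global_count = global_counts.get(tag, 0)
        if url_total == 0 or global_count == 0:
            continue  # cannot score a tag with no global count
        tagshare = count / url_total
        popularity = global_count / global_total
        scores[tag] = tagshare / popularity
    # Highest fitness first: frequent for this event, rare elsewhere.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

For example, with url_counts = {"tagcamp": 6, "web": 4}, global_counts = {"tagcamp": 10, "web": 1000} and global_total = 10000, "tagcamp" ranks far above "web": both are common for the URL, but "web" is also common everywhere else.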

So, how did we do?

SxSW, ETech, Tag Camp

  • http://2006.sxsw.com/
    sxsw 19.1%
    interactive 8.8%
  • http://conferences.oreillynet.com/etech/
    etech 5%
  • http://tagcamp.org/
    tagcamp 12.5%
    web2.0 7.9%
    tags 6.8%

Web Essentials, Web 2.0

  • http://we05.com/
    we05 8.3%
    css 7.7%
    web 6.2%
    web2.0 5.3%
  • http://www.web2con.com/
    web2.0 21.6%
    web 7.95%
    internet 6.13%

MacWorld Expo, When 2.0

  • http://www.macworldexpo.com/live/20/
    nothing
  • http://www.release1-0.com/events/When2index.php
    nothing
  • In general, I either
    • found a decent consensus tag for an event, or
    • didn't find many tags at all related to the event.

Known limitations

  • Only looked at first 100 pages returned by Technorati search
  • No total tag count from del.icio.us, only observed totals for each tag

Reflection

  • Various web APIs came in handy (EVDB, Technorati)
  • rel="tag" microformat makes finding blog tags easy
  • Had to scrape del.icio.us
    • del.icio.us' API is oriented exclusively toward a user's own bookmarks
    • del.icio.us' markup doesn't use rel="tag"

Summary

  • It's relatively easy to find consensus tags
  • You don't need to be a statistician to do it
  • Applications:
    • easier to locate information related to event
    • assist community to converge on consensus tag
    • auto-tagging with human-originated tags

Thanks!

CAUTION: Made with secret alien technology - Lisp!