Auto-discovery of consensus tags for events
Identity tags vs. Descriptive tags
- When you tag something "etech,"
- you don't mean "emerging technology,"
- you mean "Emerging Technology."
- Tech events (SxSW, ETech, etc.) seem to develop these
- via community consensus which emerges organically
Why is this interesting?
- easily find remote resources related to the event
- event photos at flickr
- transcripts / liveblogging
- Auto-suggest meaningful, human-generated tags to other humans
- So let's try to find these tags!
- Get tags related to the event somehow
1. Get tags related to the event
EVDB API gets us the official URL for a tech event
- e.g. Tag Camp's site is
- Get tags of this URL via del.icio.us
- Technorati API points to blog posts which link to the URL
- Look in those blog posts for rel="tag" links
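As a rough illustration of pulling tags out of a fetched blog post, the sketch below looks for links marked with the rel="tag" microformat that these notes mention later. It uses only the Python standard library. The class and function names are hypothetical. Treating "+" as a space and lowercasing the tag are assumptions, not something the talk specifies.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, unquote


class RelTagParser(HTMLParser):
    """Collects tag names from <a rel="tag" href="..."> links (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # rel may hold several space-separated values, e.g. rel="nofollow tag"
        rel_values = (attrs.get("rel") or "").lower().split()
        href = attrs.get("href")
        if "tag" in rel_values and href:
            # In the microformat, the tag is the last path segment of the href.
            path = urlparse(href).path.rstrip("/")
            name = unquote(path.rsplit("/", 1)[-1]).replace("+", " ")
            if name:
                # Assumption: fold case so "ETech" and "etech" count together.
                self.tags.append(name.lower())


def extract_rel_tags(html):
    """Return the list of rel="tag" tag names found in an HTML string."""
    parser = RelTagParser()
    parser.feed(html)
    return parser.tags
```

Feeding each blog post found via the link search through `extract_rel_tags` and counting the results would give the per-event tag counts used in the later steps.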
2. Filter out the dirty data
- avoid idiosyncratic data
- drop tags only appearing once in relation to this URL
- drop tags only appearing once at all
- drop tags matching known-to-be-generic "stop words"
- get to rely on domain knowledge
- e.g. "technology," "conference"
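The three filters above can be sketched in a few lines. This is only an illustration: the function name and the dictionary shapes (tag to count) are assumptions, and the stop-word set holds just the two examples from the notes.

```python
# Generic tags known from domain knowledge; the notes give these two examples.
STOP_WORDS = {"technology", "conference"}


def filter_tags(event_counts, global_counts, stop_words=STOP_WORDS):
    """Drop idiosyncratic and generic tags.

    event_counts:  tag -> times the tag was used for this event's URL
    global_counts: tag -> times the tag was seen anywhere
    """
    kept = {}
    for tag, count in event_counts.items():
        if count <= 1:
            continue  # only appears once in relation to this URL
        if global_counts.get(tag, 0) <= 1:
            continue  # only appears once at all
        if tag in stop_words:
            continue  # known-to-be-generic tag
        kept[tag] = count
    return kept
```

For example, given a tag used once for the event, a tag seen only once globally, and "technology", only the remaining tags survive.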
3. Do some surprisingly easy math
- We want the tags which often appear with this event, but which
don't often appear without the event.
- fitness = tagshare(tag) / popularity(tag)
- tag share:
the proportion of (this tag being used for this URL) to (any tag
being used for this URL)
- tag popularity:
the proportion of (this tag's occurrences) to (the total
number of all tag occurrences)
- To calculate tag popularity, ask Technorati for global occurrence
counts for each tag
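A minimal sketch of the fitness calculation, following the two definitions above. The function names and the count dictionaries are assumptions for illustration. `global_total` (the total number of all tag occurrences) is also an assumed input. Because it is the same for every tag, a rough estimate changes the scores but not the ranking.

```python
def tag_share(tag, event_counts):
    """Proportion of this tag's uses for the URL to all tag uses for the URL."""
    return event_counts[tag] / sum(event_counts.values())


def popularity(tag, global_counts, global_total):
    """Proportion of this tag's occurrences to all tag occurrences."""
    return global_counts[tag] / global_total


def rank_by_fitness(event_counts, global_counts, global_total):
    """Return (tag, fitness) pairs, best candidate first."""
    scores = {}
    for tag in event_counts:
        if global_counts.get(tag, 0) == 0:
            continue  # no global count available; skip to avoid dividing by zero
        scores[tag] = (tag_share(tag, event_counts)
                       / popularity(tag, global_counts, global_total))
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Made-up counts, purely to show the call shape.
ranked = rank_by_fitness(
    event_counts={"etech": 30, "web2.0": 10},
    global_counts={"etech": 300, "web2.0": 10000},
    global_total=1_000_000,
)
```

With these made-up numbers, "etech" ranks above "web2.0": it takes a larger share of the event's tags and is far rarer globally, which is exactly the pattern the fitness ratio rewards.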
So, how did we do?
SxSW, ETech, Tag Camp
Web Essentials, Web 2.0
MacWorld Expo, When 2.0
- In general, I either
- found a decent consensus tag for an event, or
- didn't find many tags at all related to the event.
- Only looked at first 100 pages returned by Technorati search
- No total tag count from del.icio.us, only observed totals for each tag
- Various web APIs came in handy (EVDB, Technorati)
rel="tag" microformat makes finding blog tags easy
- Had to scrape del.icio.us
- del.icio.us' API is oriented exclusively toward a user's own bookmarks
- del.icio.us' markup doesn't use rel="tag"
- It's relatively easy to find consensus tags
- You don't need to be a statistician to do it
- easier to locate information related to event
- assist community to converge on consensus tag
- auto-tagging with human-originated tags
- Any questions?
- gratuitous geekery