Auto-discovery of consensus tags for events
Identity tags vs. descriptive tags
- When you tag something "etech,"
- you don't mean "emerging technology,"
- you mean "Emerging Technology."
- Tech events (SxSW, ETech, etc.) seem to develop these identity tags via community consensus, which emerges organically
Why is this interesting?
- easily find remote resources related to the event
- event photos at flickr
- transcripts / liveblogging
- podcasts
- etc.
- Auto-suggest meaningful, human-generated tags to other humans
- So let's try to find these tags!
How?
- Get tags related to the event somehow
- ...
- Profit!
1. Get tags related to the event
- EVDB API gets us the official URL for a tech event
- e.g. Tag Camp's site is http://tagcamp.org/
- Get tags of this URL via del.icio.us
- Technorati API points to blog posts which link to the URL
- Look in those blog posts for rel="tag" links
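The last bullet can be sketched in a few lines of Python. This is only an illustration, not the code used for the talk: the class and function names are made up here, and it assumes the blog post HTML has already been fetched. Following the rel="tag" microformat, the tag name is taken from the last path segment of the link's href.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, unquote


class RelTagParser(HTMLParser):
    """Collect tag names from <a rel="tag" href="..."> links (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # rel may hold several space-separated values, e.g. rel="nofollow tag"
        rel_values = (attrs.get("rel") or "").lower().split()
        href = attrs.get("href")
        if "tag" in rel_values and href:
            # The tag name is the last path segment of the href
            path = urlparse(href).path.rstrip("/")
            name = unquote(path.rsplit("/", 1)[-1]).replace("+", " ")
            if name:
                self.tags.append(name.lower())


def extract_rel_tags(html):
    """Return the list of rel="tag" tag names found in an HTML string."""
    parser = RelTagParser()
    parser.feed(html)
    return parser.tags
```

Lowercasing the names is an assumption made here so that "TagCamp" and "tagcamp" count as the same tag.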
2. Filter out the dirty data
- avoid idiosyncratic data
- drop tags only appearing once in relation to this URL
- drop tags only appearing once at all
- drop tags matching known-to-be-generic "stop words"
- get to rely on domain knowledge
- e.g. "technology," "conference"
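The three filters above might look like this in Python. It is a minimal sketch under assumptions: `url_counts` (how often each tag was seen for this URL) and `global_counts` (how often each tag was seen anywhere) are hypothetical dictionaries, and the stop-word list contains only the two examples from the slide.

```python
# Hand-maintained list; only the two examples from the slide are included here
STOP_WORDS = {"technology", "conference"}


def filter_tags(url_counts, global_counts, stop_words=STOP_WORDS):
    """Drop idiosyncratic and generic tags; return the surviving per-URL counts."""
    kept = {}
    for tag, count in url_counts.items():
        if count <= 1:
            continue  # only appeared once in relation to this URL
        if global_counts.get(tag, 0) <= 1:
            continue  # only appeared once at all
        if tag in stop_words:
            continue  # known-to-be-generic stop word
        kept[tag] = count
    return kept
```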
3. Do some surprisingly easy math
- We want the tags which often appear with this event, but which don't often appear without the event.
- fitness = tagshare(tag) / popularity(tag)
- tagshare: (uses of this tag for this URL) / (uses of any tag for this URL)
- tag popularity: (occurrences of this tag anywhere) / (total occurrences of all tags)
- To calculate tag popularity, ask Technorati for global occurrence counts for each tag
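The fitness formula can be written out directly. This is a sketch, not the original implementation: the function and parameter names are invented for illustration, and `total_global` (the total number of all tag occurrences) is assumed to be supplied by the caller. The per-URL total is taken as the sum of the observed per-tag counts.

```python
def fitness_scores(url_counts, global_counts, total_global):
    """Score each tag as tagshare(tag) / popularity(tag).

    url_counts:    tag -> times the tag was used for this URL
    global_counts: tag -> times the tag was used anywhere
    total_global:  total number of all tag occurrences (assumed known)
    """
    total_for_url = sum(url_counts.values())
    scores = {}
    for tag, count in url_counts.items():
        global_count = global_counts.get(tag, 0)
        if total_for_url == 0 or global_count == 0:
            continue  # no data to compute a ratio from
        tagshare = count / total_for_url
        popularity = global_count / total_global
        scores[tag] = tagshare / popularity
    return scores


def rank_tags(scores):
    """Return tags ordered from most to least fit."""
    return sorted(scores, key=scores.get, reverse=True)
```

Because `total_global` is the same for every tag, it scales all scores by one constant and does not change the ranking, so a rough estimate is enough if only the ordering matters.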
So, how did we do?
SxSW, ETech, Tag Camp
- http://2006.sxsw.com/
sxsw | 19.1%
interactive | 8.8%
- http://conferences.oreillynet.com/etech/
- http://tagcamp.org/
tagcamp | 12.5%
web2.0 | 7.9%
tags | 6.8%
Web Essentials, Web 2.0
- http://we05.com/
we05 | 8.3%
css | 7.7%
web | 6.2%
web2.0 | 5.3%
- http://www.web2con.com/
web2.0 | 21.6%
web | 7.95%
internet | 6.13%
MacWorld Expo, When 2.0
- http://www.macworldexpo.com/live/20/
- http://www.release1-0.com/events/When2index.php
- In general, I either
- found a decent consensus tag for an event, or
- didn't find many tags at all related to the event.
Known limitations
- Only looked at first 100 pages returned by Technorati search
- No total tag count from del.icio.us, only observed totals for each tag
Reflection
- Various web APIs came in handy (EVDB, Technorati)
- The rel="tag" microformat makes finding blog tags easy
- Had to scrape del.icio.us
- del.icio.us' API is oriented exclusively toward a user's own bookmarks
- del.icio.us' markup doesn't use rel="tag"
Summary
- It's relatively easy to find consensus tags
- You don't need to be a statistician to do it
- Applications:
- easier to locate information related to event
- assist community to converge on consensus tag
- auto-tagging with human-originated tags
Thanks!
- Any questions?
- gratuitous geekery