Matt proposed three main topics for discussion:
- Weblog data sets
- Data quality & spam
The discussion focused on data set issues.
- Tom Lento, Cornell Univ.: we need a set of characteristics/a vocabulary for describing datasets
- Iadh Ounis, Univ. of Glasgow: weblog data set initiative at TREC;
decided to include spam in order to test spam filtering;
TREC provides uniform framework to compare research from different groups;
the data set can be adapted if necessary
- Matt Hurst, Nielsen BuzzMetrics: standard datasets are key
- Natalie Glance, Nielsen BuzzMetrics: applauds the TREC dataset initiative; asks what is being learned from it
- Matt Hurst: application-specific requirements
- Andrew Tomkins, Yahoo Research: the next question to think about is what the requirements are. Two main research areas require datasets:
(1) basic science: what's the nature of social networks; evolution of friends, etc.
(2) datasets to help the broader world use what is in blogs; the most important problem will be IR; requires dataset w/relevance judgments; don't even understand what the appropriate level of granularity is (post vs. weblog); could then drive significant end user value.
- Rawal Jatz (?): no single dataset is sufficient for testing an algorithm, because some algorithms work better on different types of data sets (e.g. with respect to sparseness)
- Tom Lento: web retrieval data sets & algorithms at Cornell University; algorithms w/different properties are tested over data sets with different properties
- Chris Brooks, UCSF: would be useful to deliver tools w/ Intelliseek dataset
- Craig Macdonald: description of TREC dataset
100K feeds crawled every week over 11 weeks
didn't use ping servers
3 million permalinks
coverage not as broad as Intelliseek's, but a longer time period
cost: about 400 Euros (includes hard drive)
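Tom Lento's point about needing a shared vocabulary for describing datasets could be made concrete with something like the sketch below, which encodes the TREC collection stats given above. This is purely illustrative; the field names and the `DatasetDescriptor` type are assumptions, not any agreed-upon standard.

```python
from dataclasses import dataclass

@dataclass
class DatasetDescriptor:
    """A hypothetical vocabulary for describing a weblog dataset.

    Field names are illustrative only -- no standard exists yet.
    """
    name: str
    num_feeds: int        # distinct feeds crawled
    num_permalinks: int   # individual posts collected
    crawl_weeks: int      # duration of the crawl
    spam_included: bool   # TREC deliberately kept spam to test filtering

# The TREC weblog collection as described in the notes above:
trec = DatasetDescriptor(
    name="TREC weblog collection",
    num_feeds=100_000,
    num_permalinks=3_000_000,
    crawl_weeks=11,
    spam_included=True,
)
```

A shared schema like this would let groups compare collections (coverage, time span, spam policy) before deciding whether a given dataset suits their algorithm's properties.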
- Belle Tseng, NEC: a collaborative system for cleaning and annotating the data;
a very mature data-sharing platform
Next step: put action items together to make this happen.
Tagged as: weblogging2006