Tuesday, May 23, 2006

Discussion Session

Led by Matthew Hurst

Matt proposed three main topics for discussion:
- Weblog data sets
- Data quality & spam
- Weblog communities

The discussion focused on data set issues.

- Tom Lento, Cornell Univ.: we need a set of characteristics/vocabulary for describing datasets
- Iadh Ounis, Univ. of Glasgow: weblog data set initiative at TREC;
decided to include spam in order to test spam filtering;
TREC provides uniform framework to compare research from different groups;
adapt data set if necessary
- Matt Hurst, Nielsen BuzzMetrics: standard datasets are key
- Natalie Glance, Nielsen BuzzMetrics: applauds initiative of TREC dataset; what is being learned from it
- Matt Hurst: application-specific requirements
- Andrew Tomkins, Yahoo Research: next question to think about: what are the requirements? Two main sets of research areas that require datasets:
(1) basic science: what's the nature of social networks; evolution of friends, etc.
(2) datasets to help the broader world use what is in blogs; the most important problem will be IR; requires dataset w/relevance judgments; don't even understand what the appropriate level of granularity is (post vs. weblog); could then drive significant end user value.
- Rawal Jatz (?): not a single dataset is sufficient for testing an algorithm, because some algorithms work better on different types of data sets (e.g. sparseness)
- Tom Lento: web retrieval data set & algorithms at Cornell University; algorithms w/different properties tested over data sets with different properties
- Chris Brooks, UCSF: would be useful to deliver tools w/ Intelliseek dataset

- Craig Macdonald: description of TREC dataset
100K feeds crawled every week over 11 weeks
didn't use ping servers
3 million permalinks
not as wide-covering as Intelliseek, but longer time period
cost: about 400 Euros (includes hard drive)

- Belle Tseng, NEC: collaborative system for cleaning and annotating the data
very mature data sharing platform
Get action items together to do this.



Browsing System for Weblog Articles based on Automated Folksonomy

Tsutomu Ohkura, Yoji Kiyota and Hiroshi Nakagawa

Folksonomy is a new manual classification scheme based on users' tagging efforts with freely chosen keywords. In folksonomy, a user puts an item (e.g., a photo or a bookmark) on a server and shares it with other users. The owner and even the other users can attach tags to this item for their own classification, and the tags reflect many people's viewpoints. Since tags are chosen from users' vocabulary and contain many people's viewpoints, classification results are easy for ordinary users to understand. As a result, folksonomy serves as an efficient browsing method, because users can grasp the essence of items by looking at the tags. Even though folksonomy scales much better than other manual classification schemes, it cannot deal with a tremendous number of items, such as all the weblog articles on the Internet.

For the purpose of solving this problem, we try to automate folksonomy to enhance weblog browsing. We create a "tagger" which is a program to determine whether a particular tag should be attached to an item. In addition, we propose a method to create a candidate tag set, which is a list of tags that may be attached to items, from weblog category names. We achieved around 95% precision compared to a candidate tag set created manually.



Decomposing Bloggers' Moods: Towards a Time Series Analysis of Moods in the Blogosphere

Krisztian Balog and Maarten de Rijke

Using a total of 20 million mood-annotated blog posts harvested between June 2005 and March 2006, we provide a time series analysis of the number of blog posts annotated with a mood. State-space methods are used to determine decompositions of the time series data associated with bloggers' moods (either individual or aggregated), allowing us to look for patterns of trend, seasonality and cycle.

Our analysis reveals a broad spectrum of phenomena: (i) there is a clear overall decline in the usage of mood annotations; (ii) weather phenomena and holidays have a clear impact on the profile of some moods; (iii) looking at the relative counts, we observe that some moods are stationary, while others decline or climb; and (iv) several moods display changes in their cyclical or seasonal component during the period covered by our data.
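The abstract above does not spell out its state-space models, so the following is only a rough sketch of what separating trend from seasonality means, using a swapped-in and much simpler technique: a classical additive decomposition with a centered moving average. The function names, the weekly period of 7, and the toy series are all assumptions made for illustration, not anything taken from the paper.

```python
def moving_average_trend(series, period):
    """Centered moving average over `period` points (period assumed odd).

    Positions where the window does not fit are left as None.
    """
    half = period // 2
    trend = [None] * len(series)
    for t in range(half, len(series) - half):
        window = series[t - half : t + half + 1]
        trend[t] = sum(window) / period
    return trend


def seasonal_component(series, trend, period):
    """Average detrended value per position in the cycle, centered on zero."""
    sums = [0.0] * period
    counts = [0] * period
    for t, (value, level) in enumerate(zip(series, trend)):
        if level is not None:
            sums[t % period] += value - level
            counts[t % period] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    centre = sum(means) / period
    return [m - centre for m in means]


# Toy daily counts for one hypothetical mood: a steady decline plus a weekly bump.
weekly_pattern = [3, -1, -1, -1, 0, 0, 0]
counts = [100 - day + weekly_pattern[day % 7] for day in range(28)]

trend = moving_average_trend(counts, period=7)
seasonal = seasonal_component(counts, trend, period=7)
```

On this toy series the moving average recovers the declining level and the per-weekday averages recover the weekly bump. A state-space approach, as used in the paper, would instead estimate such components jointly and allow them to change over time.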



Extracting Topics From Weblogs Through Frequency Segments

Mizuki Oka, Hirotake Abe and Kazuhiko Kato

Abstract: In this paper, we present an approach to extracting topics from weblogs by using terms that appear in them. We characterize each term by its frequency segments, i.e., sequential occurrences of the term over time. A notable feature of the model is its approximation of changes in the dynamics of term frequencies; it captures the granularity of frequencies from the very beginning of their occurrence. This approximation also makes a comparison of frequency patterns of terms more effective. We report on the results obtained from weblogs that contained an event of global significance, i.e., the London bombings of 2005.
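The abstract does not define frequency segments beyond "sequential occurrences of the term over time". One possible reading, offered here purely as an assumption, is that a segment is a maximal run of consecutive days on which the term appears at least once. The sketch below implements that reading only; the function name and the daily granularity are hypothetical.

```python
def frequency_segments(daily_counts):
    """Split per-day counts of one term into maximal runs of days with count > 0.

    Returns a list of (start_day_index, counts_in_run) pairs.
    This is one assumed reading of "frequency segment", not the paper's definition.
    """
    segments = []
    current_start = None
    for day, count in enumerate(daily_counts):
        if count > 0:
            if current_start is None:
                current_start = day  # a new run begins here
        elif current_start is not None:
            segments.append((current_start, daily_counts[current_start:day]))
            current_start = None
    if current_start is not None:  # the series ended in the middle of a run
        segments.append((current_start, daily_counts[current_start:]))
    return segments


# Made-up counts for a single term over eight days.
example = frequency_segments([0, 2, 5, 0, 0, 1, 3, 4])
```

Under this reading, two terms could then be compared by the shape of their segments (start day, length, counts) rather than by raw totals.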



BLOGRANGER - A Multi-Faceted Blog Search Engine

Ko Fujimura, Hiroyuki Toda, Takafumi Inoue, Nobuaki Hiroshima, Ryoji Kataoka and Masayuki Sugizaki

Topics mentioned in blogspace are biased towards interesting/funny or entertainment-related topics compared to articles in the generic web space and there are many personal opinions on products or services. Making good use of these characteristics, we introduce a new blog search engine that provides multiple interfaces, each targeted at a different goal, e.g., topic search, blogger search, and reputation search. To evaluate the effectiveness of the system, we conducted a user survey and collected 2191 answers. For the specific search conducted, BLOGRANGER was seen to be superior to general web search by the ratio of 2 to 1.



Collaborative Blog Spam Filtering Using Adaptive Percolation Search

Seungyeop Han, Yong-yeol Ahn, Sue Moon and Hawoong Jeong

We propose a novel collaborative filtering method for link spam on blogs. The key idea is to rely on manual identification of spam and to share this information through a network of trust. The blogger who has identified a spam tells a small number of fellow bloggers (content implantation), and those who have not heard about it start a search using an adaptive percolation search. By combining this search with content implantation, they obtain the information about identified spam in only a fraction of the query period without producing a large volume of traffic.



Detecting Blog Spams using the Vocabulary Size of All Substrings in Their Copies

Kazuyuki Narisawa, Yasuhiro Yamada, Daisuke Ikeda and Masayuki Takeda

This paper addresses the problem of detecting blog spams, which are unsolicited messages on blog sites, among blog entries. Unlike spam mail, a typical blog spam is produced to increase the PageRank of the spammer's Web sites, so many copies of the blog spam are necessary and all of them contain URLs of those sites. The number of copies, which we call the frequency, therefore seems to be a good key for finding this type of blog spam. The frequency is not, however, sufficient for detection algorithms that flag an entry as blog spam when its frequency exceeds some threshold value: it is very difficult to collect Web pages that include all copies of a blog entry, so the input data may contain only a few copies of the entry, fewer than the predefined threshold, and a frequency-based spam detection algorithm then fails to detect it. Instead of frequency-based approaches, we propose a spam detection method based on the vocabulary size, which is the number of substrings whose frequencies are the same. The proposed method utilizes the fact that the vocabulary size of substrings in normal blog entries follows Zipf's distribution but the vocabulary size in blog spams does not. We show its effectiveness by experiments, using both artificial data and Web data collected from actual blog entries. Experiments using Web data show that the proposed method can detect a blog spam even if its frequency is not large, and that the method finds all blog spams with some copies simultaneously in given blog entries. A blog spam written in Chinese, which seems to be an advertisement for Chinese movies, was found on an English blog site. This result shows that the proposed method is independent of the language. We also show the scalability of the proposed method with respect to input size using a very large body of text data.
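To make the "vocabulary size" quantity concrete, here is a minimal sketch that computes, for each frequency, how many distinct substrings occur exactly that many times. It enumerates substrings naively in quadratic time, which is only practical for short inputs and is certainly not the data structure the authors use at scale. Counting overlapping occurrences, the optional `max_len` cap, and the function names are assumptions made for this example, and the comparison against Zipf's distribution is not shown.

```python
from collections import Counter


def substring_frequencies(text, max_len=None):
    """Count occurrences (overlaps included) of every substring of `text`.

    `max_len` optionally caps the substring length to keep the count small.
    """
    n = len(text)
    counts = Counter()
    for start in range(n):
        end = n if max_len is None else min(n, start + max_len)
        for stop in range(start + 1, end + 1):
            counts[text[start:stop]] += 1
    return counts


def vocabulary_sizes(text, max_len=None):
    """Map each frequency f to the number of distinct substrings occurring f times."""
    return Counter(substring_frequencies(text, max_len).values())


sizes = vocabulary_sizes("abab")
```

For the string "abab", three distinct substrings ("a", "b", "ab") occur twice and four ("ba", "aba", "bab", "abab") occur once. Text that contains several copies of the same long passage would add many distinct long substrings that all share the copy count as their frequency, which is the kind of deviation the abstract says the method looks for.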



Characterizing the Splogosphere

Pranam Kolari, Akshay Java and Tim Finin

Weblogs or blogs collectively constitute the Blogosphere, forming an influential and interesting subset of the Web. As with most Internet-enabled applications, the ease of content creation and distribution makes the blogosphere spam prone. Spam blogs or splogs are blogs hosting spam posts, created using machine generated or hijacked content for the sole purpose of hosting ads or raising the PageRank of target sites. These splogs make up the splogosphere, and are now inundating blog search engines and update ping servers. In this work we characterize splogs by comparing them against authentic blogs. Our analysis is based on a dataset made publicly available by BlogPulse, and employs a machine learning model that detects splogs with an accuracy of 90%. To round off this analysis and to better understand splogs, we also present our study of a popular blog update ping server, and show how such servers are overwhelmed by pings sent by splogs. This overall study will facilitate finding effective new techniques to detect and weed out splogs from the blogosphere.



Discovery of Blog Communities Based on Mutual Awareness

Yu-Ru Lin, Hari Sundaram, Yun Chi, Jun Tatemura and Belle Tseng

Blogs have many fast growing communities on the Internet. Discovering such communities in the blogosphere is important for sustaining and encouraging new blogger participation. We focus on extracting communities based on two key insights - (a) communities form due to individual blogger actions that are mutually observable; (b) semantics of the hyperlink structure are different from traditional web analysis problems. Our approach involves developing computational models for mutual awareness that incorporate the specific action type, frequency and time of occurrence. We use the mutual awareness feature with a ranking-based community extraction algorithm to discover communities. To validate our approach, four performance measures are used on the WWW2006 Blog Workshop dataset and the NEC focused blog dataset with excellent quantitative results. The extracted communities also prove to be semantically cohesive with respect to their topics of interest.



Experiments on Persian Weblogs

Kyumars Sheykh Esmaili, Mohsen Jamali, Mahmood Neshati, Hassan Abolhassani and Yasaman Soltan-Zadeh

Nowadays users of the Web are encouraged to generate content on the Web by themselves. In fact weblogs are one kind of social network and they are one of the most important components of Web 2.0. There are a lot of Persian bloggers on the Web. In this paper we have tried to collect their blogs and produce some general statistics about them, and have prepared a test bed for further research on weblogs in general and Persian blogs in particular.



Blogs During the London Attacks: Top Information Sources and Topics

Mike Thelwall

Blogs are probably most associated with the high profile postings of a few highly popular bloggers who debate or comment on major news stories, but for each 'A-lister' there are numerous faceless bloggers who write about their own daily lives and/or interests. Hence it is interesting to investigate the extent to which an event with extensive media coverage, such as the London attacks, is reflected in blogspace as a whole. This paper reports a descriptive analysis of blog postings around the London attacks of July 7, 2005. The core of this study is the development of methods to identify and report on bloggers’ activities in a way that is not dominated by prolific bloggers or repetitive blog postings. We report daily trends for the top links and topics for three sets of data: all bloggers’ postings; the postings of bloggers who mentioned London at least once; and the blog postings mentioning London. Although only 5% of active bloggers ever mentioned London by name, the attacks appeared to be the most significant event in blogspace during the two weeks after the initial bombings. Bloggers who posted about London were found to be atypical, linking and posting much more frequently than general bloggers. The results suggest a dichotomy between externally-focused, news-aware approximately daily bloggers and internally-focused diary-like approximately weekly bloggers.
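The abstract does not describe how the study keeps prolific bloggers and repeated postings from dominating the daily trends. One simple way to get that property, shown here only as an assumed illustration, is to count each blogger at most once per link per day, so that a link's daily score is the number of distinct bloggers who posted it. The record layout `(day, blogger, links)` and the function name are hypothetical.

```python
from collections import defaultdict


def daily_link_blogger_counts(posts):
    """Count distinct bloggers per link per day.

    `posts` is an iterable of (day, blogger, links) records, where `links`
    is the list of URLs in one posting. Repeat postings of the same link by
    the same blogger on the same day are counted once.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for day, blogger, links in posts:
        for link in links:
            seen[day][link].add(blogger)
    return {
        day: {link: len(bloggers) for link, bloggers in by_link.items()}
        for day, by_link in seen.items()
    }


# Made-up records: one prolific blogger repeats a link three times in a day.
sample_posts = [
    ("2005-07-07", "prolific", ["http://example.org/a"]),
    ("2005-07-07", "prolific", ["http://example.org/a"]),
    ("2005-07-07", "prolific", ["http://example.org/a"]),
    ("2005-07-07", "casual1", ["http://example.org/b"]),
    ("2005-07-07", "casual2", ["http://example.org/b"]),
]
link_counts = daily_link_blogger_counts(sample_posts)
```

In the sample, the link repeated three times by one blogger scores 1, while the link posted once each by two different bloggers scores 2, so ranking by distinct bloggers puts the second link first.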



The Ties that Blog: Examining the Relationship Between Social Ties and Continued Participation in the Wallop Weblogging System

Thomas Lento, Howard T. Welser, Lei Gu and Marc Smith

Are people who remain active as webloggers more socially connected to other users? How are the number and nature of social ties related to people's willingness to continue contributing content to a weblog? This study uses longitudinal data taken from Wallop, a weblogging system developed by Microsoft Research, to explore patterns of user activity. In its year-long operation Wallop hosted a naturally occurring opportunity for cultural comparison, as it developed a majority Chinese-language user population (despite the English-language focus of the system). This allows us to consider whether or not language communities have different social network characteristics that vary across activity levels. Logistic regression models and network visualizations reveal two key findings. The first is that not all ties are equal. Although a count of incoming comments appears to be a significant predictor of retention, it loses its predictive strength when strong ties created by repeated, reciprocal interaction and ties from other dedicated webloggers are considered. Second, the higher rate of retention among the Chinese-language users is partly explained by that population's greater ability to draw in participants with pre-existing social ties. We conclude with considerations for weblogs and directions for future research.



Leave a Reply: An Analysis of Weblog Comments

Gilad Mishne and Natalie Glance

Abstract: Access to weblogs, both through commercial services and in academic studies, is usually limited to the content of the weblog posts. This overlooks an important aspect distinguishing weblogs from other web pages: the ability of weblog readers to respond to posts directly, by posting comments. In this paper we present a large-scale study of weblog comments and their relation to the posts. Using a sizable corpus of comments, we estimate the overall volume of comments in the blogosphere; analyze the relation between the weblog popularity and commenting patterns in it; and measure the contribution of comment content to various aspects of weblog access.



Friday, May 19, 2006

See you in Edinburgh!

Only four more days until we meet in Edinburgh for the 3rd Annual Workshop on the Weblogging Ecosystem!

Our workshop starts at 10:30 a.m. on Tuesday, May 23rd, after the opening plenary session.

We do not yet know if the workshop will be held at the Edinburgh International Conference Centre, which is the conference venue, or at the National e-Science Centre. If it ends up being held at the National e-Science Centre, please keep in mind that the e-Science Centre is a 15-minute walk (or 5-minute taxi ride) from the Conference Centre and plan accordingly. There will be a special fast-track coffee station at the Strathblane Hall on Level 0 in the Conference Centre for workshop participants.

UPDATE: According to the dynamic program, it looks like we're in the Ochil rooms (b and c), which are at the EICC (the main venue). Please double-check once you get in. Looking forward to seeing everyone!
