WWE 2006 | 3rd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics

Kazuyuki Narisawa, Yasuhiro Yamada, Daisuke Ikeda and Masayuki Takeda

This paper addresses the problem of detecting blog spam, that is, unsolicited messages posted on blog sites, among blog entries. Unlike e-mail spam, typical blog spam is produced to increase the PageRank of the spammer's Web sites, so many copies of it are needed and every copy contains URLs of those sites. The number of copies, which we call the frequency, therefore seems to be a good key for finding this type of blog spam. The frequency alone is not sufficient, however, for algorithms that flag an entry as blog spam when its frequency exceeds some threshold value. It is very difficult to collect Web pages that include all copies of a blog entry, so the input data may contain only a few copies of the entry, fewer than the predefined threshold, and a frequency-based spam detection algorithm then fails to detect it. Instead of a frequency-based approach, we propose a spam detection method based on the vocabulary size, which is the number of substrings that have the same frequency. The proposed method exploits the fact that the vocabulary size of substrings in normal blog entries follows Zipf's distribution, whereas the vocabulary size in blog spam does not. We demonstrate its effectiveness through experiments on both artificial data and Web data collected from actual blog entries. The experiments on Web data show that the proposed method can detect blog spam even when its frequency is not large, and that it simultaneously finds all blog spam entries that have copies among the given blog entries. In one case, a blog spam entry written in Chinese, apparently an advertisement for Chinese movies, was found on an English blog site. This result shows that the proposed method is independent of the language. We also show the scalability of the proposed method with respect to input size, using a very large amount of text data.
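As a rough illustration of the vocabulary size defined above, the following Python sketch counts every substring of a text naively and then tallies how many distinct substrings share each frequency. The function names and the `max_len` cutoff are hypothetical, introduced only for this illustration. The quadratic enumeration is an assumption made for brevity, not the algorithm of the paper, and it would not scale to the input sizes the abstract mentions.

```python
from collections import Counter


def substring_frequencies(text, max_len=None):
    """Count the (possibly overlapping) occurrences of every substring.

    `max_len` is a hypothetical cutoff on substring length, added only to
    keep this naive enumeration manageable on longer inputs.
    """
    n = len(text)
    if max_len is None:
        max_len = n
    counts = Counter()
    for start in range(n):
        # Enumerate every substring that begins at `start`.
        for end in range(start + 1, min(n, start + max_len) + 1):
            counts[text[start:end]] += 1
    return counts


def vocabulary_sizes(text, max_len=None):
    """Map each frequency f to the number of distinct substrings that
    occur exactly f times (the vocabulary size at frequency f)."""
    return Counter(substring_frequencies(text, max_len).values())
```

Under these assumptions, a block of text that is copied k times contributes all of its substrings to the same frequency k, so `vocabulary_sizes` would show an unusually large count at that frequency compared with the smoothly decaying counts expected from ordinary text. How the deviation from Zipf's distribution is actually measured is not specified in the abstract and is left out of this sketch.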