Abstract
Link-based ranking algorithms play a crucial role in today's Internet search engines. These algorithms rest on the "link as vote" assumption: a hyperlink from one page to another is treated as an endorsement of the target page. After more than a decade of commercial development of the Internet, this assumption no longer holds universally. Pages no longer simply "vote" for one another; many links serve other purposes, and we call these noisy links. Their presence reduces the accuracy of link-based ranking algorithms, and identifying and handling them has become an active research topic internationally.
Based on the distribution characteristics of noisy links, this paper proposes a method that identifies and filters noisy links automatically using link relationships alone. Detailed experiments on real-world data sets show that the method identifies and filters noisy links effectively and improves the accuracy of link-based ranking: P@20 (the number of relevant results among the top 20 ranked results) rises from an average of 11.8 to 16.4.
We then apply the method to the study of Web spam. Experiments on publicly available data sets from abroad show that it filters out the majority of spam sites and is very competitive with several well-known algorithms. This supports the feasibility of applying noisy-link identification and filtering to Web spam research.
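The P@20 measure quoted in the abstract is defined there as the number of relevant results among the top 20, averaged over queries. The following is a minimal sketch of that computation, not the authors' evaluation code; the function names, the use of string identifiers for results, and the representation of relevance judgments as a set are all assumptions made for illustration.

```python
from typing import Sequence, Set, Tuple


def relevant_in_top_k(ranked: Sequence[str], relevant: Set[str], k: int = 20) -> int:
    """Count how many of the first k ranked results are judged relevant.

    `ranked` is a hypothetical list of result identifiers in rank order;
    `relevant` is a hypothetical set of identifiers judged relevant.
    """
    return sum(1 for doc in ranked[:k] if doc in relevant)


def mean_relevant_in_top_k(
    runs: Sequence[Tuple[Sequence[str], Set[str]]], k: int = 20
) -> float:
    """Average the top-k relevant count over several queries."""
    if not runs:
        raise ValueError("at least one query is required")
    counts = [relevant_in_top_k(ranked, relevant, k) for ranked, relevant in runs]
    return sum(counts) / len(counts)
```

Under this definition the score for one query ranges from 0 to 20, so the reported averages of 11.8 and 16.4 are counts out of 20 rather than fractions.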
Full paper: