Rough set based ensemble prediction for topic specific web crawling

Saha, S.; Murthy, C. A.; Pal, S. K. (2009) Rough set based ensemble prediction for topic specific web crawling. Proceedings of International Conference on Advances in Pattern Recognition 2009, pp. 153-156.

Full text not available from this repository.

Related URL: http://dx.doi.org/10.1109/ICAPR.2009.17

Abstract

The rapid growth of the World Wide Web has made the discovery of useful resources an important problem in recent years. Several techniques, such as focused crawling and intelligent crawling, have recently been proposed for topic specific resource discovery. All of these crawlers exploit the behavior of hypertext features to perform topic specific resource discovery. A focused crawler uses the relevance score of a crawled page to score the unvisited URLs extracted from it. The scored URLs are added to the frontier, and the crawler then picks the best-scoring URL to crawl next. Focused crawlers rely on different types of features of the crawled pages to keep the crawling scope within the desired domain; these features are obtained from the URL, anchor text, link structure, and text contents of the parent and ancestor pages. Different focused crawling algorithms use different sets of these features to predict the relevance and quality of unvisited Web pages. This article proposes a combined method based on rough set theory. It combines the available predictions using decision rules and can build much larger domain-specific collections with less noise. Our experiment shows a better harvest rate and better target recall for focused crawling.
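
The generic focused-crawling loop described in the abstract (score unvisited URLs by the relevance of the page they were found on, add them to the frontier, crawl the best one next) can be sketched as follows. This is only a minimal illustration of best-first crawling, not the rough set based ensemble method of the paper. The names `Frontier`, `crawl`, `harvest_rate`, `fetch_links`, and `relevance`, the seed score of 1.0, and the 0.5 relevance threshold are all assumptions made for the example. Harvest rate is computed here under its common definition, the fraction of crawled pages that are relevant.

```python
import heapq
from typing import Callable, Iterable, List, Set, Tuple


class Frontier:
    """Priority queue of unvisited URLs; the highest score is popped first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str]] = []
        self._seen: Set[str] = set()
        self._counter = 0  # tie-breaker: equal scores pop in insertion order

    def add(self, url: str, score: float) -> None:
        if url in self._seen:  # never queue the same URL twice
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the best first.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop_best(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def crawl(
    seeds: Iterable[str],
    fetch_links: Callable[[str], Iterable[str]],  # hypothetical: out-links of a page
    relevance: Callable[[str], float],            # hypothetical: relevance of a crawled page
    max_pages: int,
) -> List[str]:
    """Best-first crawl: each out-link inherits its parent's relevance score."""
    frontier = Frontier()
    for seed in seeds:
        frontier.add(seed, 1.0)  # assumed: seeds start with the top score
    visited: List[str] = []
    while len(frontier) > 0 and len(visited) < max_pages:
        url = frontier.pop_best()
        visited.append(url)
        parent_score = relevance(url)
        for link in fetch_links(url):
            frontier.add(link, parent_score)
    return visited


def harvest_rate(
    visited: List[str],
    relevance: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off for calling a page relevant
) -> float:
    """Fraction of crawled pages whose relevance reaches the threshold."""
    if not visited:
        return 0.0
    relevant = sum(1 for url in visited if relevance(url) >= threshold)
    return relevant / len(visited)
```

In this sketch the only feature used to score an unvisited URL is the relevance of its parent page. A crawler of the kind the abstract describes would replace that single score with predictions drawn from the URL, anchor text, link structure, and ancestor pages.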

Item Type: Article
Source: Copyright of this article belongs to Proceedings of International Conference on Advances in Pattern Recognition 2009, Kolkata, India.
ID Code: 77744
Deposited On: 14 Jan 2012 12:10
Last Modified: 14 Jan 2012 12:10
