Enhanced topic distillation using text, markup tags and hyperlinks

Chakrabarti, Soumen; Joshi, Mukul; Tawde, Vivek (2001) Enhanced topic distillation using text, markup tags and hyperlinks. In: SIGIR '01 Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, Louisiana, USA.

Full text not available from this repository.

Official URL: http://dl.acm.org/citation.cfm?id=383990&dl=GUIDE&...

Abstract

Topic distillation is the analysis of hyperlink graph structure to identify mutually reinforcing authorities (popular pages) and hubs (comprehensive lists of links to authorities). Topic distillation is becoming common in Web search engines, but the best-known algorithms model the Web graph at a coarse grain with whole pages as single nodes. Such models may lose vital details in the markup tag structure of the pages, and thus lead to a tightly linked irrelevant subgraph winning over a relatively sparse relevant subgraph, a phenomenon called topic drift or contamination. The problem gets especially severe in the face of increasingly complex pages with navigation panels and advertisement links. We present an enhanced topic distillation algorithm which analyzes text, the markup tag trees that constitute HTML pages, and hyperlinks between pages. It thereby identifies subtrees which have high text- and hyperlink-based coherence w.r.t. the query. These subtrees get preferential treatment in the mutual reinforcement process. Using over 50 queries, 28 from earlier topic distillation work, we analyzed over 700,000 pages and obtained quantitative and anecdotal evidence that the new algorithm reduces topic drift.
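
The mutual reinforcement between hubs and authorities described in the abstract can be illustrated with a minimal sketch of the coarse-grained, whole-page baseline that the abstract contrasts itself against. This is not the enhanced algorithm from the paper: it ignores text and markup tag trees entirely. The function names, the iteration count, and the toy link graph are assumptions made purely for illustration.

```python
from math import sqrt


def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all are zero)."""
    norm = sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Page-level hub/authority iteration (baseline, not the paper's method).

    links: dict mapping each page to the list of pages it links to.
    Returns (hub_scores, authority_scores) as dicts keyed by page.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # A page's authority score is the sum of hub scores of pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # A page's hub score is the sum of authority scores of pages it links to.
        new_hub = {p: sum(new_auth[d] for d in links.get(p, [])) for p in pages}

        auth = normalize(new_auth)
        hub = normalize(new_hub)

    return hub, auth


# Hypothetical toy graph: two link lists pointing at three content pages.
TOY_LINKS = {
    "hub1": ["a", "b"],
    "hub2": ["a", "b", "c"],
    "a": [],
    "b": [],
    "c": [],
}

if __name__ == "__main__":
    hub_scores, auth_scores = hits(TOY_LINKS)
    for page in sorted(auth_scores, key=auth_scores.get, reverse=True):
        print(page, round(auth_scores[page], 3), round(hub_scores[page], 3))
```

Because every page is a single node here, a navigation panel or block of advertisement links contributes to the scores exactly like a topical link list does, which is the source of the topic drift the abstract describes. The paper's approach, per the abstract, instead gives preferential treatment to markup subtrees that are coherent with the query.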

Item Type: Conference or Workshop Item (Paper)
Source: Copyright of this article belongs to SIGIR '01 Proceedings of the 24th Annual International ACM SIGIR Conference.
ID Code: 100108
Deposited On: 12 Feb 2018 12:28
Last Modified: 12 Feb 2018 12:28
