Integrating the document object model with hyperlinks for enhanced topic distillation and information extraction

Chakrabarti, Soumen (2001) Integrating the document object model with hyperlinks for enhanced topic distillation and information extraction. In: WWW '01 Proceedings of the 10th International Conference on World Wide Web, May 01 - 05, 2001, Hong Kong.

Official URL: http://dl.acm.org/citation.cfm?id=372054

Abstract

Topic distillation is the process of finding authoritative Web pages and comprehensive “hubs” which reciprocally endorse each other and are relevant to a given query. Hyperlink-based topic distillation has traditionally been applied to a macroscopic Web model where documents are nodes in a directed graph and hyperlinks are edges. Macroscopic models miss valuable clues such as banners, navigation panels, and template-based inclusions, which are embedded in HTML pages using markup tags. Consequently, results of macroscopic distillation algorithms have been deteriorating in quality as Web pages become more complex. We propose a uniform fine-grained model for the Web in which pages are represented by their tag trees (also called their Document Object Models or DOMs) and these DOM trees are interconnected by ordinary hyperlinks. Surprisingly, macroscopic distillation algorithms do not work in the fine-grained scenario. We present a new algorithm suitable for the fine-grained model. It can disaggregate hubs into coherent regions by segmenting their DOM trees. Mutual endorsement between hubs and authorities involves these regions, rather than single nodes representing complete hubs. Anecdotes and measurements using a 28-query, 366,000-document benchmark suite, used in earlier topic distillation research, reveal two benefits from the new algorithm: distillation quality improves, and a by-product of distillation is the ability to extract relevant snippets from hubs that are only partially relevant to the query.
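
For readers unfamiliar with the macroscopic model the abstract refers to, the following is a minimal sketch of the classic page-level hub/authority iteration (in the style of Kleinberg's HITS), where whole documents are graph nodes and hyperlinks are edges. It is not the fine-grained algorithm proposed in the paper; the function name `hits`, the toy page names, and the fixed iteration count are assumptions made purely for illustration.

```python
import math
from collections import defaultdict


def hits(edges, iterations=50):
    """Page-level hub/authority scores by repeated mutual reinforcement.

    `edges` is an iterable of (source_page, target_page) hyperlinks.
    Illustrative baseline only; this is NOT the paper's DOM-based method.
    """
    nodes = sorted({page for edge in edges for page in edge})
    out_links = defaultdict(list)
    in_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)
        in_links[dst].append(src)

    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    def normalize(scores):
        # Scale to unit Euclidean length so scores stay bounded.
        norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
        return {n: v / norm for n, v in scores.items()}

    for _ in range(iterations):
        # A page is a good authority if good hubs point to it.
        auth = normalize({n: sum(hub[s] for s in in_links[n]) for n in nodes})
        # A page is a good hub if it points to good authorities.
        hub = normalize({n: sum(auth[d] for d in out_links[n]) for n in nodes})
    return hub, auth


# Hypothetical toy graph: three hub pages linking to two candidate authorities.
toy_edges = [
    ("hub1", "a"), ("hub1", "b"),
    ("hub2", "a"), ("hub2", "b"),
    ("hub3", "b"),
]
hub_scores, auth_scores = hits(toy_edges)
```

In this sketch each hub page carries a single score, so a page that mixes relevant and irrelevant link lists endorses all of its targets equally. That is the limitation the abstract attributes to macroscopic models and that segmenting hubs into DOM regions is meant to address.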

Item Type: Conference or Workshop Item (Paper)
Source: Copyright of this article belongs to WWW '01 Proceedings of the 10th International Conference, Association for Computing Machinery.
Keywords: Topic Distillation; Document Object Model; Segmentation; Minimum Description Length Principle
ID Code: 100110
Deposited On: 12 Feb 2018 12:28
Last Modified: 12 Feb 2018 12:28
