Enhanced hypertext categorization using hyperlinks

Chakrabarti, Soumen; Dom, Byron; Indyk, Piotr (1998) Enhanced hypertext categorization using hyperlinks. ACM SIGMOD Record, 27 (2). pp. 307-318. ISSN 0163-5808

Official URL: http://doi.org/10.1145/276305.276332

Related URL: http://dx.doi.org/10.1145/276305.276332

Abstract

A major challenge in indexing unstructured hypertext databases is to automatically extract meta-data that enables structured search using topic taxonomies, circumvents keyword ambiguity, and improves the quality of search and profile-based routing and filtering. Therefore, an accurate classifier is an essential component of a hypertext database. Hyperlinks pose new problems not addressed in the extensive text classification literature. Links clearly contain high-quality semantic clues that are lost upon a purely term-based classifier, but exploiting link information is non-trivial because it is noisy. Naive use of terms in the link neighborhood of a document can even degrade accuracy. Our contribution is to propose robust statistical models and a relaxation labeling technique for better classification by exploiting link information in a small neighborhood around documents. Our technique also adapts gracefully to the fraction of neighboring documents having known topics. We experimented with pre-classified samples from Yahoo! and the US Patent Database. In previous work, we developed a text classifier that misclassified only 13% of the documents in the well-known Reuters benchmark; this was comparable to the best results ever obtained. This classifier misclassified 36% of the patents, indicating that classifying hypertext can be more difficult than classifying text. Naively using terms in neighboring documents increased error to 38%; our hypertext classifier reduced it to 21%. Results with the Yahoo! sample were more dramatic: the text classifier showed 68% error, whereas our hypertext classifier reduced this to only 21%.

Item Type: Article
Source: Copyright of this article belongs to Association for Computing Machinery
ID Code: 131005
Deposited On: 02 Dec 2022 06:03
Last Modified: 02 Dec 2022 06:03
