Chakrabarti, Soumen; Dom, Byron; Indyk, Piotr (1998) Enhanced hypertext categorization using hyperlinks. ACM SIGMOD Record, 27 (2). pp. 307-318. ISSN 0163-5808
Official URL: http://doi.org/10.1145/276305.276332
Related URL: http://dx.doi.org/10.1145/276305.276332
Abstract
A major challenge in indexing unstructured hypertext databases is to automatically extract meta-data that enables structured search using topic taxonomies, circumvents keyword ambiguity, and improves the quality of search and profile-based routing and filtering. Therefore, an accurate classifier is an essential component of a hypertext database. Hyperlinks pose new problems not addressed in the extensive text classification literature. Links clearly contain high-quality semantic clues that are lost upon a purely term-based classifier, but exploiting link information is non-trivial because it is noisy. Naive use of terms in the link neighborhood of a document can even degrade accuracy. Our contribution is to propose robust statistical models and a relaxation labeling technique for better classification by exploiting link information in a small neighborhood around documents. Our technique also adapts gracefully to the fraction of neighboring documents having known topics. We experimented with pre-classified samples from Yahoo! and the US Patent Database. In previous work, we developed a text classifier that misclassified only 13% of the documents in the well-known Reuters benchmark; this was comparable to the best results ever obtained. This classifier misclassified 36% of the patents, indicating that classifying hypertext can be more difficult than classifying text. Naively using terms in neighboring documents increased error to 38%; our hypertext classifier reduced it to 21%. Results with the Yahoo! sample were more dramatic: the text classifier showed 68% error, whereas our hypertext classifier reduced this to only 21%.
Item Type: Article
Source: Copyright of this article belongs to Association for Computing Machinery
ID Code: 131005
Deposited On: 02 Dec 2022 06:03
Last Modified: 02 Dec 2022 06:03