Vydiswaran, Vinod; Sarawagi, Sunita (2005) Learning to extract information from large websites using sequential models. In: Eleventh International Conference on Management of Data.
Full text not available from this repository.
Abstract
We propose a new method of information extraction from large websites by learning the sequence of links that lead to a specific goal page on the website. Sample applications include finding computer science publications starting from university root pages and fetching addresses of companies on a web database. We model the website as a graph on a set of important states chosen via domain knowledge and train a Conditional Random Field (CRF) over it. The conditional exponential models of CRFs enable us to exploit a variety of features including keywords and patterns extracted from and around hyperlinks and HTML pages and any sequential orderings amongst states. Our technique provides two times better harvest rates than techniques used in generic focused crawlers.
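The idea in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a hypothetical three-state site model (`Department`, `Faculty`, `Publications`), hand-set keyword and transition weights in place of learned CRF parameters, and anchor text as the only feature source. It scores a sequence of followed links with a linear-chain log-linear model and recovers the best state sequence by Viterbi decoding. The scores are left unnormalized, which does not change the best-scoring sequence. The `harvest_rate` helper uses the usual focused-crawling definition: the fraction of fetched pages that are goal pages.

```python
from typing import Dict, List, Tuple

# Hypothetical states for a university site; not taken from the paper.
STATES: List[str] = ["Department", "Faculty", "Publications"]

# Hypothetical hand-set weights; a real CRF would learn these from labelled paths.
KEYWORD_WEIGHTS: Dict[Tuple[str, str], float] = {
    ("Department", "department"): 2.0,
    ("Faculty", "faculty"): 2.0,
    ("Faculty", "people"): 1.5,
    ("Publications", "publications"): 2.5,
    ("Publications", "papers"): 2.0,
}
TRANSITION_WEIGHTS: Dict[Tuple[str, str], float] = {
    ("Department", "Faculty"): 1.0,
    ("Faculty", "Publications"): 1.0,
}


def state_score(state: str, anchor_text: str) -> float:
    """Sum the keyword weights that fire for this state on the anchor text."""
    words = anchor_text.lower().split()
    return sum(KEYWORD_WEIGHTS.get((state, w), 0.0) for w in words)


def viterbi(anchor_texts: List[str]) -> List[str]:
    """Return the highest-scoring state sequence for a path of links."""
    if not anchor_texts:
        return []
    best = {s: state_score(s, anchor_texts[0]) for s in STATES}
    back: List[Dict[str, str]] = []
    for text in anchor_texts[1:]:
        new_best: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for s in STATES:
            # Best predecessor of s under the transition weights.
            prev = max(
                STATES,
                key=lambda p: best[p] + TRANSITION_WEIGHTS.get((p, s), 0.0),
            )
            new_best[s] = (
                best[prev]
                + TRANSITION_WEIGHTS.get((prev, s), 0.0)
                + state_score(s, text)
            )
            pointers[s] = prev
        back.append(pointers)
        best = new_best
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: best[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def harvest_rate(goal_pages_fetched: int, total_pages_fetched: int) -> float:
    """Fraction of fetched pages that are goal pages."""
    if total_pages_fetched == 0:
        return 0.0
    return goal_pages_fetched / total_pages_fetched


labels = viterbi(
    ["Computer Science Department", "Faculty and Staff", "Selected Publications"]
)
```

In this toy setup, `labels` assigns one state to each followed link. A crawler built on such a model would prioritize links whose predicted state lies on a path toward the goal state, which is how a higher harvest rate than a generic focused crawler could arise.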
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Source: | Copyright of this article belongs to ResearchGate GmbH |
ID Code: | 128400 |
Deposited On: | 20 Oct 2022 05:29 |
Last Modified: | 14 Nov 2022 11:24 |