Learning to extract information from large websites using sequential models

Vydiswaran, Vinod; Sarawagi, Sunita (2005) Learning to extract information from large websites using sequential models. In: Eleventh International Conference on Management of Data.

Full text not available from this repository.

Abstract

We propose a new method of information extraction from large websites by learning the sequence of links that lead to a specific goal page on the website. Sample applications include finding computer science publications starting from university root pages and fetching addresses of companies on a web database. We model the website as a graph on a set of important states chosen via domain knowledge and train a Conditional Random Field (CRF) over it. The conditional exponential models of CRFs enable us to exploit a variety of features including keywords and patterns extracted from and around hyperlinks and HTML pages and any sequential orderings amongst states. Our technique provides two times better harvest rates than techniques used in generic focused crawlers.
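The idea of scoring a sequence of links against a set of page states can be illustrated with a small sketch. It is not the authors' implementation: the state names, the keyword and transition weights, and the functions `emit_score` and `viterbi` are all assumptions made up for illustration, and the weights are set by hand rather than learned by CRF training. The sketch only shows the decoding side, where each anchor text along a path is assigned the state sequence with the highest total log-linear score.

```python
# Hypothetical states for a "find publications" task. The abstract does not
# list the actual state set, so these names are assumptions.
STATES = ["department", "faculty_list", "faculty_home", "publications"]

# Hypothetical, hand-set weights for (state, anchor-text keyword) features.
# In a trained CRF these would be learned from labeled paths.
EMIT = {
    ("department", "departments"): 2.0,
    ("faculty_list", "faculty"): 2.0,
    ("faculty_list", "people"): 2.0,
    ("faculty_home", "prof"): 1.5,
    ("publications", "publications"): 3.0,
    ("publications", "papers"): 2.5,
}

# Hypothetical weights rewarding the expected ordering between states.
TRANS = {
    ("department", "faculty_list"): 1.0,
    ("faculty_list", "faculty_home"): 1.0,
    ("faculty_home", "publications"): 1.0,
}


def emit_score(state, anchor_text):
    """Sum the keyword weights that fire for this state on one anchor text."""
    words = anchor_text.lower().split()
    return sum(EMIT.get((state, w), 0.0) for w in words)


def viterbi(anchors):
    """Return (score, state_path) of the best labeling of a non-empty link path."""
    # best[s] holds the best score and path ending in state s so far.
    best = {s: (emit_score(s, anchors[0]), [s]) for s in STATES}
    for text in anchors[1:]:
        new_best = {}
        for s in STATES:
            # Pick the predecessor state that maximizes score plus transition weight.
            prev_score, prev_path = max(
                ((best[p][0] + TRANS.get((p, s), 0.0), best[p][1]) for p in STATES),
                key=lambda pair: pair[0],
            )
            new_best[s] = (prev_score + emit_score(s, text), prev_path + [s])
        best = new_best
    return max(best.values(), key=lambda pair: pair[0])


if __name__ == "__main__":
    # Anchor texts of the links followed from a (made-up) university root page.
    path_anchors = ["Departments", "Faculty and people", "Prof Smith", "Selected publications"]
    score, labels = viterbi(path_anchors)
    print(score, labels)
```

A crawler built along these lines could follow only those links whose best labeling moves toward the goal state, which is one way to read the harvest-rate comparison in the abstract; the abstract itself does not specify how the model is used at crawl time.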

Item Type:	Conference or Workshop Item (Paper)
Source:	Copyright of this article belongs to ResearchGate GmbH
ID Code:	128400
Deposited On:	20 Oct 2022 05:29
Last Modified:	14 Nov 2022 11:24
