Information Extraction

Sarawagi, Sunita (2007) Information Extraction. Foundations and Trends in Databases, 1 (3). pp. 261-377. ISSN 1931-7883

Official URL: http://doi.org/10.1561/1900000003

Related URL: http://dx.doi.org/10.1561/1900000003

Abstract

The automatic extraction of information from unstructured sources has opened up new avenues for querying, organizing, and analyzing data by drawing upon the clean semantics of structured databases and the abundance of unstructured data. The field of information extraction has its genesis in the natural language processing community, where the primary impetus came from competitions centered around the recognition of named entities, such as the names of people and organizations, in news articles. As society became more data-oriented, with easy online access to both structured and unstructured data, new applications of structure extraction emerged. Now, there is interest in converting our personal desktops to structured databases, converting the knowledge in scientific publications to structured records, and harnessing the Internet for structured fact-finding queries. Consequently, many different communities of researchers are bringing in techniques from machine learning, databases, information retrieval, and computational linguistics for various aspects of the information extraction problem. This review is a survey of more than two decades of information extraction research from these diverse communities. We create a taxonomy of the field along various dimensions derived from the nature of the extraction task, the techniques used for extraction, the variety of input resources exploited, and the type of output produced. We elaborate on rule-based and statistical methods for entity and relationship extraction. In each case, we highlight the different kinds of models for capturing the diversity of clues driving the recognition process and the algorithms for training and efficiently deploying the models. We survey techniques for optimizing the various steps in an information extraction pipeline, adapting to dynamic data, integrating with existing entities, and handling uncertainty in the extraction process.

Item Type:	Article
Source:	Copyright of this article belongs to Now publishers inc
ID Code:	128387
Deposited On:	20 Oct 2022 04:08
Last Modified:	14 Nov 2022 11:08
