Automatic text segmentation for extracting structured records

Borkar, Vinayak R.; Deshmukh, Kaustubh; Sarawagi, Sunita (2001) Automatic text segmentation for extracting structured records. ACM SIGMOD.

Abstract

In this paper we present a method for automatically segmenting unformatted text records into structured elements. Several useful data sources today are human-generated as continuous text, whereas convenient usage requires the data to be organized as structured records. A prime motivation is the warehouse address cleaning problem of transforming dirty addresses, stored in large corporate databases as a single text field, into subfields like "City" and "Street". Existing tools rely on hand-tuned, domain-specific rule-based systems. We describe a tool, DATAMOLD, that learns to automatically extract structure when seeded with a small number of training examples. The tool extends Hidden Markov Models (HMMs) to build a powerful probabilistic model that corroborates multiple sources of information, including the sequence of elements, their length distribution, distinguishing words from the vocabulary, and an optional external data dictionary. Experiments on real-life datasets yielded accuracy of 90% on Asian addresses and 99% on US addresses. In contrast, existing information extraction methods based on rule-learning techniques yielded considerably lower accuracy.
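
To make the idea of HMM-based segmentation concrete, the following is a minimal sketch of Viterbi decoding over a plain HMM whose states are address fields. It is not DATAMOLD's model: the abstract describes an enhanced HMM that also uses length distributions, which is not reproduced here. The field names, the token classes, the word lists, and all probabilities are hypothetical values chosen for illustration; a real system would estimate them from labelled training examples.

```python
import math

# Hypothetical address fields used as HMM states.
STATES = ["HouseNo", "Street", "City"]

# Hypothetical hand-set parameters (illustration only, not from the paper).
START = {"HouseNo": 0.8, "Street": 0.15, "City": 0.05}
TRANS = {
    "HouseNo": {"HouseNo": 0.10, "Street": 0.80, "City": 0.10},
    "Street":  {"HouseNo": 0.05, "Street": 0.55, "City": 0.40},
    "City":    {"HouseNo": 0.05, "Street": 0.05, "City": 0.90},
}
# Emission probabilities over coarse token classes rather than raw words.
EMIT = {
    "HouseNo": {"digits": 0.90, "street_word": 0.02, "city_word": 0.02, "other": 0.06},
    "Street":  {"digits": 0.05, "street_word": 0.45, "city_word": 0.05, "other": 0.45},
    "City":    {"digits": 0.02, "street_word": 0.03, "city_word": 0.75, "other": 0.20},
}

# Stand-ins for distinguishing vocabulary words and an external dictionary.
STREET_WORDS = {"road", "street", "lane", "avenue"}
CITY_DICTIONARY = {"pune", "mumbai", "delhi"}


def token_class(token):
    """Map a raw token to one of the coarse observation classes."""
    t = token.lower()
    if t.isdigit():
        return "digits"
    if t in STREET_WORDS:
        return "street_word"
    if t in CITY_DICTIONARY:
        return "city_word"
    return "other"


def viterbi(tokens):
    """Return the most likely (token, field) labelling under the toy HMM."""
    obs = [token_class(t) for t in tokens]
    # Log-probabilities avoid underflow on long token sequences.
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for o in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return list(zip(tokens, path))


if __name__ == "__main__":
    for token, field in viterbi("12 Main Road Pune".split()):
        print(f"{token}\t{field}")
```

Consecutive tokens that receive the same label form one field, so a labelling such as HouseNo, Street, Street, City yields three structured subfields from a single text string. Mapping tokens to coarse classes is one simple way to fold a dictionary into the emission model; it is a design choice of this sketch rather than a description of the tool.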

Item Type: Article
Source: Copyright of this article belongs to ResearchGate GmbH
ID Code: 128422
Deposited On: 20 Oct 2022 09:23
Last Modified: 14 Nov 2022 11:49
