Chakrabarti, Soumen ; Puniyani, Kriti ; Das, Sujatha (2006) Optimizing scoring functions and indexes for proximity search in type-annotated corpora. In: WWW '06 Proceedings of the 15th International Conference on World Wide Web, May 23-26, Edinburgh, Scotland.
Official URL: http://dl.acm.org/citation.cfm?id=1135882
Abstract
We introduce a new, powerful class of text proximity queries: find an instance of a given "answer type" (person, place, distance) near "selector" tokens matching given literals or satisfying given ground predicates. An example query is type=distance NEAR Hamburg Munich. Nearness is defined as a flexible, trainable parameterized aggregation function of the selectors, their frequency in the corpus, and their distance from the candidate answer. Such queries provide a key data reduction step for information extraction, data integration, question answering, and other text-processing applications. We describe the architecture of a next-generation information retrieval engine for such applications, and investigate two key technical problems faced in building it. First, we propose a new algorithm that estimates a scoring function from past logs of queries and answer spans. Plugging the scoring function into the query processor gives high accuracy: typically, an answer is found at rank 2-4. Second, we exploit the skew in the distribution over types seen in query logs to optimize the space required by the new index structures required by our system. Extensive performance studies with a 10GB, 2-million document TREC corpus and several hundred TREC queries show both the accuracy and the efficiency of our system. From an initial 4.3GB index using 18,000 types from WordNet, we can discard 88% of the space, while inflating query times by a factor of only 1.9. Our final index overhead is only 20% of the total index space needed.
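To make the idea of a "nearness" score concrete, here is a minimal sketch of one possible scoring function for a query such as type=distance NEAR Hamburg Munich. It is not the function learned in the paper: the exponential decay, the smoothed inverse-document-frequency weight, the sum-over-selectors aggregation, the function names, the toy sentence, and the document frequencies are all assumptions chosen for illustration. Only the three ingredients (the selectors, their corpus frequency, and their distance from the candidate) come from the abstract.

```python
import math


def idf(doc_freq, num_docs):
    """Assumed weighting: rarer selectors count more (smoothed inverse document frequency)."""
    return math.log(1.0 + num_docs / (1.0 + doc_freq))


def score_candidate(tokens, cand_pos, selector_weights, decay_rate=0.5):
    """Sum, over selectors, of the selector weight times a decay of the
    token distance from the candidate to the nearest match of that selector."""
    total = 0.0
    for selector, weight in selector_weights.items():
        gaps = [abs(i - cand_pos) for i, tok in enumerate(tokens) if tok == selector]
        if gaps:  # selectors that never occur contribute nothing
            total += weight * math.exp(-decay_rate * min(gaps))
    return total


def rank_candidates(tokens, cand_positions, selector_weights, decay_rate=0.5):
    """Return (score, position) pairs, best candidate first."""
    scored = [
        (score_candidate(tokens, pos, selector_weights, decay_rate), pos)
        for pos in cand_positions
    ]
    return sorted(scored, reverse=True)


# Toy, made-up sentence; positions 4 and 11 stand in for tokens that an
# annotator has tagged with the answer type "distance".
tokens = ["Hamburg", "to", "Munich", "is", "600km", "by", "road",
          "and", "the", "stadium", "is", "2km", "away"]
distance_positions = [4, 11]

# Hypothetical document frequencies in a 2-million-document corpus.
num_docs = 2_000_000
selector_weights = {
    "Hamburg": idf(1200, num_docs),
    "Munich": idf(1500, num_docs),
}

ranked = rank_candidates(tokens, distance_positions, selector_weights)
best_score, best_pos = ranked[0]
print(tokens[best_pos], round(best_score, 3))
```

In this sketch `decay_rate` plays the role of a trainable parameter: a larger value penalizes distant selectors more sharply. The candidate at position 4 ranks first because both selectors occur within a few tokens of it, while the candidate at position 11 is nine or more tokens away from either selector.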
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Source: | Copyright of this article belongs to WWW '06 Proceedings of the 15th International Conference, Association for Computing Machinery. |
Keywords: | Indexing; Annotated; Text |
ID Code: | 100082 |
Deposited On: | 12 Feb 2018 12:27 |
Last Modified: | 12 Feb 2018 12:27 |