Compressed data structures for annotated web search

Chakrabarti, Soumen ; Kasturi, Sasidhar ; Balakrishnan, Bharath ; Ramakrishnan, Ganesh ; Saraf, Rohit (2012) Compressed data structures for annotated web search In: WWW '12 Proceedings of the 21st International Conference on World Wide Web, April 16 - 20, 2012, Lyon, France.

Full text not available from this repository.

Official URL: http://dl.acm.org/citation.cfm?id=2187854

Abstract

Entity relationship search at Web scale depends on adding dozens of entity annotations to each of billions of crawled pages and indexing the annotations at rates comparable to regular text indexing. Even small entity search benchmarks from TREC and INEX suggest that the entity catalog support thousands of entity types and tens to hundreds of millions of entities. The above targets raise many challenges, major ones being the design of highly compressed data structures in RAM for spotting and disambiguating entity mentions, and highly compressed disk-based annotation indices. These data structures cannot be readily built upon standard inverted indices. Here we present a Web scale entity annotator and annotation index. Using a new workload-sensitive compressed multilevel map, we fit statistical disambiguation models for millions of entities within 1.15GB of RAM, and spend about 0.6 core-milliseconds per disambiguation. In contrast, DBPedia Spotlight spends 158 milliseconds, Wikipedia Miner spends 21 milliseconds, and Zemanta spends 9.5 milliseconds. Our annotation indices use ideas from vertical databases to reduce storage by 30%. On 40x8 cores with 40x3 disk spindles, we can annotate and index, in about a day, a billion Web pages with two million entities and 200,000 types from Wikipedia. Index decompression and scan speed are comparable to MG4J.
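The figures quoted in the abstract can be put on a common footing with some back-of-envelope arithmetic. The sketch below is only an illustration: it assumes "about a day" means exactly 24 hours, and it divides the competing systems' milliseconds directly by the 0.6 core-milliseconds figure, which may not be a like-for-like comparison. The function and constant names are hypothetical and do not come from the paper.

```python
# Figures quoted in the abstract.
PAGES = 1_000_000_000
MACHINES, CORES_PER_MACHINE = 40, 8
SECONDS_PER_DAY = 24 * 60 * 60  # assumption: "about a day" taken as exactly 24 hours

OURS_MS = 0.6  # core-milliseconds per disambiguation
OTHERS_MS = {
    "DBPedia Spotlight": 158.0,
    "Wikipedia Miner": 21.0,
    "Zemanta": 9.5,
}


def throughput():
    """Return (total cores, pages/second overall, pages/second per core)."""
    total_cores = MACHINES * CORES_PER_MACHINE
    pages_per_sec = PAGES / SECONDS_PER_DAY
    return total_cores, pages_per_sec, pages_per_sec / total_cores


def time_ratios():
    """Ratio of each system's per-disambiguation time to the 0.6 ms figure."""
    return {name: ms / OURS_MS for name, ms in OTHERS_MS.items()}


if __name__ == "__main__":
    cores, overall, per_core = throughput()
    print(f"{cores} cores, {overall:.0f} pages/s overall, {per_core:.1f} pages/s per core")
    for name, ratio in time_ratios().items():
        print(f"{name}: {ratio:.1f}x the reported per-disambiguation time")
```

Under these assumptions, the cluster would need to sustain on the order of ten thousand pages per second overall, or a few dozen pages per second per core, to finish in a day.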

Item Type:Conference or Workshop Item (Paper)
Source:Copyright of this article belongs to WWW '12 Proceedings of the 21st International Conference, Association for Computing Machinery.
ID Code:100018
Deposited On:12 Feb 2018 12:26
Last Modified:12 Feb 2018 12:26
