Indexing and querying over versioned text
Current Information Retrieval systems use inverted index
structures for efficient query processing. Due to the extremely large size of many data sets, these index structures
are usually kept in compressed form, and many techniques
for optimizing compressed size and query processing speed
have been proposed. In this work, we focus on versioned
document collections, that is, collections where each document is modified over time, resulting in multiple versions of
the document. Examples include Wikipedia and historical web pages stored in the Internet Archive.
Our work is organized into the following layers.
- Effective index representation. In versioned collections, a document is represented by multiple versions, which can make the index very large.
Consecutive versions of the same document are usually similar, and several researchers have explored ideas
for exploiting this similarity to decrease index size. We propose new index compression techniques for versioned document collections that achieve reductions in index size over previous methods.
- Efficient index traversal. Building on the compressed index, we build efficient query processing on top of this index structure.
- Support for temporal range queries. Search queries over versioned document collections often use keywords as well as temporal constraints, most commonly a time range of interest.
Building on efficient query processing over versioned documents, and by employing our index structure and index traversal over versioned text, we achieve faster temporal range queries than
previous research.
- Aggregate temporal query processing. On top of temporal range queries, there are more complex query types, e.g., durable top-k (U et al. 2010), which requires fetching the
top-k documents over a range of time. We show that such queries can be supported efficiently by our temporal range query framework.
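To make the first and third layers more concrete, the following is a minimal Python sketch, not the compression scheme or query processor from the publications below. It assumes a toy representation in which each posting covers a run of consecutive versions of a document that contain a term, so that similar consecutive versions share one posting, and it assumes a version is valid from its timestamp until the next version's timestamp. The names `build_index` and `temporal_query`, and the sample data, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Build a toy versioned index (illustrative assumption, not the papers' format).

    docs maps doc_id -> list of (timestamp, text), versions in time order.
    Returns term -> list of (doc_id, first_version, last_version), with one
    posting per run of consecutive versions that contain the term.
    """
    index = defaultdict(list)
    for doc_id, versions in docs.items():
        open_runs = {}  # term -> first version number of its current run
        for v, (_, text) in enumerate(versions):
            terms = set(text.lower().split())
            # Close runs for terms that disappeared in this version.
            for t in list(open_runs):
                if t not in terms:
                    index[t].append((doc_id, open_runs.pop(t), v - 1))
            # Open runs for terms that newly appeared.
            for t in terms:
                open_runs.setdefault(t, v)
        # Close the runs that are still open at the last version.
        for t, start in open_runs.items():
            index[t].append((doc_id, start, len(versions) - 1))
    return index


def temporal_query(index, docs, terms, t_start, t_end):
    """Return (doc_id, version) pairs that contain all terms and whose
    validity interval overlaps the time range [t_start, t_end]."""
    hits = None
    for term in terms:
        matched = set()
        for doc_id, first, last in index.get(term.lower(), []):
            matched.update((doc_id, v) for v in range(first, last + 1))
        hits = matched if hits is None else hits & matched
    results = []
    for doc_id, v in sorted(hits or ()):
        versions = docs[doc_id]
        begin = versions[v][0]
        # The last version is assumed to stay valid indefinitely.
        end = versions[v + 1][0] if v + 1 < len(versions) else float("inf")
        if begin <= t_end and end > t_start:
            results.append((doc_id, v))
    return results


# Hypothetical sample collection: document "a" has three versions.
docs = {
    "a": [(1, "index compression"),
          (5, "index compression query"),
          (9, "temporal query")],
    "b": [(2, "query index")],
}
index = build_index(docs)
in_range = temporal_query(index, docs, ["index", "query"], 6, 8)
```

In this toy collection the term "index" occurs in two consecutive versions of document "a" but is stored as a single posting, which is the kind of redundancy that similarity between versions makes available. A real system would compress the postings and avoid expanding runs into individual versions at query time.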
Contact: Jinru He
Primary publications:
- J. He, H. Yan, and T. Suel. (2009) Compact Full-Text Indexing of Versioned Document Collections. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM), pp. 415-424, Hong Kong, ACM Press, November.
- J. He, J. Zeng, and T. Suel. (2010) Improved Index Compression Techniques for Versioned Document Collections. In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM), pp. 1239-1248, Toronto, Canada, October.
- J. He and T. Suel. (2011) Faster Temporal Range Queries over Versioned Text. To be published in Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Beijing, China, July.
Source code used in the CIKM 2009 and CIKM 2010 papers can be found here: CIKM.zip
This research was supported by NSF Grant IIS-0803605, "Efficient and Effective Search Services over Archival Webs", a grant from Google, and an NYU seed grant. We also thank the Internet Archive
for providing access to the Ireland data set.
Last modified: 18 June 2011