http://www2009.eprints.org/46/
Compressed Web Indexes
Chierichetti, Flavio
Kumar, Ravi
Raghavan, Prabhakar
Web search engines use indexes to efficiently retrieve pages containing specified query terms, as well as pages linking to specified pages. The problem of compressed indexes that permit such fast retrieval has a long history. We consider the problem: assuming that the terms in (or links to) a page are generated from a probability distribution, how compactly can we build such indexes that allow fast retrieval? Of particular interest is the case when the probability distribution is Zipfian (or a similar power law), since these are the distributions that arise on the web. We obtain sharp bounds on the space requirement of Boolean indexes for text documents that follow Zipf’s law. In the process we develop a general technique that applies to any probability distribution, not necessarily a power law; this is the first analysis of compression in indexes under arbitrary distributions. Our bounds lead to quantitative versions of rules of thumb that are folklore in indexing. Our experiments on several document collections show that the distribution of terms appears to follow a double-Pareto law rather than Zipf’s law. Despite widely varying sets of documents, the index sizes observed in the experiments conform well to our theoretical predictions.
2009-04
Conference or Workshop Item
PeerReviewed
application/pdf
http://www2009.eprints.org/46/1/p451.pdf
Chierichetti, Flavio <http://www2009.eprints.org/view/author/Chierichetti=3AFlavio=3A=3A.html> and Kumar, Ravi <http://www2009.eprints.org/view/author/Kumar=3ARavi=3A=3A.html> and Raghavan, Prabhakar <http://www2009.eprints.org/view/author/Raghavan=3APrabhakar=3A=3A.html> (2009) Compressed Web Indexes. In: 18th International World Wide Web Conference, April 20th-24th, 2009, Madrid, Spain.
http://www2009.eprints.org/46/1/