Faster and smaller inverted indices with treaps.

SIGIR 2013: 193-202
Roberto Konow, Gonzalo Navarro, Charles L. A. Clarke, Alejandro López-Ortiz

We introduce a new representation of the inverted index that performs faster ranked unions and intersections while using less space. Our index is based on the treap data structure, which allows us to intersect or merge the document identifiers while simultaneously thresholding by frequency, instead of using the costlier two-step classical processing methods. To achieve compression, we represent the treap topology using compact data structures. Further, the treap invariants allow us to encode both document identifiers and frequencies differentially. Results show that our index uses about 20% less space, and performs queries up to three times faster, than state-of-the-art compact representations.
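
The core idea can be illustrated with a minimal sketch, assuming a posting list of (document identifier, frequency) pairs stored in a pointer-based treap: a binary search tree on document identifiers that is also a max-heap on frequencies. This is not the compact representation from the paper, and the names Node, build, at_least, find and intersect are hypothetical. It only shows why the heap order lets a traversal drop whole subtrees once a frequency threshold is no longer met.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    doc: int                       # document identifier (search-tree key)
    freq: int                      # term frequency (heap priority)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(postings: List[Tuple[int, int]]) -> Optional[Node]:
    """Build a treap from (doc, freq) pairs sorted by doc.

    The highest-frequency posting becomes the root (leftmost on ties).
    Quadratic in the worst case; kept simple for illustration.
    """
    if not postings:
        return None
    i = max(range(len(postings)), key=lambda j: postings[j][1])
    doc, freq = postings[i]
    return Node(doc, freq, build(postings[:i]), build(postings[i + 1:]))


def at_least(node: Optional[Node], min_freq: int, out: List[int]) -> None:
    """Append, in increasing doc order, every doc with freq >= min_freq."""
    if node is None or node.freq < min_freq:
        return  # heap order: no descendant can have a higher frequency
    at_least(node.left, min_freq, out)
    out.append(node.doc)
    at_least(node.right, min_freq, out)


def find(node: Optional[Node], doc: int, min_freq: int) -> Optional[int]:
    """Return the frequency of doc, or None if absent or below min_freq."""
    while node is not None and node.freq >= min_freq:
        if doc == node.doc:
            return node.freq
        node = node.left if doc < node.doc else node.right
    return None


def intersect(a: Optional[Node], b: Optional[Node], min_freq: int) -> List[int]:
    """Docs present in both lists with frequency >= min_freq in each."""
    candidates: List[int] = []
    at_least(a, min_freq, candidates)
    return [d for d in candidates if find(b, d, min_freq) is not None]


list_a = build([(1, 3), (4, 1), (7, 5), (9, 2), (12, 4)])
list_b = build([(4, 6), (7, 2), (9, 1), (12, 3)])
result = intersect(list_a, list_b, 2)
print(result)  # [7, 12]
```

The per-list threshold used here is a simplification chosen for brevity. The point of the sketch is that membership testing and frequency filtering happen in the same descent, rather than intersecting first and filtering afterwards.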

Another publication from the same author: Roberto Konow

Information Systems 60: 34-49 (2016)

Aggregated 2D range queries on clustered points.

Nieves R. Brisaboa, Guillermo de Bernardo, Roberto Konow, Gonzalo Navarro, Diego Seco

Efficient processing of aggregated range queries on two-dimensional grids is a common requirement in information retrieval and data mining systems, for example in Geographic Information Systems and OLAP cubes. We introduce a technique to represent grids supporting aggregated range queries that requires little space when the data points in the grid are clustered, which is common in practice. We show how this general technique can be used to support two important types of aggregated queries, which are ranked range queries and counting range queries. Our experimental evaluation shows that this technique can speed up aggregated queries up to more than an order of magnitude, with a small space overhead.


Another publication from the same category: Machine Learning and Data Science

WWW '17, Perth, Australia, April 2017

Drawing Sound Conclusions from Noisy Judgments

David Goldberg, Andrew Trotman, Xiao Wang, Wei Min, Zongru Wan

The quality of a search engine is typically evaluated using hand-labeled data sets, where the labels indicate the relevance of documents to queries. Often the number of labels needed is too large to be created by the best annotators, and so less accurate labels (e.g. from crowdsourcing) must be used. This introduces errors in the labels, and thus errors in standard precision metrics (such as P@k and DCG): the lower the quality of the judge, the more errors in the labels and, consequently, the less accurate the metric. We introduce equations and algorithms that can adjust the metrics to the values they would have had if there were no annotation errors.

This is especially important when two search engines are compared on their metrics. We give examples where one engine appeared to be statistically significantly better than the other, but the effect disappeared after the metrics were corrected for annotation error. In other words, the evidence supporting a statistical difference was illusory, and caused by a failure to account for annotation error.
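
The abstract does not give the paper's equations, so the following is only a rough sketch of the general idea, using the classic misclassification correction (often called the Rogan-Gladen estimator) rather than the authors' method. It assumes a binary relevance label and a judge whose error rates are known: tpr is the probability that a relevant document is labeled relevant, and fpr is the probability that a non-relevant document is labeled relevant. The function name and the numbers are hypothetical.

```python
def corrected_precision(observed: float, tpr: float, fpr: float) -> float:
    """Estimate true precision from precision measured with noisy labels.

    Assumed noise model: observed = p * tpr + (1 - p) * fpr,
    which solves to p = (observed - fpr) / (tpr - fpr).
    """
    if not 0.0 <= fpr < tpr <= 1.0:
        raise ValueError("need 0 <= fpr < tpr <= 1")
    p = (observed - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, p))  # clamp sampling noise into [0, 1]


# Hypothetical judge: catches 90% of relevant documents and
# wrongly marks 20% of non-relevant documents as relevant.
estimate = corrected_precision(observed=0.55, tpr=0.9, fpr=0.2)
print(round(estimate, 3))  # 0.5
```

Under this model the observed metric is a compressed and shifted version of the true one, so a difference between two engines, and its uncertainty, changes once the correction is applied.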