Publications

Publications
Publications
We strongly believe in open source and giving to our community. We work directly with researchers in academia and seek out new perspectives with our intern and fellowship programs. We generalize our solutions and release them to the world as open source projects. We host discussions and publish our results.

Publications

SIGIR 2013: 193-202

Faster and smaller inverted indices with treaps.

Roberto Konow, Gonzalo Navarro, Charles L. A. Clarke, Alejandro López-Ortiz

We introduce a new representation of the inverted index that performs faster ranked unions and intersections while using less space. Our index is based on the treap data structure, which allows us to intersect/merge the document identifiers while simultaneously thresholding by frequency, instead of the costlier two-step classical processing methods. To achieve compression we represent the treap topology using compact data structures. Further, the treap invariants allow us to elegantly encode differentially both document identifiers and frequencies. Results show that our index uses about 20% less space, and performs queries up to three times faster, than state-of-the-art compact representations.

Keywords
Data Compression Conference 2013: 351-360

Faster Compact Top-k Document Retrieval

Roberto Konow, Gonzalo Navarro:

An optimal index solving top-k document retrieval [Navarro and Nekrich, SODA'12] takes O(m+k) time for a pattern of length m, but its space is at least 80n bytes for a collection of n symbols. We reduce it to 1.5n-3n bytes, with O(m + (k+log log n)log log n) time, on typical texts. The index is up to 25 times faster than the best previous compressed solutions, and requires at most 5% more space in practice (and in some cases as little as one half). Apart from replacing classical by compressed data structures, our main idea is to replace suffix tree sampling by frequency thresholding to achieve compression.

Keywords
Categories

Physician Incentives and Treatment Choices in Heart Attack Management

Dominic Coey

We estimate how physicians’ financial incentives affect their treatment choices in heart Attack management, using a large dataset of private health insurance claims. Different insurance plans pay physicians different amounts for the same services, generating the required variation in financial incentives.

We begin by presenting evidence that, unconditionally, plans that pay physicians more for more invasive treatments are associated with a considerably larger fraction of such treatments. To interpret this correlation as causal, we continue by showing that it survives conditioning on a rich set of diagnosis and provider-specific variables.

We perform a host of additional checks verifying that differences in unobservable patient or provider characteristics across plans are unlikely to be driving our results. We find that physicians’ treatment choices respond positively to the payments they receive, and that the response is quite large.

If physicians received bundled payments instead of fee-for-service incentives, for example, heart attack management would become considerably more conservative. Our estimates imply that 20 percent of patients would receive different treatments, physician costs would decrease by 27 percent, and social welfare would increase.

Keywords
Categories
To appear in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Mobile Vision Workshop, 2013.

Style Finder: Fine-Grained Clothing Style Recognition and Retrieval

Wei Di, Catherine Wah, Anurag Bhardwaj, Robinson Piramuthu, Neel Sundaresan

With the rapid proliferation of smartphones and tablet computers, search has moved beyond text to other modalities like images and voice. For many applications like Fashion, visual search offers a compelling interface that can capture stylistic visual elements beyond color and pattern that cannot be as easily described using text.

However, extracting and matching such attributes remains an extremely challenging task due to high variability and deformability of clothing items. In this paper, we propose a fine-grained learning model and multimedia retrieval framework to address this problem.

First, an attribute vocabulary is constructed using human annotations obtained on a novel fine-grained clothing dataset. This vocabulary is then used to train a fine-grained visual recognition system for clothing styles.

We report benchmark recognition and retrieval results on Women's Fashion Coat Dataset and illustrate potential mobile applications for attribute-based multimedia retrieval of clothing items and image annotation.

KDD-2013

Palette Power: Enabling Visual Search through Colors

Anurag Bhardwaj, Atish DasSarma, Wei Di, Raffay Hamid, Robinson Piramuthu, Neel Sundaresan

xplosion of mobile devices with cameras, online search has moved beyond text to other modalities like images, voice, and writing. For many applications like Fashion, image-based search offers a compelling interface as compared to text forms by better capturing the visual attributes.

In this paper we present a simple and fast search algorithm that uses color as the main feature for building visual search. We show that low level cues such as color can be used to quantify image similarity and also to discriminate among products with different visual appearances.

We demonstrate the effectiveness of our approach through a mobile shopping application (eBay Fashion App available at https://itunes.apple.com/us/app/ebay-fashion/id378358380?mt=8 and eBay image swatch is the feature indexing millions of real world fashion images).

Our approach outperforms several other state-of-the-art image retrieval algorithms for large scale image data.

To appear in Proceedings of IEEE International Conference on Computer Vision & Pattern Recognition (CVPR) 2013

Dense Non-Rigid Point-Matching Using Random Projections

Raffay Hamid, Dennis DeCoste, Chih-Jen Lin

We present a robust and efficient technique for matching dense sets of points undergoing non-rigid spatial transformations. Our main intuition is that the subset of points that can be matched with high confidence should be used to guide the matching procedure for the rest.

We propose a novel algorithm that incorporates these high-confidence matches as a spatial prior to learn a discriminative subspace that simultaneously encodes both the feature similarity as well as their spatial arrangement.

Conventional subspace learning usually requires spectral decomposition of the pair-wise distance matrix across the point-sets, which can become inefficient even for moderately sized problems. To this end, we propose the use of random projections for approximate subspace learning, which can provide significant time improvements at the cost of minimal precision loss.

This efficiency gain allows us to iteratively find and remove high-confidence matches from the point sets, resulting in high recall. To show the effectiveness of our approach, we present a systematic set of experiments and results for the problem of dense non-rigid image-feature matching.

Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL), 2012.

Structuring E-Commerce Inventory

Karin Maugé, Khashayar Rohanimanesh, Jean-David Ruvini

Large e-commerce enterprises feature millions of items entered daily by a large variety of sellers. While some sellers provide rich, structured descriptions of their items, a vast majority of them provide unstructured natural language descriptions.

In the paper we present a 2 steps method for structuring items into descriptive properties. The first step consists in unsupervised property discovery and extraction. The second step involves supervised property synonym discovery using a maximum entropy based clustering algorithm.

We evaluate our method on a year worth of eCommerce data and show that it achieves excellent precision with good recall.

Keywords
ACL 2012: 805-814

Structuring E-Commerce Inventory

Karin Maugé, Khashayar Rohanimanesh, Jean-David Ruvini

Large e-commerce enterprises feature millions of items entered daily by a large variety of sellers. While some sellers provide rich, structured descriptions of their items, a vast majority of them provide unstructured natural language descriptions.

In the paper we present a 2 steps method for structuring items into descriptive properties. The first step consists in unsupervised property discovery and extraction. The second step involves supervised property synonym discovery using a maximum entropy based clustering algorithm.

We evaluate our method on a year worth of eCommerce data and show that it achieves excellent precision with good recall.

Keywords
CIKM 2012:596-604

Large-scale Item Categorization for e-Commerce

Dan Shen, Jean-David Ruvini, Badrul Sarwar

This paper studies the problem of leveraging computationally intensive classification algorithms for large scale text categorization problems. We propose a hierarchical approach which decomposes the classification problem into a coarse level task and a fine level task.

A simple yet scalable classifier is applied to perform the coarse level classification while a more sophisticated model is used to separate classes at the fine level. However, instead of relying on a human-defined hierarchy to decompose the problem, we use a graph algorithm to discover automatically groups of highly similar classes.

As an illustrative example, we apply our approach to real-world industrial data from eBay, a major e-commerce site where the goal is to classify live items into a large taxonomy of categories.

In such industrial setting, classification is very challenging due to the number of classes, the amount of training data, the size of the feature space and the real world requirements on the response time. We demonstrate through extensive experimental evaluation that (1) the proposed hierarchical approach is superior to flat models, and (2) the data-driven extraction of latent groups works significantly better than the existing human-defined hierarchy.

Keywords

Pages