We strongly believe in open source and giving to our community. We work directly with researchers in academia and seek out new perspectives with our intern and fellowship programs. We generalize our solutions and release them to the world as open source projects. We host discussions and publish our results.



On the Embeddability of Random Walk Distances

Adelbert Chang, Atish Das Sarma, Haitao Zheng, Ben Y.Zhao

Analysis of large graphs is critical to the ongoing growth of search engines and social networks. One class of queries centers around node affinity, often quantified by random-walk distances between node pairs, including hitting time, commute time, and personalized PageRank (PPR).

Despite the potential of these "metrics," they are rarely, if ever, used in practice, largely due to extremely high computational costs. In this paper, we investigate methods to scalably and efficiently compute random-walk distances, by "embedding" graphs and distances into points and distances in geometric coordinate spaces.

We show that while existing graph coordinate systems (GCS) can accurately estimate shortest path distances, they produce significant errors when embedding random-walk distances.

Based on our observations, we propose a new graph embedding system that explicitly accounts for per-node graph properties that affect random walk. Extensive experiments on a range of graphs show that our new approach can accurately estimate both symmetric and asymmetric random-walk distances.

Once a graph is embedded, our system can answer queries between any two nodes in 8 microseconds, orders of magnitude faster than existing methods. Finally, we show that our system produces estimates that can replace ground truth in applications with minimal impact on application output.

Tutorial at CIKM-2013

Beyond Skylines and Top-k Queries: Representative Databases and e-Commerce Product Search

Atish Das Sarma, Ashwin Lall, Nish Parikh, Neel Sundaresan

Skyline queries have been a topic of intense study in the database area for over a decade now. Similarly, the top-k retrieval query has been heavily investigated by both the database as well as the web research communities. This tutorial will delve into the background of these two areas, and specifically focus on the recent challenges with respect to returning a small set of results to users, as well as requiring minimal intervention or input from them.

These are two main concerns with skylines and top-k respectively, and therefore have drawn a great deal of attention in the recent years with several interesting ideas being proposed in the research community. This tutorial will cover the current approaches to representative database selection. We will focus on both the theoretical models as well as the practical aspects from an industry standpoint.

The topics of covered in this tutorial will include identifying representative subsets of the skyline set, interaction based approaches, e-commerce product search, and leveraging aggregate user preference statistics.

SIAM 2013

Probabilistic Combination of Classifier and Cluster Ensembles for Non-transductive Learning

Ayan Acharya, Eduardo R.Hruschka, Joydeep Ghosh, Badrul Sarwar, Jean-David Ruvini

Unsupervised models can provide supplementary soft constraints to help classify new target data under the assumption that similar objects in the target set are more likely to share the same class label. Such models can also help detect possible dierences between training and target distributions,

which is useful in applications where concept drift may take place. This paper describes a Bayesian frame work that takes as input class labels from existing classefiers (designed based on labeled data from the source domain),

as well as cluster labels from a cluster ensemble operating solely on the target data to be classified and yields a con-ensus labeling of the target data. This framework is particularly useful when the statistics of the target data drift or change from those of the training data.

We also show that the proposed framework is privacy-aware and allows performing distributed learning when data/models have sharing restrictions. Experiments show that our framework can yield superior results to those provided by applying classifier ensembles only.

Lightning Talk and Poster @ Extremely Large Databases XLDB 2013.

Building a Network of E-commerce Concepts

Sandip Gaikwad, Sanjay Ghatare, Nish Parikh, Rajendra Shinde

We present a method for developing a network of e-commerce concepts. We define concepts as collection of terms that represent product entities or commerce ideas that users are interested in. We start by looking at large corpora (Billions) of historical eBay buyer queries and seller item titles.

We approach the problem of concept extraction from corpora as a market-baskets problem by adapting statistical measures of support and confidence. The concept-centric meta-data extraction pipeline is built over a map-reduce framework. We constrain the concepts to be both popular and concise.

Evaluation of our algorithm shows that high precision concept sets can be automatically mined. The system mines the full spectrum of precise e-commerce concepts ranging all the way from "ipod nano" to "I'm not a plastic bag" and from "wakizashi sword" to "mastodon skeleton".

Once the concepts are detected, they are linked into a network using different metrics of semantic similarity between concepts. This leads to a rich network of e-commerce vocabulary. Such a network of concepts can be the basis of enabling powerful applications like e-commerce search and discover as well as automatic e-commerce taxonomy generation. We present details about the extraction platform, and algorithms for segmentation of short snippets of e-commerce text as well as detection and linking of concepts.

In proceedings of The 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2013. 829-836. (Best Paper Award Winner)

Chelsea Won, and You Bought a T-shirt: Characterizing the Interplay Between Twitter and E-Commerce

Haipeng Zhang, Nish Parikh, Neel Sundaresan

The popularity of social media sites like Twitter and Facebook opens up interesting research opportunities for understanding the interplay of social media and e-commerce. Most research on online behavior, up until recently, has focused mostly on social media behaviors and e-commerce behaviors independently.

In our study we choose a particular global ecommerce platform (eBay) and a particular global social media platform (Twitter). We quantify the characteristics of the two individual trends as well as the correlations between them.

We provide evidences that about 5% of general eBay query streams show strong positive correlations with the corresponding Twitter mention streams, while the percentage jumps to around 25% for trending eBay query streams. Some categories of eBay queries, such as 'Video Games' and 'Sports', are more likely to have strong correlations.

We also discover that eBay trend lags Twitter for correlated pairs and the lag differs across categories. We show evidences that celebrities' popularities on Twitter correlate well with their relevant search and sales on eBay.

The correlations and lags provide predictive insights for future applications that might lead to instant merchandising opportunities for both sellers and e-commerce platforms.

CIKM ’13 Proceedings of the 22nd ACM international conference on Conference on information & knowledge management Pages 1137-1146

On Segmentation of eCommerce Queries

Nish Parikh, Prasad Sriram, Mohammad AlHasan

In this paper, we present QSEGMENT, a real-life query segmentation system for eCommerce queries. QSEGMENT uses frequency data from the query log which we call buyers′ data and also frequency data from product titles what we call sellers′ data.

We exploit the taxonomical structure of the marketplace to build domain specific frequency models. Using such an approach, QSEGMENT performs better than previously described baselines for query segmentation.

Also, we perform a large scale evaluation by using an unsupervised IR metric which we refer to as user-intent-score. We discuss the overall architecture of QSEGMENT as well as various use cases and interesting observations around segmenting eCommerce queries.

in Proceedings of the 22nd international conference on World Wide Web (WWW ’13)

Anatomy of a Web-Scale Resale Market: A Data Mining Approach

Yuchen Zhao, Neel Sundaresan, Zeqian Shen, Philip Yu

Reuse and remarketing of content and products is an integral part of the internet. As E-commerce has grown, online resale and secondary markets form a significant part of the commerce space. The intentions and methods for reselling are diverse. In this paper, we study an instance of such markets that affords interesting data at large scale for mining purposes to understand the properties and patterns of this online market.

As part of knowledge discovery of such a market, we first formally propose criteria to reveal unseen resale behaviors by elastic matching identification (EMI) based on the account transfer and item similarity properties of transactions.

Then, we present a large-scale system that leverages MapReduce paradigm to mine millions of online resale activities from petabyte scale heterogeneous ecommerce data. With the collected data, we show that the number of resale activities leads to a power law distribution with a ‘long tail’, where a significant share of users only resell in very low numbers and a large portion of resales come from a small number of highly active resellers.

We further conduct a comprehensive empirical study from different aspects of resales, including the temporal, spatial patterns, user demographics, reputation and the content of sale postings. Based on these observations, we explore the features related to “successful” resale transactions and evaluate if they can be predictable.

We also discuss uses of this information mining for business insights and user experience on a real-world online marketplace.

To appear in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2013.

Large-Scale Video Summarization Using Web-Image Priors

Aditya Khosla, Raffay Hamid, Chih-Jen Lin, Neel Sundaresan
Given the enormous growth in user-generated videos, it is becoming increasingly important to be able to navigate them efficiently. As these videos are generally of poor quality, summarization methods designed for well-produced videos do not generalize to them. To address this challenge, we propose to use web-images as a prior to facilitate summarization of user-generated videos.
Our main intuition is that people tend to take pictures of objects to capture them in a maximally informative way. Such images could therefore be used as prior information to summarize videos containing a similar set of objects.
In this work, we apply our novel insight to develop a summarization algorithm that uses the web-image based prior information in an unsupervised manner. Moreover, to automatically evaluate summarization algorithms on a large scale, we propose a framework that relies on multiple summaries obtained through crowdsourcing.
We demonstrate the effectiveness of our evaluation framework by comparing its performance to that of multiple human evaluators. Finally, we present results for our framework tested on hundreds of user-generated videos.