We strongly believe in open source and giving to our community. We work directly with researchers in academia and seek out new perspectives with our intern and fellowship programs. We generalize our solutions and release them to the world as open source projects. We host discussions and publish our results.


Information Economics and Policy, Volume 24, Issue 1, Pages 3–14, March 2012 (and also NBER Working Paper #16507)

Supply Responses to Digital Distribution: Recorded Music and Live Performances

Chris Nosko, Julie Mortimer, Alan Sorensen

in IEEE Large-scale Data Analysis and Visualization (LDAV) 2012

Visual Analysis of Massive Web Session Data

Zeqian Shen, Jishang Wei, Neel Sundaresan, Kwan-Liu Ma

Tracking and recording users’ browsing behaviors on the web down to individual mouse clicks can create massive web session logs. While such web session data contain valuable information about user behaviors, the ever-increasing data size poses a major challenge for analyzing and visualizing the data.

An efficient data analysis framework requires both powerful computational analysis and interactive visualization. Following the visual analytics mantra "Analyze first, show the important, zoom, filter and analyze further, details on demand", we introduce a two-tier visual analysis system, TrailExplorer2, to discover knowledge from massive log data.

The system supports a visual analysis process iterating between two steps: querying web sessions and visually analyzing the retrieved data. The query happens at the lower tier where terabytes of web session data are processed in a cluster.

At the upper tier, the extracted web sessions, now at a much smaller scale, are visualized on a personal computer for interactive exploration. Our system visualizes a sorted list of web sessions’ temporal patterns and enables data exploration at different levels of detail.

The query, visualization, and exploration process iterates until a satisfactory conclusion is reached. We present two case studies of TrailExplorer2 using real-world session data from eBay to demonstrate the system's effectiveness.

in IEEE Visual Analytics Science and Technology (VAST) 2012

Visual Cluster Exploration of Web Clickstream Data

Jishang Wei, Zeqian Shen, Neel Sundaresan, Kwan-Liu Ma

Web clickstream data are routinely collected to study how users browse the web or use a service. It is clear that the ability to recognize and summarize user behavior patterns from such data is valuable to e-commerce companies. In this paper, we introduce a visual analytics system to explore the various user behavior patterns reflected by distinct clickstream clusters.

In a practical analysis scenario, the system first presents an overview of clickstream clusters using a Self-Organizing Map with Markov chain models.

Then the analyst can interactively explore the clusters through an intuitive user interface, either obtaining a summary of a selected group of data or further refining the clustering result. We evaluated our system using two different datasets from eBay.

Analysts who were working on the same data have confirmed the system’s effectiveness in extracting user behavior patterns from complex datasets and enhancing their ability to reason.
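As a rough illustration of the Markov chain models mentioned above, the sketch below estimates first-order transition probabilities between pages from a handful of clickstreams. It is a minimal example under stated assumptions, not the system's implementation: the function name `transition_probabilities`, the page labels, and the sample sessions are all hypothetical, and the Self-Organizing Map step is not shown.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate first-order Markov transition probabilities.

    `sessions` is a list of clickstreams, each an ordered list of page
    labels (hypothetical labels, chosen for illustration only).
    """
    counts = defaultdict(Counter)
    for session in sessions:
        # Count every consecutive (current page, next page) pair.
        for current_page, next_page in zip(session, session[1:]):
            counts[current_page][next_page] += 1

    probabilities = {}
    for page, next_counts in counts.items():
        total = sum(next_counts.values())
        probabilities[page] = {
            nxt: n / total for nxt, n in next_counts.items()
        }
    return probabilities


# Hypothetical sample data.
sessions = [
    ["home", "search", "item", "checkout"],
    ["home", "search", "search", "item"],
    ["home", "item"],
]
model = transition_probabilities(sessions)
for page, row in model.items():
    print(page, row)
```

A per-cluster model of this kind gives each group of clickstreams a compact summary that can be compared across clusters.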

Proceedings of KDD’12, Beijing, China. August 2012

Bootstrapped Language Identification For Multi-Site Internet Domains

We present an algorithm for language identification, in particular of short documents, for the case of an Internet domain with sites in multiple countries with differing languages.

The algorithm is significantly faster than standard language identification methods, while providing state-of-the-art identification. We bootstrap the algorithm from a language identification based on the site alone, a methodology suitable for any supervised language identification algorithm.

We demonstrate the bootstrapping and the algorithm on eBay email data and on Twitter status update data. The algorithm is deployed at eBay as part of the back-office development data repository.
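The bootstrapping idea described above, using the site a document came from as a noisy language label for training a supervised identifier, can be sketched as follows. The abstract does not specify the paper's algorithm, so this example substitutes a generic character-trigram naive Bayes classifier; the function names, the tiny training set, and the language codes are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(site_labeled_docs):
    """Count trigrams per language.

    `site_labeled_docs` holds (text, language) pairs where the language
    is simply the language of the originating site (a noisy label).
    """
    counts = defaultdict(Counter)
    for text, site_language in site_labeled_docs:
        counts[site_language].update(char_ngrams(text))
    return counts


def identify(text, counts):
    """Pick the language with the highest add-one-smoothed log-likelihood."""
    vocabulary = set()
    for language_counts in counts.values():
        vocabulary.update(language_counts)
    vocab_size = len(vocabulary)

    best_language, best_score = None, float("-inf")
    for language, language_counts in counts.items():
        total = sum(language_counts.values())
        score = sum(
            math.log((language_counts[gram] + 1) / (total + vocab_size))
            for gram in char_ngrams(text)
        )
        if score > best_score:
            best_language, best_score = language, score
    return best_language


# Hypothetical site-labeled documents.
docs = [
    ("the item has shipped today", "en"),
    ("thank you for your payment", "en"),
    ("der artikel wurde heute versendet", "de"),
    ("vielen dank für ihre zahlung", "de"),
]
counts = train(docs)
print(identify("your item", counts))
print(identify("ihre zahlung", counts))
```

Character n-grams are a common choice for short documents because they do not depend on word segmentation and still carry signal in texts of only a few words.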

STOC 2011 (Invited and accepted to SICOMP)

Distributed Verification and Hardness of Distributed Approximation

Atish Das Sarma, Stephan Holzer, Liah Kor, Amos Korman, Danupon Nanongkai, Gopal Pandurangan, David Peleg, Roger Wattenhofer

We study the verification problem in distributed networks, stated as follows. Let H be a subgraph of a network G where each vertex of G knows which edges incident on it are in H. We would like to verify whether H has some properties, e.g., whether it is a tree or whether it is connected (every node knows at the end of the process whether H has the specified property or not). We would like to perform this verification in a decentralized fashion via a distributed algorithm. The time complexity of verification is measured as the number of rounds of distributed communication.

In this paper we initiate a systematic study of distributed verification, and give almost tight lower bounds on the running time of distributed verification algorithms for many fundamental problems such as connectivity, spanning connected subgraph, and s-t cut verification.

We then show applications of these results in deriving strong unconditional time lower bounds on the hardness of distributed approximation for many classical optimization problems including minimum spanning tree, shortest paths, and minimum cut.

Many of these results are the first non-trivial lower bounds for both exact and approximate distributed computation, and they resolve previous open questions. Moreover, our unconditional lower bound on approximating minimum spanning tree (MST) subsumes and improves upon the previous hardness of approximation bound of Elkin [STOC 2004] as well as the lower bound for (exact) MST computation of Peleg and Rubinovich [FOCS 1999]. Our result implies that there can be no distributed approximation algorithm for MST that is significantly faster than the current exact algorithm, for any approximation factor.

Our lower bound proofs show an interesting connection between communication complexity and distributed computing which turns out to be useful in establishing the time complexity of exact and approximate distributed computation of many problems.
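To make the properties being verified concrete, the sketch below checks centrally, on a single machine, whether a subgraph H connects all vertices of G and whether it is a spanning tree. It is only a reference for what the predicates mean. It is not a distributed algorithm and says nothing about the round complexity studied in the paper; the function names and the vertex numbering 0..n-1 are assumptions.

```python
def is_spanning_connected_subgraph(num_vertices, h_edges):
    """Return True if the edges of H connect all vertices 0..num_vertices-1.

    Centralized union-find check, for illustration only.
    """
    parent = list(range(num_vertices))

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = num_vertices
    for u, v in h_edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            components -= 1
    return components == 1


def is_spanning_tree(num_vertices, h_edges):
    """A spanning tree is connected and has exactly n - 1 edges."""
    return (
        len(h_edges) == num_vertices - 1
        and is_spanning_connected_subgraph(num_vertices, h_edges)
    )


print(is_spanning_tree(4, [(0, 1), (1, 2), (2, 3)]))
print(is_spanning_connected_subgraph(4, [(0, 1), (2, 3)]))
```

In the distributed setting each vertex sees only its own incident edges, which is why the number of communication rounds, rather than local computation, becomes the measure of cost.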

PVLDB 2011 (Invited to VLDB Journal Special Issue)

Personalized Social Recommendations – Accurate or Private?

Atish Das Sarma, Ashwin Machanavajjhala, Aleksandra Korolova

With the recent surge of social networks such as Facebook, new forms of recommendations have become possible -- recommendations that rely on one's social connections in order to make personalized recommendations of ads, content, products, and people. Since recommendations may use sensitive information, it is speculated that these recommendations are associated with privacy risks. The main contribution of this work is in formalizing trade-offs between accuracy and privacy of personalized social recommendations.

We study whether "social recommendations", or recommendations that are solely based on a user's social network, can be made without disclosing sensitive links in the social graph. More precisely, we quantify the loss in utility when existing recommendation algorithms are modified to satisfy a strong notion of privacy, called differential privacy. We prove lower bounds on the minimum loss in utility for any recommendation algorithm that is differentially private.

We then adapt two privacy preserving algorithms from the differential privacy literature to the problem of social recommendations, and analyze their performance in comparison to our lower bounds, both analytically and experimentally.

We show that good private social recommendations are feasible only for a small subset of the users in the social network or for a lenient setting of privacy parameters.
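The accuracy versus privacy trade-off can be illustrated with the exponential mechanism, a standard differentially private selection method. The abstract does not name the two algorithms the authors adapted, so this choice is an assumption, as are the utility function (a count of mutual friends), the sensitivity value of 1, and all names in the code. This is a minimal sketch, not the paper's method.

```python
import math
import random


def selection_probabilities(utilities, epsilon, sensitivity=1.0):
    """Exponential mechanism: P(c) is proportional to exp(eps * u(c) / (2 * sensitivity)).

    `utilities` maps each candidate recommendation to a utility score.
    The maximum utility is subtracted before exponentiating for
    numerical stability; this does not change the probabilities.
    """
    max_utility = max(utilities.values())
    weights = {
        candidate: math.exp(
            epsilon * (utility - max_utility) / (2.0 * sensitivity)
        )
        for candidate, utility in utilities.items()
    }
    total = sum(weights.values())
    return {candidate: w / total for candidate, w in weights.items()}


def private_recommendation(utilities, epsilon, sensitivity=1.0, rng=random):
    """Sample a single recommendation according to the mechanism."""
    probabilities = selection_probabilities(utilities, epsilon, sensitivity)
    candidates = list(probabilities)
    weights = [probabilities[c] for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical utilities: number of mutual friends with the target user.
mutual_friends = {"alice": 5, "bob": 2, "carol": 0}
for epsilon in (0.1, 1.0, 5.0):
    print(epsilon, selection_probabilities(mutual_friends, epsilon))
```

With a small epsilon (strong privacy) the output is close to uniform, so the recommendation carries little information about the social graph and is correspondingly less useful. With a large epsilon the highest-utility candidate dominates, which is the lenient privacy setting the abstract refers to.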

PODC 2011

A tight unconditional lower bound on distributed random walk computation

Atish Das Sarma, Danupon Nanongkai, Gopal Pandurangan

SIGIR 2011: 75-84

User behavior in zero-recall ecommerce queries

Gyanit Singh, Nish Parikh, Neel Sundaresan

User expectations and experience differ considerably between web search and eCommerce (product) search. Product descriptions are concise compared to typical web documents, and users have a more specific goal: finding the right product.

The difference between publisher and searcher vocabulary (in product search, the seller and buyer vocabulary), combined with the fact that there are fewer products to search over than web documents, results in an observable number of searches that return no results (zero recall searches).

In this paper we describe a study of zero recall searches. Our study is focused on eCommerce search and uses data from a leading eCommerce site's user click stream logs.

Our study makes three main contributions: 1) an analysis of the causes of zero recall searches; 2) a study of users' reactions to, and recovery from, zero recall; 3) a study of how power users and novice users differ in their behavior in response to zero recall searches.
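The kind of measurement involved can be sketched as below: counting zero recall searches in session logs and how often a later search in the same session returned results. The log format (ordered `(query, result_count)` pairs per session), the definition of "recovered", and the sample queries are assumptions made for illustration and are not taken from the study.

```python
def zero_recall_stats(sessions):
    """Count zero recall searches and simple in-session recoveries.

    `sessions` is a list of sessions; each session is an ordered list of
    (query, result_count) pairs. A zero recall search counts as
    "recovered" here if any later search in the same session returned
    at least one result (an assumed, simplified definition).
    """
    total = zero_recall = recovered = 0
    for searches in sessions:
        for index, (_query, result_count) in enumerate(searches):
            total += 1
            if result_count == 0:
                zero_recall += 1
                later = searches[index + 1:]
                if any(count > 0 for _, count in later):
                    recovered += 1
    return {
        "searches": total,
        "zero_recall": zero_recall,
        "recovered": recovered,
    }


# Hypothetical sample sessions.
sessions = [
    [("nikon d90 lense", 0), ("nikon d90 lens", 42)],
    [("ipod", 120)],
    [("vintge lamp", 0)],
]
stats = zero_recall_stats(sessions)
print(stats)
```

Splitting the same counts by user segment, for example power users versus novice users, would give the comparison described in the third contribution.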

interactions, Volume 18, Issue 1, 2011

Rat, rational, or seething cauldron of desire: designing the shopper

Elizabeth Churchill

In this article, I look at how our shopping experience, both online and offline, is designed.