We strongly believe in open source and giving to our community. We work directly with researchers in academia and seek out new perspectives with our intern and fellowship programs. We generalize our solutions and release them to the world as open source projects. We host discussions and publish our results.


Journal of the Association for Computing Machinery (JACM) – 2013

Distributed Random Walks

Atish Das Sarma, Danupon Nanongkai, Gopal Pandurangan, Prasad Tetali

Performing random walks in networks is a fundamental primitive that has found applications in many areas of computer science, including distributed computing. In this article, we focus on the problem of sampling random walks efficiently in a distributed network and its applications. Given bandwidth constraints, the goal is to minimize the number of rounds required to obtain random walk samples. All previous algorithms that compute a random walk sample of length ℓ as a subroutine always do so naively, that is, in O(ℓ) rounds.

The main contribution of this article is a fast distributed algorithm for performing random walks. We present a sublinear time distributed algorithm for performing random walks whose time complexity is sublinear in the length of the walk. Our algorithm performs a random walk of length ℓ in Õ(√ℓD) rounds (Õ hides polylog n factors where n is the number of nodes in the network) with high probability on an undirected network, where D is the diameter of the network. For small diameter graphs, this is a significant improvement over the naive O(ℓ) bound.

Furthermore, our algorithm is optimal within a poly-logarithmic factor as there exists a matching lower bound [Nanongkai et al. 2011]. We further extend our algorithms to efficiently perform k independent random walks in Õ(√kℓD + k) rounds. We also show that our algorithm can be applied to speedup the more general Metropolis-Hastings sampling. Our random-walk algorithms can be used to speed up distributed algorithms in applications that use random walks as a subroutine. We present two main applications.

First, we give a fast distributed algorithm for computing a random spanning tree (RST) in an arbitrary (undirected unweighted) network which runs in Õ(√mD) rounds with high probability (m is the number of edges). Our second application is a fast decentralized algorithm for estimating mixing time and related parameters of the underlying network. Our algorithm is fully decentralized and can serve as a building block in the design of topologically-aware networks.


Arrival and Departure Dynamics in Social Networks

Shaomei Wu, Atish Das Sarma, Alex Fabrikant, Silvio Lattanzi, Andrew Tomkins

In this paper, we consider the natural arrival and departure of users in a social network, and ask whether the dynamics of arrival, which have been studied in some depth, also explain the dynamics of departure, which are not as well studied.

Through study of the DBLP co-authorship network and a large online social network, we show that the dynamics of departure behave differently from the dynamics of formation.

In particular, the probability of departure of a user with few friends may be understood most accurately as a function of the raw number of friends who are active. For users with more friends, however, the probability of departure is best predicted by the overall fraction of the user's neighborhood that is active, independent of size.

We then study global properties of the sub-graphs induced by active and inactive users, and show that active users tend to belong to a core that is densifying and is significantly denser than the inactive users. Further, the inactive set of users exhibit a higher density and lower conductance than the degree distribution alone can explain. These two aspects suggest that nodes at the fringe are more likely to depart and subsequent departure are correlated among neighboring nodes in tightly-knit communities.

Workshop at WSDM-2014

Data Design for Personalization: Current Challenges and Emerging Opportunities

Elizabeth Churchill, Atish Das Sarma

Personalization is central to most Internet experiences. Personalization is a data-driven process, whether the data are explicitly gathered (e.g., by asking people to fill out forms) or implicitly (e.g. through analysis of behavioral data).

It is clear that designing for effective personalization poses interesting engineering and computer science challenges. However, personalization is also a user experience issue. We believe that encouraging dialogue and collaboration between data mining experts, content providers, and user-focused researchers will offer gains in the area of personalization for search and for other domains.

This workshop is part of a larger effort we are developing: D2D: Data to Design - Design to Data.

Our vision is to provide a forum for researchers and practitioners in computer and systems sciences, data sciences, machine learning, information retrieval, interaction and interface design, and human computer interaction to interact.

Our goal is to explore issues surrounding content and presentation personalization across different devices, and to set an agenda for cross-discipline, collaborative engagement.


On the Embeddability of Random Walk Distances

Adelbert Chang, Atish Das Sarma, Haitao Zheng, Ben Y.Zhao

Analysis of large graphs is critical to the ongoing growth of search engines and social networks. One class of queries centers around node affinity, often quantified by random-walk distances between node pairs, including hitting time, commute time, and personalized PageRank (PPR).

Despite the potential of these "metrics," they are rarely, if ever, used in practice, largely due to extremely high computational costs. In this paper, we investigate methods to scalably and efficiently compute random-walk distances, by "embedding" graphs and distances into points and distances in geometric coordinate spaces.

We show that while existing graph coordinate systems (GCS) can accurately estimate shortest path distances, they produce significant errors when embedding random-walk distances.

Based on our observations, we propose a new graph embedding system that explicitly accounts for per-node graph properties that affect random walk. Extensive experiments on a range of graphs show that our new approach can accurately estimate both symmetric and asymmetric random-walk distances.

Once a graph is embedded, our system can answer queries between any two nodes in 8 microseconds, orders of magnitude faster than existing methods. Finally, we show that our system produces estimates that can replace ground truth in applications with minimal impact on application output.

SIAM 2013

Probabilistic Combination of Classifier and Cluster Ensembles for Non-transductive Learning

Ayan Acharya, Eduardo R.Hruschka, Joydeep Ghosh, Badrul Sarwar, Jean-David Ruvini

Unsupervised models can provide supplementary soft constraints to help classify new target data under the assumption that similar objects in the target set are more likely to share the same class label. Such models can also help detect possible dierences between training and target distributions,

which is useful in applications where concept drift may take place. This paper describes a Bayesian frame work that takes as input class labels from existing classefiers (designed based on labeled data from the source domain),

as well as cluster labels from a cluster ensemble operating solely on the target data to be classified and yields a con-ensus labeling of the target data. This framework is particularly useful when the statistics of the target data drift or change from those of the training data.

We also show that the proposed framework is privacy-aware and allows performing distributed learning when data/models have sharing restrictions. Experiments show that our framework can yield superior results to those provided by applying classifier ensembles only.

In proceedings of The 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2013. 829-836. (Best Paper Award Winner)

Chelsea Won, and You Bought a T-shirt: Characterizing the Interplay Between Twitter and E-Commerce

Haipeng Zhang, Nish Parikh, Neel Sundaresan

The popularity of social media sites like Twitter and Facebook opens up interesting research opportunities for understanding the interplay of social media and e-commerce. Most research on online behavior, up until recently, has focused mostly on social media behaviors and e-commerce behaviors independently.

In our study we choose a particular global ecommerce platform (eBay) and a particular global social media platform (Twitter). We quantify the characteristics of the two individual trends as well as the correlations between them.

We provide evidences that about 5% of general eBay query streams show strong positive correlations with the corresponding Twitter mention streams, while the percentage jumps to around 25% for trending eBay query streams. Some categories of eBay queries, such as 'Video Games' and 'Sports', are more likely to have strong correlations.

We also discover that eBay trend lags Twitter for correlated pairs and the lag differs across categories. We show evidences that celebrities' popularities on Twitter correlate well with their relevant search and sales on eBay.

The correlations and lags provide predictive insights for future applications that might lead to instant merchandising opportunities for both sellers and e-commerce platforms.