We strongly believe in open source and giving back to our community. We work directly with researchers in academia and seek out new perspectives through our intern and fellowship programs. We generalize our solutions and release them to the world as open source projects. We host discussions and publish our results.

PVLDB 2010

Regret-Minimizing Representative Databases

Atish Das Sarma, Danupon Nanongkai, Ashwin Lall, Richard J. Lipton, Jim Xu

We propose the k-representative regret minimization query (k-regret) as an operation to support multi-criteria decision making. Like top-k, the k-regret query assumes that users have some utility or scoring functions; however, it never asks the users to provide such functions.

Like skyline, it filters out a set of interesting points from a potentially large database based on the users' criteria; however, it never overwhelms the users by outputting too many tuples.

In particular, for any number k and any class of utility functions, the k-regret query outputs k tuples from the database and tries to minimize the maximum regret ratio. This captures how disappointed a user could be had she seen k representative tuples instead of the whole database. We focus on the class of linear utility functions, which is widely applicable.
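Concretely, for a database D, a size-k output set S ⊆ D, and a class F of utility functions, the quantity being minimized can be formalized as follows (a standard formalization consistent with the abstract; the notation is ours):

```latex
% Regret ratio of S under a single utility f, and the maximum regret ratio over a class F:
\mathrm{rr}_D(S, f) = \frac{\max_{p \in D} f(p) \;-\; \max_{p \in S} f(p)}{\max_{p \in D} f(p)},
\qquad
\mathrm{rr}_D(S, \mathcal{F}) = \sup_{f \in \mathcal{F}} \, \mathrm{rr}_D(S, f).
```

The k-regret query returns a set S with |S| = k that (approximately) minimizes rr_D(S, F), where F is the class of linear utility functions.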

The first challenge of this approach is that it is not clear if the maximum regret ratio would be small, or even bounded. We answer this question affirmatively. Theoretically, we prove that the maximum regret ratio can be bounded and this bound is independent of the database size.

Moreover, our extensive experiments on real and synthetic datasets suggest that in practice the maximum regret ratio is reasonably small. Additionally, the algorithms developed in this paper are practical: they run in time linear in the size of the database, and our experiments show that their running time is small when they are run on top of the skyline operation, which means that these algorithms could be integrated into current database systems.
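As an illustration only, a naive baseline (not one of the paper's algorithms) can approximate the objective by sampling random linear utilities and greedily growing the representative set:

```python
import random

def sample_linear_utilities(dim, n_utilities, seed=0):
    """Sample random nonnegative linear utilities (weight vectors)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(n_utilities)]

def utility(weights, point):
    return sum(w * x for w, x in zip(weights, point))

def empirical_max_regret_ratio(db, subset, utilities):
    """Empirical maximum regret ratio of `subset` over the sampled utilities."""
    worst = 0.0
    for w in utilities:
        best_db = max(utility(w, p) for p in db)
        best_sub = max(utility(w, p) for p in subset)
        if best_db > 0:
            worst = max(worst, (best_db - best_sub) / best_db)
    return worst

def greedy_k_regret(db, k, n_utilities=1000):
    """Greedily grow a k-point set, each step minimizing empirical regret."""
    utilities = sample_linear_utilities(len(db[0]), n_utilities)
    subset = []
    for _ in range(min(k, len(db))):
        best = min((p for p in db if p not in subset),
                   key=lambda p: empirical_max_regret_ratio(db, subset + [p], utilities))
        subset.append(best)
    return subset
```

This sketch is only meant to make the objective tangible; per the abstract, the paper's algorithms run in linear time and come with provable bounds on the maximum regret ratio.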

WSDM 2010

A sketch-based distance oracle for web-scale graphs

Atish Das Sarma, Sreenivas Gollapudi, Marc Najork, Rina Panigrahy

We study the fundamental problem of computing distances between nodes in large graphs such as the web graph and social networks. Our objective is to be able to answer distance queries between pairs of nodes in real time.

Since standard shortest-path algorithms are expensive, our approach moves the time-consuming shortest-path computation offline, and at query time only looks up precomputed values and performs simple, fast computations on them.

More specifically, during the offline phase we compute and store a small "sketch" for each node in the graph, and at query-time we look up the sketches of the source and destination nodes and perform a simple computation using these two sketches to estimate the distance.
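A minimal sketch of this offline/online split, assuming unweighted graphs, BFS distances, and geometrically growing random seed sets (these specifics are our assumptions for illustration, not necessarily the exact published scheme):

```python
import random
from collections import deque

def bfs_distances(adj, sources):
    """Multi-source BFS; returns, for each reached node, (distance, nearest source)."""
    dist = {s: (0, s) for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        d, seed = dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = (d + 1, seed)
                queue.append(v)
    return dist

def build_sketches(adj, n_levels=None, seed=0):
    """Offline phase: for random seed sets of sizes 1, 2, 4, ..., store each
    node's nearest seed and the distance to it."""
    rng = random.Random(seed)
    nodes = list(adj)
    n_levels = n_levels or max(1, len(nodes).bit_length())
    sketches = {u: [] for u in nodes}
    for i in range(n_levels):
        size = min(2 ** i, len(nodes))
        dist = bfs_distances(adj, rng.sample(nodes, size))
        for u in nodes:
            if u in dist:
                d, w = dist[u]
                sketches[u].append((w, d))
    return sketches

def estimate_distance(sketch_u, sketch_v):
    """Query phase: combine the two sketches via their common seeds."""
    d_to_u = dict((w, d) for w, d in sketch_u)
    estimates = [d + d_to_u[w] for w, d in sketch_v if w in d_to_u]
    return min(estimates) if estimates else None
```

Whenever the two sketches share a seed w, the estimate d(u, w) + d(w, v) is the length of an actual path, so the query returns an upper bound on the true distance.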

TCS 2011 (invited as one of the best papers of TAMC 2009)

Best-order streaming model

Atish Das Sarma, Richard J. Lipton, Danupon Nanongkai

We study a new model of computation, called the best-order streaming model, for graph problems. Roughly, it is a proof system where a space-limited verifier has to verify a proof sequentially (i.e., it reads the proof as a stream). Moreover, the proof itself is just a specific ordering of the input data.

This model is closely related to many models of computation in other areas such as data streams, communication complexity, and proof checking, and could be used in applications such as cloud computing.

In this paper we focus on graph problems where the input is a sequence of edges. We show that even under this model, checking some basic graph properties deterministically requires linear space in the number of nodes. We also show that, in contrast with this, randomized verifiers are powerful enough to check many graph properties in polylogarithmic space.
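To give a flavor of why randomization helps, here is a standard polylog-space primitive used by randomized streaming verifiers (illustrative only, not the construction from the paper): a random fingerprint that decides, with one pass over each stream and O(log P) bits of state, whether two streams contain the same multiset of items.

```python
import random

P = (1 << 61) - 1  # a large prime; fingerprints live in the field Z_P

def fingerprint(stream, x):
    """One-pass fingerprint of a multiset: the product of (x - item) mod P.
    Two multisets are equal iff their fingerprint polynomials are identical,
    so evaluating at a random point detects inequality with high probability."""
    fp = 1
    for item in stream:
        fp = fp * ((x - hash(item)) % P) % P
    return fp

def permutation_check(input_stream, proof_stream, seed=0):
    """Verify (w.h.p.) that proof_stream is a reordering of input_stream,
    keeping only O(log P) bits of state per stream."""
    x = random.Random(seed).randrange(P)
    return fingerprint(input_stream, x) == fingerprint(proof_stream, x)
```

Primitives like this let a space-limited verifier cross-check differently ordered presentations of the same data without storing either stream.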

Proceedings of the International Conference on Electronic Commerce (EC) 2009

Improving Product Review Search Experiences in General Search Engines

Shen Huang, Dan Shen, Wei Feng, Catherine Baudin, Yongzheng Zhang

In the Web 2.0 era, internet users contribute a large amount of online content; product reviews are a good example. Since these reviews are scattered across shopping sites, weblogs, forums, etc., most people rely on general search engines to discover and digest others' comments. While conventional search engines work well in many situations, they are not sufficient for gathering such information.

The reasons include but are not limited to: 1) the ranking strategy does not incorporate product reviews' inherent characteristics, e.g., sentiment orientation; 2) the snippets are neither indicative nor descriptive of user opinions. In this paper, we propose a feasible solution to enhance the experience of product review search.

Based on this approach, a system named "Improved Product Review Search" (IPRS) is implemented on top of a general search engine. Given a query on a product, our system is capable of: 1) automatically identifying user opinion segments in a whole article; 2) ranking opinions by incorporating both the sentiment orientation and the topics expressed in reviews; 3) generating readable review snippets that indicate user sentiment orientations; 4) enabling easy comparison of products through a visualization of opinions.
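As a hypothetical illustration of capability 2 (a toy scoring rule of our own; the abstract does not specify the actual IPRS ranking model), a ranker might blend query relevance with the strength of the expressed opinion:

```python
def review_score(relevance, sentiment, alpha=0.7):
    """Toy ranking score blending query relevance with sentiment orientation.

    relevance: query-document relevance in [0, 1] (e.g., normalized BM25)
    sentiment: sentiment orientation in [-1, 1] (negative .. positive)
    alpha:     weight on relevance vs. opinion strength (assumed, tunable)
    """
    opinion_strength = abs(sentiment)  # strong opinions rank higher either way
    return alpha * relevance + (1 - alpha) * opinion_strength

def rank_reviews(reviews):
    """reviews: list of (review_id, relevance, sentiment) tuples."""
    return sorted(reviews, key=lambda r: review_score(r[1], r[2]), reverse=True)
```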

Results of both a usability study and an automatic evaluation show that our system helps users quickly understand product reviews within a limited time.

Neural Information Processing Systems (NIPS) 2009

Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

Michael Wick, Khashayar Rohanimanesh, Sameer Singh, Andrew McCallum

Large, relational factor graphs with structure defined by first-order logic or other languages give rise to notoriously difficult inference problems. Because unrolling the structure necessary to represent distributions over all hypotheses leads to exponential blow-up, solutions are often derived from MCMC.

However, because of limitations in the design and parameterization of the jump function, these sampling-based methods suffer from local minima: the system must transition through lower-scoring configurations before arriving at a better MAP solution. This paper presents a new method of explicitly selecting fruitful downward jumps by leveraging reinforcement learning (RL).

Rather than setting parameters to maximize the likelihood of the training data, the parameters of the factor graph are treated as a log-linear function approximator and learned with temporal difference (TD) methods; MAP inference is performed by executing the resulting policy on held-out test data.
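A minimal sketch of this learning step: generic TD(0) with a linear function approximator (the feature interface, reward, and hyperparameters are our assumptions for illustration, not the paper's code):

```python
def td0_update(theta, phi_s, phi_s_next, reward, alpha=0.01, gamma=0.9):
    """One TD(0) update for a linear value function V(s) = theta . phi(s).

    theta:      parameter vector (shared with the factor graph's log-linear model)
    phi_s:      feature vector of the current configuration
    phi_s_next: feature vector after the proposed jump/action
    reward:     e.g., change in model score toward the MAP configuration
    """
    v_s = sum(t * f for t, f in zip(theta, phi_s))
    v_next = sum(t * f for t, f in zip(theta, phi_s_next))
    td_error = reward + gamma * v_next - v_s
    return [t + alpha * td_error * f for t, f in zip(theta, phi_s)]
```

In this setting, phi_s and phi_s_next differ only on factors in the neighborhood of the variables changed by the jump, so each update can be computed from that local neighborhood; this is the efficiency point made in the next paragraph.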

Our method allows efficient gradient updates, since only factors in the neighborhood of variables affected by an action need to be computed; we bypass the need to compute marginals entirely. Our method yields dramatic empirical success, producing new state-of-the-art results on a complex joint model of ontology alignment, with a 48% reduction in error over the previous state of the art in that domain.

SIAM International Conference on Data Mining 2009 (SDM09)

An Integrated Model for Coreference Resolution and Canonicalization

Michael Wick, Aron Culotta, Khashayar Rohanimanesh, Andrew McCallum

Recently, many advanced machine learning approaches have been proposed for coreference resolution; however, all of the discriminatively-trained models reason over mentions rather than entities. That is, they do not explicitly contain variables indicating the “canonical” values for each attribute of an entity (e.g., name, venue, title, etc.).

This canonicalization step is typically implemented as a post-processing routine, applied after coreference resolution and before the extracted entity is added to a database. In this paper, we propose a discriminatively-trained model that jointly performs coreference resolution and canonicalization, enabling features over hypothesized entities.
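To make "features over hypothesized entities" concrete, here is a toy sketch (the attribute names and the majority-vote rule are our assumptions for illustration, not the paper's model):

```python
from collections import Counter

def canonicalize(mentions, attribute):
    """Pick a canonical value for an attribute by majority vote over a
    hypothesized entity cluster (a toy rule; the joint model learns this)."""
    values = [m[attribute] for m in mentions if m.get(attribute)]
    return Counter(values).most_common(1)[0][0] if values else None

def entity_level_feature(mentions, attribute):
    """Fraction of mentions agreeing with the canonical value: one example of
    a feature available only when reasoning over entities, not mentions."""
    canon = canonicalize(mentions, attribute)
    if canon is None:
        return 0.0
    return sum(1 for m in mentions if m.get(attribute) == canon) / len(mentions)

# Example: three citation mentions of one paper, with a typo in one title.
mentions = [{"title": "Best-order streaming model"},
            {"title": "Best order streaming model"},
            {"title": "Best-order streaming model"}]
print(entity_level_feature(mentions, "title"))  # 0.666...
```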

We validate our approach on two different coreference problems: newswire anaphora resolution and research paper citation matching, demonstrating improvements in both tasks and achieving an error reduction of up to 62% when compared to a method that reasons about mentions only.

Human Organization, July 2009, Volume 68, Issue 2, pp. 206-217

Information Flows in a Gallery-Work-Entertainment Space: The Effect of a Digital Bulletin Board on Social Encounters

Elizabeth Churchill, Les Nelson

Digital media displays are increasingly common in public spaces. Typically, these are minimally interactive and predominantly function as signage or advertisements. However, in our work we have been exploring how digital media public displays can be designed to facilitate community content sharing in civic buildings, in organizations, and at social gatherings like conferences.

While most of our installations have been within fairly formal, professional settings, in this paper we address the impact of a digital community display on interactions between the inhabitants of a neighborhood art gallery and café.

We describe the location, the display itself, and the underlying content distribution and publication infrastructure. Findings from qualitative and quantitative analyses before and after the installation demonstrate that patrons easily adopted use of the display, which was used frequently to find out more about café/gallery events and for playful exchanges.

However, despite the enthusiasm of patrons and café staff, the café owners were wary of maintaining or extending the technology. We speculate on the reasons for this reticence and on their implications for the design of services and technologies for public spaces.