We strongly believe in open source and giving to our community. We work directly with researchers in academia and seek out new perspectives with our intern and fellowship programs. We generalize our solutions and release them to the world as open source projects. We host discussions and publish our results.


SIAM International Conference on Data Mining 2009 (SDM09)

An Integrated Model for Coreference Resolution and Canonicalization

Michael Wick, Aron Culotta, Khashayar Rohanimanesh, Andrew McCallum

Recently, many advanced machine learning approaches have been proposed for coreference resolution; however, all of the discriminatively-trained models reason over mentions rather than entities. That is, they do not explicitly contain variables indicating the “canonical” values for each attribute of an entity (e.g., name, venue, title, etc.).

This canonicalization step is typically implemented as a post-processing routine to coreference resolution prior to adding the extracted entity to a database. In this paper, we propose a discriminatively-trained model that jointly performs coreference resolution and canonicalization, enabling features over hypothesized entities.

We validate our approach on two different coreference problems: newswire anaphora resolution and research paper citation matching, demonstrating improvements in both tasks and achieving an error reduction of up to 62% when compared to a method that reasons about mentions only.
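To make the mention/entity distinction above concrete, here is a minimal sketch of the kind of post-processing canonicalization the abstract contrasts with. It assumes coreference has already grouped mentions into a cluster and simply picks the most frequent value per attribute. The `canonicalize` function, the attribute names, and the sample data are hypothetical illustrations, not the paper's joint model.

```python
from collections import Counter
from typing import Dict, List

# A mention is a bag of attribute -> surface string (hypothetical schema).
Mention = Dict[str, str]


def canonicalize(cluster: List[Mention]) -> Dict[str, str]:
    """Pick one canonical value per attribute by majority vote.

    This is a simple post-processing baseline, assumed here for
    illustration; it is NOT the jointly trained model from the paper.
    """
    counts: Dict[str, Counter] = {}
    for mention in cluster:
        for attr, value in mention.items():
            counts.setdefault(attr, Counter())[value] += 1
    # most_common(1) returns the highest-count value; ties keep the
    # value that was seen first.
    return {attr: c.most_common(1)[0][0] for attr, c in counts.items()}


# Three mentions already resolved to the same (made-up) entity.
cluster = [
    {"name": "J. Smith", "venue": "SDM"},
    {"name": "John Smith", "venue": "SDM"},
    {"name": "John Smith", "venue": "SIAM Data Mining"},
]
canonical = canonicalize(cluster)
print(canonical)
```

Because this baseline runs only after clustering, the clustering decision cannot use the canonical values as features, which is the limitation the joint model is described as addressing.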

Human Organization, July 2009, Volume 68, Issue 2, pp. 206-217

Information Flows in a Gallery-Work-Entertainment Space: The Effect of a Digital Bulletin Board on Social Encounters

Elizabeth Churchill, Les Nelson

Digital media displays are increasingly common in public spaces. Typically, these are minimally interactive and predominantly function as signage or advertisements. However, in our work we have been exploring how digital media public displays can be designed to facilitate community content sharing in civic buildings, in organizations, and at social gatherings like conferences.

While most of our installations have been within fairly formal, professional settings, in this paper we address the impact of a digital community display on interactions between the inhabitants of a neighborhood art gallery and café.

We describe the location, the display itself, and the underlying content distribution and publication infrastructure. Findings from qualitative and quantitative analyses before and after the installation demonstrate that patrons easily adopted use of the display, which was used frequently to find out more about café/gallery events and for playful exchanges.

However, despite the enthusiasm of patrons and café staff, the café owners were wary of maintaining or extending the technology. We speculate on this reticence in terms of potential for services and technologies in public space technology design.

Proceedings of the International Conference on Electronic Commerce – EC2009. 2009

Improving Product Review Search Experiences in General Search Engines

Shen Huang, Dan Shen, Wei Feng, Catherine Baudin, Yongzheng Zhang

In the Web 2.0 era, internet users contribute a large amount of online content. Product reviews are a good example. Because reviews are scattered across shopping sites, weblogs, forums, etc., most people have to rely on general search engines to discover and digest others' comments. While conventional search engines work well in many situations, they are not sufficient for gathering such information.

The reasons include but are not limited to: 1) the ranking strategy does not incorporate product reviews' inherent characteristics, e.g., sentiment orientation; 2) the snippets are neither indicative nor descriptive of user opinions. In this paper, we propose a feasible solution to enhance the experience of product review search.

Based on this approach, a system named "Improved Product Review Search (IPRS)" is implemented on top of a general search engine. Given a query on a product, our system is capable of: 1) automatically identifying user opinion segments in a whole article; 2) ranking opinions by incorporating both the sentiment orientation and the topics expressed in reviews; 3) generating readable review snippets to indicate user sentiment orientations; 4) supporting easy comparison of products based on a visualization of opinions.

Results from both a usability study and an automatic evaluation show that our system helps users understand product reviews quickly.

Neural Information Processing Systems (NIPS), 2009

Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

Michael Wick, Khashayar Rohanimanesh, Sameer Singh, Andrew McCallum

Large, relational factor graphs with structure defined by first-order logic or other languages give rise to notoriously difficult inference problems. Because unrolling the structure necessary to represent distributions over all hypotheses has exponential blow-up, solutions are often derived from MCMC.

However, because of limitations in the design and parameterization of the jump function, these sampling-based methods suffer from local minima: the system must transition through lower-scoring configurations before arriving at a better MAP solution. This paper presents a new method of explicitly selecting fruitful downward jumps by leveraging reinforcement learning (RL).

Rather than setting parameters to maximize the likelihood of the training data, parameters of the factor graph are treated as a log-linear function approximator and learned with methods of temporal difference (TD); MAP inference is performed by executing the resulting policy on held out test data.

Our method allows efficient gradient updates since only factors in the neighborhood of variables affected by an action need to be computed—we bypass the need to compute marginals entirely. Our method yields dramatic empirical success, producing new state-of-the-art results on a complex joint model of ontology alignment, with a 48% reduction in error over state-of-the-art in that domain.

PVLDB 2009

Randomized Multi-pass Streaming Skyline Algorithms

Atish Das Sarma, Ashwin Lall, Danupon Nanongkai, Jim Xu

We consider external algorithms for skyline computation without pre-processing. Our goal is to develop an algorithm with a good worst case guarantee while performing well on average.

Due to the nature of disks, it is desirable that such algorithms access the input as a stream (even if in multiple passes). Using randomization, which has proved useful in many applications, we present an efficient multi-pass streaming algorithm, RAND, for skyline computation. As far as we are aware, RAND is the first randomized skyline algorithm in the literature.

RAND is near-optimal for the streaming model, which we prove via a simple lower bound. Additionally, our algorithm is distributable and can handle partially ordered domains on each attribute.

Finally, we demonstrate the robustness of RAND via extensive experiments on both real and synthetic datasets. RAND is comparable to existing algorithms in the average case and is additionally tolerant of simple modifications of the data, whereas other algorithms degrade considerably under such variation.
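For readers unfamiliar with the term, the skyline of a set of points is the subset not dominated by any other point. The sketch below computes it with a plain single-pass, in-memory window scan, assuming smaller values are better on every attribute and that domains are totally ordered. It is not the RAND algorithm: it is neither randomized nor external-memory, and the function names and sample data are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if a is no worse on every attribute and strictly
    better on at least one (assumption: smaller is better)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def skyline(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, in order of first appearance.

    A simple window scan for illustration only; NOT the RAND
    algorithm described in the abstract.
    """
    window: List[Point] = []
    for p in points:
        # Skip p if some point already in the window dominates it.
        if any(dominates(w, p) for w in window):
            continue
        # Otherwise evict window points that p dominates, then keep p.
        window = [w for w in window if not dominates(p, w)]
        window.append(p)
    return window


# Hypothetical (price, distance) pairs; lower is better on both.
hotels = [(50.0, 8.0), (60.0, 2.0), (80.0, 1.0), (70.0, 5.0), (90.0, 9.0)]
result = skyline(hotels)
print(result)  # [(50.0, 8.0), (60.0, 2.0), (80.0, 1.0)]
```

The window here can grow as large as the skyline itself and is held entirely in memory, which is the kind of cost that motivates streaming, multi-pass approaches for large inputs.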

Proceedings of the Nineteenth IEEE Int’l Workshop on Machine Learning for Signal Processing, Grenoble, France. September 2009

Classifying non-Gaussian and Mixed Data Sets in their Natural Parameter Space

Cécile Levasseur, Uwe Mayer, Ken Kreutz-Delgado

We consider the problem of both supervised and unsupervised classification for multidimensional data that are non-Gaussian and of mixed types (continuous and/or discrete). An important subclass of graphical model techniques called Generalized Linear Statistics (GLS) is used to capture the underlying statistical structure of these complex data.

GLS exploits the properties of exponential family distributions, which are assumed to describe the data components, and constrains latent variables to a lower dimensional parameter subspace.

Based on the latent variable information, classification is performed in the natural parameter subspace with classical statistical techniques. The benefits of decision making in parameter space are illustrated with examples of text categorization on categorical data and classification of mixed-type data.

As a text document preprocessing tool, we present an extension of the feature selection algorithm based on conditional mutual information maximization from binary to categorical data.