Publications

We strongly believe in open source and giving back to our community. We work directly with researchers in academia and seek out new perspectives with our intern and fellowship programs. We generalize our solutions and release them to the world as open source projects. We host discussions and publish our results.

CMPSCI Technical Report, UM-CS-2009-008, University of Massachusetts, December 2008

Inference and Learning in Large Factor Graphs with Adaptive Proposal Distributions

Khashayar Rohanimanesh, Michael Wick, Andrew McCallum

Large templated factor graphs with complex structure that changes during inference have been shown to provide state-of-the-art experimental results in tasks such as identity uncertainty and information integration. However, inference and learning in these models are notoriously difficult.

This paper formalizes, analyzes and proves convergence for the SampleRank algorithm, which learns extremely efficiently by calculating approximate parameter estimation gradients from each proposed MCMC jump. Next we present a parameterized, adaptive proposal distribution, which greatly increases the number of accepted jumps.

We combine these methods in experiments on a real-world information extraction problem and demonstrate that the adaptive proposal distribution requires 27% fewer jumps than a more traditional proposer.
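The idea of learning from each proposed jump can be pictured with a small perceptron-style rank update: whenever the model's ranking of the current and proposed configurations disagrees with a ground-truth objective, the weights move toward the better configuration's features. The sketch below is illustrative only and is not the paper's formulation. The toy features, the accuracy objective, the greedy walk (used here in place of a proper MCMC acceptance rule), and all function names are assumptions.

```python
import random

def features(labels, observations):
    """Hypothetical toy features: [label/observation agreements, equal adjacent labels]."""
    agree = sum(1 for l, o in zip(labels, observations) if l == o)
    smooth = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
    return [float(agree), float(smooth)]

def score(theta, feats):
    """Linear model score of a configuration."""
    return sum(t * f for t, f in zip(theta, feats))

def accuracy(labels, truth):
    """Objective used to rank two configurations during training."""
    return sum(1 for l, t in zip(labels, truth) if l == t) / len(truth)

def train_on_jumps(observations, truth, steps=500, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    current = [rng.randint(0, 1) for _ in truth]
    for _ in range(steps):
        # Propose a jump: flip one randomly chosen binary label.
        i = rng.randrange(len(current))
        proposal = current[:]
        proposal[i] = 1 - proposal[i]

        f_cur = features(current, observations)
        f_prop = features(proposal, observations)
        objective_gain = accuracy(proposal, truth) - accuracy(current, truth)
        model_gain = score(theta, f_prop) - score(theta, f_cur)

        # Update only when the model fails to rank the better configuration higher.
        if objective_gain > 0 and model_gain <= 0:
            theta = [t + lr * (p - c) for t, p, c in zip(theta, f_prop, f_cur)]
        elif objective_gain < 0 and model_gain >= 0:
            theta = [t + lr * (c - p) for t, p, c in zip(theta, f_prop, f_cur)]

        # Greedy walk (an assumption): move only if the updated model prefers the proposal.
        if score(theta, f_prop) > score(theta, f_cur):
            current = proposal
    return theta, current

if __name__ == "__main__":
    truth = [0, 0, 1, 1, 1, 0, 1, 0]
    theta, final = train_on_jumps(observations=truth, truth=truth)
    print(theta, final)
```

Each update uses only the feature difference between two neighboring configurations, so no full inference pass is needed per parameter step.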

DMAI 2008, 11th IEEE International Conference on Computer and Information Technology, Khulna, Bangladesh (December 2008)

A Unifying Viewpoint of some Clustering Techniques Using Bregman Divergences and Extensions to Mixed Data Sets

Cecile Levasseur, Brandon Burdge, Ken Kreutz-Delgado, Uwe Mayer

We present a general viewpoint using Bregman divergences and exponential family properties that contains as special cases the following three algorithms: 1) exponential family Principal Component Analysis (exponential PCA), 2) Semi-Parametric exponential family Principal Component Analysis (SP-PCA) and 3) Bregman soft clustering. This framework is equivalent to a mixed data-type hierarchical Bayes graphical model assumption with latent variables constrained to a low-dimensional parameter subspace. We show that within this framework exponential PCA and SP-PCA are similar to the Bregman soft clustering technique with the addition of a linear constraint in the parameter space. We implement the resulting modifications to SP-PCA and Bregman soft clustering for mixed (continuous and/or discrete) data sets, and add a nonparametric estimation of the point-mass probabilities to exponential PCA. Finally, we compare the relative performances of the three algorithms in a clustering setting for mixed data sets.
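For readers unfamiliar with the term, a Bregman divergence generated by a convex function phi is D(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>. The minimal sketch below uses the standard definition and is not code from the paper. It shows how two familiar measures fall out of different choices of phi; the function names are assumptions made for illustration.

```python
import math

def bregman(phi, grad_phi, x, y):
    """Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner

# phi(v) = ||v||^2 generates the squared Euclidean distance.
def sq_norm(v):
    return sum(vi * vi for vi in v)

def sq_norm_grad(v):
    return [2.0 * vi for vi in v]

# phi(p) = sum p_i log p_i (negative entropy) generates the KL divergence
# when p and q are probability vectors with strictly positive entries.
def neg_entropy(p):
    return sum(pi * math.log(pi) for pi in p)

def neg_entropy_grad(p):
    return [math.log(pi) + 1.0 for pi in p]

if __name__ == "__main__":
    print(bregman(sq_norm, sq_norm_grad, [1.0, 2.0], [3.0, 5.0]))
    print(bregman(neg_entropy, neg_entropy_grad, [0.5, 0.5], [0.25, 0.75]))
```

Because each exponential family distribution corresponds to a particular Bregman divergence, swapping phi is how a single clustering routine can be adapted to continuous or discrete attributes.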

Submitted to the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2008

A Unified Approach for Schema Matching, Coreference and Canonicalization

Michael Wick, Khashayar Rohanimanesh, Karl Schultz, Andrew McCallum

The automatic consolidation of database records from many heterogeneous sources into a single repository requires solving several information integration tasks. Although tasks such as coreference, schema matching, and canonicalization are closely related, they are most commonly studied in isolation.

Systems that do tackle multiple integration problems traditionally solve each independently, allowing errors to propagate from one task to another. In this paper, we describe a discriminatively-trained model that reasons about schema matching, coreference, and canonicalization jointly.

We evaluate our model on a real-world data set of people and demonstrate that simultaneously solving these tasks reduces errors over a cascaded or isolated approach.

Our experiments show that a joint model is able to improve substantially over systems that solve each task either in isolation or with the conventional cascade. We demonstrate nearly a 50% error reduction for coreference and a 40% error reduction for schema matching.

Proceedings of the 1st International Workshop on Data Mining and Artificial Intelligence (DMAI 2008) held in conjunction with the 11th IEEE International Conference on Computer and Information Technology. Khulna, Bangladesh. December 2008

A Unifying Viewpoint of some Clustering Techniques Using Bregman Divergences and Extensions to Mixed Data Sets

Uwe Mayer, Cécile Levasseur, Brandon Burdge, Ken Kreutz-Delgado

We present a general viewpoint using Bregman divergences and exponential family properties that contains as special cases the following three algorithms: 1) exponential family Principal Component Analysis (exponential PCA), 2) Semi-Parametric exponential family Principal Component Analysis (SP-PCA) and 3) Bregman soft clustering.

This framework is equivalent to a mixed data-type hierarchical Bayes graphical model assumption with latent variables constrained to a low-dimensional parameter subspace. We show that within this framework exponential PCA and SP-PCA are similar to the Bregman soft clustering technique with the addition of a linear constraint in the parameter space.

We implement the resulting modifications to SP-PCA and Bregman soft clustering for mixed (continuous and/or discrete) data sets, and add a nonparametric estimation of the point-mass probabilities to exponential PCA. Finally, we compare the relative performances of the three algorithms in a clustering setting for mixed data sets.

In Proceedings of IEEE Pacific Visualization Symposium, IEEE VGTC, March 2008, pp. 175-182

MobiVis: A Visualization System for Exploring Mobile Data

Zeqian Shen, Kwan-Liu Ma

The widespread use of mobile devices brings opportunities to capture large-scale, continuous information about human behavior. Mobile data has tremendous value, leading to business opportunities, market strategies, security concerns, etc.

Visual analytics systems that support interactive exploration and discovery are needed to extract insight from the data. However, visual analysis of complex social-spatial-temporal mobile data presents several challenges.

We have created MobiVis, a visual analytics tool, which incorporates the idea of presenting social and spatial information in one heterogeneous network. The system supports temporal and semantic filtering through an interactive time chart and ontology graph, respectively, such that data subsets of interest can be isolated for close-up investigation.

"Behavior rings," a compact radial representation of individual and group behaviors, is introduced to allow easy comparison of behavior patterns. We demonstrate the capability of MobiVis with the results obtained from analyzing the MIT Reality Mining dataset.

Proceedings of the 2008 IAPR Workshop on Cognitive Information Processing. pp. 126-131. Santorini, Greece. 2008

Generalized Statistical Methods for Unsupervised Minority Class Detection in Mixed Data Sets

Cecile Levasseur, Uwe Mayer, Brandon Burdge, Ken Kreutz-Delgado

Minority class detection is the problem of detecting the occurrence of rare key events differing from the majority of a data set. This paper considers the problem of unsupervised minority class detection for multidimensional data that are highly non-Gaussian, mixed (continuous and/or discrete), noisy, and nonlinearly related, as occurs, for example, in fraud detection in typical financial data.

A statistical modeling approach is proposed which is a subclass of graphical model techniques. It exploits the properties of exponential family distributions and generalizes techniques from classical linear statistics into a framework referred to as Generalized Linear Statistics (GLS). The methodology exploits the split between the data space and the parameter space for exponential family distributions and solves a nonlinear problem by using classical linear statistical tools applied to data that has been mapped into the parameter space.

A fraud detection technique utilizing low-dimensional information learned by using an Iteratively Reweighted Least Squares (IRLS) based approach to GLS is proposed in the parameter space for data of mixed type. ROC curves for an initial simulation on synthetic data are presented, which give predictions for results on actual financial data sets.
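As background on the IRLS idea mentioned above, the sketch below fits an ordinary logistic regression (a classical generalized linear model, not the paper's GLS method) by repeatedly solving a weighted least-squares problem. It assumes a single feature plus an intercept so the 2x2 normal equations can be solved by hand; the function name and the toy data are hypothetical.

```python
import math

def irls_logistic(xs, ys, iterations=25):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by iteratively reweighted least squares."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Accumulators for the weighted normal equations with design rows [1, x].
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for x, y in zip(xs, ys):
            eta = b0 + b1 * x                      # linear predictor (parameter space)
            mu = 1.0 / (1.0 + math.exp(-eta))      # mean (data space)
            w = max(mu * (1.0 - mu), 1e-10)        # IRLS weight, floored for stability
            z = eta + (y - mu) / w                 # working response
            s_w += w
            s_wx += w * x
            s_wxx += w * x * x
            s_wz += w * z
            s_wxz += w * x * z
        # Solve [[s_w, s_wx], [s_wx, s_wxx]] @ [b0, b1] = [s_wz, s_wxz].
        det = s_w * s_wxx - s_wx * s_wx
        b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        b1 = (s_w * s_wxz - s_wx * s_wz) / det
    return b0, b1

if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [0, 0, 1, 0, 1, 1]  # not linearly separable, so the estimate stays finite
    print(irls_logistic(xs, ys))
```

Each pass is a plain linear regression on the working response, which illustrates how a nonlinear fitting problem can be handled with classical linear tools once the data are mapped into the parameter space.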

Proceedings of the International Conference on Electronic Commerce (ICEC 2009), 2009

Discovering Clues for Review Quality from Author’s Behaviors on E-commerce Sites

Shen Huang, Dan Shen, Wei Feng, Yongzheng Zhang, Catherine Baudin

With the number of online reviews growing rapidly, it is increasingly difficult to digest all the information within limited time. To help users efficiently get concise information about a product, researchers have studied algorithms for automated opinion summarization.

However, users might expect to further read detailed high-quality reviews in addition to a review outline. This raises another interesting problem not well studied yet: how to discover high-quality product reviews? Previous research examined various properties of a product review to predict its quality.

In this paper, we further explore this topic by incorporating another information resource: the behavior of review authors in an e-commerce community. First, we perform a high-level analysis on two kinds of data: product reviews and deal transactions. According to the results of this analysis, three features, including personal reputation, seller degree and expertise degree, are studied to assess the quality of a review from a credibility and expertise perspective.

Our analysis shows that these features are strongly related to review quality and that they can help uncover review spamming by sellers. Furthermore, we propose a simulation model based on the above findings. The model is able to generate the basic properties of the review community, especially when the above three features are taken into account.
