We strongly believe in open source and giving to our community. We work directly with researchers in academia and seek out new perspectives with our intern and fellowship programs. We generalize our solutions and release them to the world as open source projects. We host discussions and publish our results.


Journal of Machine Learning Research (JMLR), Volume 6, March 2005

A finite Newton method for fast solution of large scale linear SVMs

Sathiya Keerthi, Dennis DeCoste

This paper develops a fast method for solving linear SVMs with L2 loss function that is suited for large scale data mining tasks such as text classification. This is done by modifying the finite Newton method of Mangasarian in several ways.

Experiments indicate that the method is much faster than decomposition methods such as SVMlight, SMO and BSVM (e.g., 4-100 fold), especially when the number of examples is large. The paper also suggests ways of extending the method to other loss functions such as the modified Huber's loss function and the L1 loss function, and also for solving ordinal regression.
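
A minimal NumPy sketch of the underlying idea (not the authors' implementation; the toy data, the backtracking line search, and all names are illustrative): the L2-loss linear SVM objective f(w) = 0.5*||w||^2 + 0.5*C*sum_i max(0, 1 - y_i w.x_i)^2 is piecewise quadratic, so a Newton step restricted to the currently margin-violating examples, followed by a line search, converges in a finite number of iterations.

```python
import numpy as np

def l2svm_finite_newton(X, y, C=1.0, max_iter=50, tol=1e-6):
    """Sketch of a finite Newton method for the L2-loss linear SVM:
    f(w) = 0.5*||w||^2 + 0.5*C*sum_i max(0, 1 - y_i * w.x_i)^2."""
    n, d = X.shape
    w = np.zeros(d)

    def f(v):                                   # exact objective, used by the line search
        m = np.maximum(0.0, 1.0 - y * (X @ v))
        return 0.5 * v @ v + 0.5 * C * m @ m

    for _ in range(max_iter):
        margins = 1.0 - y * (X @ w)
        active = margins > 0                    # examples currently incurring loss
        grad = w - C * X[active].T @ (y[active] * margins[active])
        if np.linalg.norm(grad) < tol:
            break
        # Newton system restricted to the active set: (I + C*Xa'Xa) step = -grad
        Xa = X[active]
        step = np.linalg.solve(np.eye(d) + C * Xa.T @ Xa, -grad)
        t, f0 = 1.0, f(w)                       # backtracking line search
        while f(w + t * step) > f0 and t > 1e-8:
            t *= 0.5
        w = w + t * step
    return w

# toy usage on roughly separable data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])
w = l2svm_finite_newton(X, y, C=10.0)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```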

Conference Record of the Thirty-Ninth Asilomar Conference on Signals, Systems and Computers. pp. 545-549. 2005

Data-Pattern Discovery Methods for Detection in Nongaussian High-dimensional Data Sets

Cécile Levasseur, Ken Kreutz-Delgado, Uwe Mayer, Gregory Gancarz

Many important expert system applications depend on the ability to accurately detect or predict the occurrence of key events given a data set of observations. We concentrate on multidimensional data that are highly nongaussian (continuous and/or discrete), noisy and nonlinearly related.

We investigate the feasibility of data-pattern discovery and event detection by applying generalized principal component analysis (GPCA) techniques for pattern extraction based on an exponential family probability distribution assumption.

We develop theoretical extensions of the GPCA model by exploiting results from the theory of generalized linear models and nonparametric mixture density estimation.
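
As a rough illustration of the exponential-family idea behind GPCA (this is only the Bernoulli special case fit by plain gradient ascent; the paper's model is more general and also brings in generalized linear model theory and nonparametric mixture density estimation), one can fit low-rank natural parameters to binary data by maximizing the Bernoulli log-likelihood:

```python
import numpy as np

def logistic_pca(X, k=2, n_iter=500, lr=0.1):
    """Sketch of exponential-family PCA for binary data (Bernoulli case):
    model the natural parameters as Theta = U V^T and maximize
    sum_ij [ X_ij * Theta_ij - log(1 + exp(Theta_ij)) ]."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    U = 0.01 * rng.standard_normal((n, k))    # latent scores
    V = 0.01 * rng.standard_normal((d, k))    # loadings
    for _ in range(n_iter):
        Theta = U @ V.T
        P = 1.0 / (1.0 + np.exp(-Theta))      # Bernoulli means
        R = X - P                             # gradient of log-likelihood w.r.t. Theta
        U += lr * (R @ V) / d
        V += lr * (R.T @ U) / n
    return U, V

# toy usage: 0/1 data with a planted 2-dimensional structure
rng = np.random.default_rng(1)
Z = rng.standard_normal((200, 2)) @ rng.standard_normal((2, 10))
X = (rng.random((200, 10)) < 1.0 / (1.0 + np.exp(-Z))).astype(float)
U, V = logistic_pca(X, k=2)
Theta = U @ V.T
print("mean negative log-likelihood per entry:",
      round(float(np.mean(np.log1p(np.exp(Theta)) - X * Theta)), 3))
```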

The Eighteenth Annual Conference on Neural Information Processing Systems (NIPS04), Vancouver, B.C., Canada, December 2004

Coarticulation in Markov Decision Processes

Khashayar Rohanimanesh, Robert Platt, Sridhar Mahadevan, Roderic Grupen

We investigate an approach for simultaneously committing to multiple activities, each modeled as a temporally extended action in a semi-Markov decision process (SMDP). For each activity we define a set of admissible solutions consisting of the redundant set of optimal policies, and those policies that ascend the optimal state-value function associated with them.

A plan is then generated by merging them in such a way that the solutions to the subordinate activities are realized in the set of admissible solutions satisfying the superior activities. We present our theoretical results and empirically evaluate our approach in a simulated domain.
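
A toy illustration of the admissibility criterion (a hypothetical chain MDP, not the paper's SMDP setting): besides the optimal actions, an action is admissible at a state if taking it ascends the optimal state-value function, i.e. moves to a state of strictly higher value.

```python
import numpy as np

# Toy deterministic chain MDP: states 0..N-1, goal at state N-1, actions left/right.
N, gamma = 6, 0.9
actions = {"left": -1, "right": +1}

def step(s, a):
    return min(N - 1, max(0, s + actions[a]))

def reward(s, a, s2):
    return 1.0 if s2 == N - 1 and s != N - 1 else 0.0

# value iteration for V*; the goal state is absorbing with value 0
V = np.zeros(N)
for _ in range(100):
    V_new = np.zeros(N)
    for s in range(N - 1):
        V_new[s] = max(reward(s, a, step(s, a)) + gamma * V[step(s, a)] for a in actions)
    V = V_new

def admissible(s, eps=1e-9):
    """Optimal actions at s, plus any action that ascends V* (its successor state
    has strictly higher optimal value than s itself)."""
    q = {a: reward(s, a, step(s, a)) + gamma * V[step(s, a)] for a in actions}
    best = max(q.values())
    return {a for a in actions if q[a] >= best - eps or V[step(s, a)] > V[s] + eps}

for s in range(N - 1):
    print("state", s, "admissible actions:", sorted(admissible(s)))
```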

The Twenty-First International Conference on Machine Learning (ICML04), Banff, Canada, July 4-8, 2004

Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data

Charles Sutton, Khashayar Rohanimanesh, Andrew McCallum

In sequence modeling, we often wish to represent complex interactions between labels, such as when performing multiple, cascaded labeling tasks on the same sequence, or when long-range dependencies exist.

We present dynamic conditional random fields (DCRFs), a generalization of linear-chain conditional random fields (CRFs) in which each time slice contains a set of state variables and edges—a distributed state representation as in dynamic Bayesian networks (DBNs)—and parameters are tied across slices.

Since exact inference can be intractable in such models, we perform approximate inference using several schedules for belief propagation, including tree-based reparameterization (TRP). On a natural-language chunking task, we show that a DCRF performs better than a series of linear-chain CRFs, achieving comparable performance using only half the training data.
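
To make the distributed-state structure concrete, here is a toy sketch (label-set sizes, weight tables, and the brute-force normalization are purely illustrative; such exact sums are exponential in practice, which is why the paper resorts to approximate belief propagation schedules such as TRP): each time slice carries two label variables, and the same weight tables are reused at every slice.

```python
import numpy as np
from itertools import product

# Toy factorial DCRF: each slice t carries two label variables (c_t, p_t), e.g. a
# chunk tag and a POS tag.  Factors connect c_t-c_{t+1}, p_t-p_{t+1} and c_t-p_t
# within a slice; the weight tables are tied across slices.
C, P = 2, 3                          # label-set sizes of the two chains
rng = np.random.default_rng(0)
W_cc = rng.normal(size=(C, C))       # chunk-chunk transition weights
W_pp = rng.normal(size=(P, P))       # POS-POS transition weights
W_cp = rng.normal(size=(C, P))       # within-slice (cotemporal) weights
W_obs = rng.normal(size=(4, C, P))   # observation weights for 4 feature ids

def score(obs, c, p):
    """Unnormalized log-score of the joint labeling (c, p) for observation ids obs."""
    T = len(obs)
    s = sum(W_obs[obs[t], c[t], p[t]] + W_cp[c[t], p[t]] for t in range(T))
    s += sum(W_cc[c[t], c[t + 1]] + W_pp[p[t], p[t + 1]] for t in range(T - 1))
    return s

obs = [0, 3, 1, 2]                   # a length-4 observation sequence
T = len(obs)
labelings = [(c, p) for c in product(range(C), repeat=T)
                    for p in product(range(P), repeat=T)]
# Brute-force normalization is feasible only at toy sizes.
logZ = np.logaddexp.reduce([score(obs, c, p) for c, p in labelings])
best_score, best_c, best_p = max((score(obs, c, p), c, p) for c, p in labelings)
print("log Z =", round(float(logZ), 3), "| MAP labeling:", best_c, best_p)
```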

NIPS workshop on Syntax, Semantics, and Statistics, Vancouver, Canada, December 2003

Dynamic Conditional Random Fields for Jointly Labeling Multiple Sequences

Andrew McCallum, Khashayar Rohanimanesh, Charles Sutton

Conditional random fields (CRFs) for sequence modeling have several advantages over joint models such as HMMs, including the ability to relax strong independence assumptions made in those models, and the ability to incorporate arbitrary overlapping features. Previous work has focused on linear-chain CRFs, which correspond to finite-state machines and have efficient exact inference algorithms.

Often, however, we wish to label sequence data in multiple interacting ways—for example, performing part-of-speech tagging and noun phrase segmentation simultaneously, increasing joint accuracy by sharing information between them.

We present dynamic conditional random fields (DCRFs), which are CRFs in which each time slice has a set of state variables and edges—a distributed state representation as in dynamic Bayesian networks—and parameters are tied across slices. (They could also be called conditionally-trained dynamic Markov networks.) Since exact inference can be intractable in these models, we perform approximate inference using the tree-based reparameterization framework (TRP). We also present empirical results comparing DCRFs with linear-chain CRFs on natural language data.

NIPS 2003

Sparse greedy minimax probability machine classification

The Minimax Probability Machine Classification (MPMC) framework [Lanckriet et al., 2002] builds classifiers by minimizing the maximum probability of misclassification, and gives direct estimates of the probabilistic accuracy bound Ω.

The only assumption that MPMC makes is that good estimates of the means and covariance matrices of the classes exist. However, as with Support Vector Machines, MPMC is computationally expensive and requires extensive cross validation experiments to choose kernels and kernel parameters that give good performance.

In this paper we address the computational cost of MPMC by proposing an algorithm that constructs nonlinear sparse MPMC (SMPMC) models by incrementally adding basis functions (i.e. kernels) one at a time – greedily selecting the next one that maximizes the accuracy bound Ω. SMPMC automatically chooses both kernel parameters and feature weights without using computationally expensive cross validation.

Therefore the SMPMC algorithm simultaneously addresses the problem of kernel selection and feature selection (i.e. feature weighting), based solely on maximizing the accuracy bound Ω. Experimental results indicate that we can obtain reliable bounds Ω, as well as test set accuracies that are comparable to state-of-the-art classification algorithms.
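
A small NumPy sketch of the quantity being maximized (the toy data and the Fisher-style restricted direction are stand-ins; the paper optimizes basis-function weights directly): following the minimax probability machine analysis of Lanckriet et al., a candidate direction w yields kappa = w.(mu+ - mu-) / (sqrt(w'S+w) + sqrt(w'S-w)) and the accuracy bound Omega = kappa^2 / (1 + kappa^2), and a greedy loop keeps whichever addition improves Omega most.

```python
import numpy as np

def mpm_bound(w, mu_pos, mu_neg, cov_pos, cov_neg):
    """Worst-case correct-classification bound Omega for a linear rule sign(w.x - b),
    per the minimax probability machine analysis [Lanckriet et al., 2002]:
    kappa = w.(mu+ - mu-) / (sqrt(w'S+ w) + sqrt(w'S- w)),  Omega = kappa^2/(1+kappa^2)."""
    kappa = w @ (mu_pos - mu_neg) / (
        np.sqrt(w @ cov_pos @ w) + np.sqrt(w @ cov_neg @ w))
    kappa = max(float(kappa), 0.0)
    return kappa ** 2 / (1.0 + kappa ** 2)

# toy data: only the first two of five features carry class signal
rng = np.random.default_rng(0)
Xp = np.hstack([rng.normal(+1, 1, (500, 2)), rng.normal(0, 1, (500, 3))])
Xn = np.hstack([rng.normal(-1, 1, (500, 2)), rng.normal(0, 1, (500, 3))])
mu_p, mu_n = Xp.mean(0), Xn.mean(0)
S_p, S_n = np.cov(Xp.T), np.cov(Xn.T)

def bound_for(idx):
    """Bound obtained from a Fisher-style direction restricted to feature set idx
    (a crude stand-in for the paper's optimized basis-function weights)."""
    sub = np.ix_(idx, idx)
    w_sub = np.linalg.solve(S_p[sub] + S_n[sub], (mu_p - mu_n)[idx])
    w = np.zeros(5); w[idx] = w_sub
    return mpm_bound(w, mu_p, mu_n, S_p, S_n)

# greedily add, one at a time, the feature that most improves the bound
chosen = []
for _ in range(3):
    gains = {d: bound_for(sorted(chosen + [d])) for d in range(5) if d not in chosen}
    d_best = max(gains, key=gains.get)
    chosen.append(d_best)
    print("features", sorted(chosen), "Omega ~", round(gains[d_best], 3))
```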

International Conference on Machine Learning (ICML), August 2003

Fast query-optimized kernel machine classification via incremental approximate nearest support vectors

Dennis DeCoste, D. Mazzoni

Support vector machines (and other kernel machines) offer robust modern machine learning methods for nonlinear classification. However, relative to other alternatives (such as linear methods, decision trees and neural networks), they can be orders of magnitude slower at query-time.

Unlike existing methods that attempt to speed up query-time, such as reduced set compression (e.g. (Burges, 1996)) and anytime bounding (e.g. (DeCoste, 2002)), we propose a new and efficient approach based on treating the kernel machine classifier as a special form of k nearest-neighbor.

Our approach improves upon a traditional k-NN by determining at query-time a good k for each query, based on pre-query analysis guided by the original robust kernel machine. We demonstrate effectiveness on high-dimensional benchmark MNIST data, observing a greater than 100-fold reduction in the number of SVs required per query (amortized over all 45 pairwise MNIST digit classifiers), with no extra test errors (in fact, it happens to make 4 fewer).
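
A generic sketch of the query-time idea (the random "support vectors" and coefficients stand in for a trained machine, and the paper's per-query choice of k is more involved than this): order the support vectors by proximity to the query and accumulate the partial decision value, so evaluation can stop after the k nearest SVs.

```python
import numpy as np

def rbf(A, B, gamma=0.05):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def running_decision_values(query, SV, coef, b, gamma=0.05):
    """Visit support vectors nearest-first and return the partial decision value
    after each prefix, so a query can stop after only the k nearest SVs."""
    order = np.argsort(((SV - query) ** 2).sum(1))          # nearest SVs first
    contrib = coef[order] * rbf(query[None, :], SV[order], gamma)[0]
    return np.cumsum(contrib) + b

# toy usage with a made-up "kernel machine"
rng = np.random.default_rng(0)
SV = rng.normal(size=(300, 10))
coef = 0.1 * rng.normal(size=300)        # alpha_i * y_i
b = 0.05
q = rng.normal(size=10)
running = running_decision_values(q, SV, coef, b)
print("decision value after 10 / 50 / all SVs:",
      round(float(running[9]), 3), round(float(running[49]), 3),
      round(float(running[-1]), 3))
```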

Proceedings of the SIAM International Conference on Data Mining (SIAM DM03), May 2003

Anytime query-tuned kernel machines via Cholesky factorization

Dennis DeCoste

Kernel machines (including support vector machines) offer powerful new methods for improving the accuracy and robustness of fundamental data mining operations on challenging (e.g. high-dimensional) data, including classification, regression, dimensionality reduction, and outlier detection.

However, a key tradeoff to this power is that kernel machines typically compute their outputs in terms of a large fraction of the training data, making it difficult to scale them up to train and run over massive data sets typically tackled in data mining contexts.

We recently demonstrated 2 to 64-fold query-time speedups of SVM and Kernel Fisher classifiers via a new computational geometry method for anytime output bounds [4]. This new paper refines our approach in two key ways.

First, we introduce a simple linear algebra formulation based on standard Cholesky factorization, yielding simpler equations and lower computational overhead. Second, this new formulation suggests new methods for achieving additional speedups, including tuning on query samples. We demonstrate effectiveness on three benchmark datasets.
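
The flavor of such anytime output bounds can be sketched with a plain Cauchy-Schwarz argument (this is a generic, looser construction than the paper's Cholesky-based one; the kernel, data, and coefficients below are made up): after evaluating the first k support vectors, the contribution of the rest is bounded by the RKHS norm of the unevaluated part of the weight vector times sqrt(k(q, q)).

```python
import numpy as np

def rbf(A, B, gamma=0.05):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Made-up "kernel machine": support vectors SV, coefficients beta = alpha_i*y_i, bias b.
rng = np.random.default_rng(0)
SV = rng.normal(size=(200, 10))
beta = 0.1 * rng.normal(size=200)
b = 0.0

# Offline work: kernel matrix over the SVs and, for every suffix of the fixed
# evaluation order, the RKHS norm of the not-yet-evaluated part of the weight vector.
K = rbf(SV, SV)
suffix_norm = np.sqrt(np.maximum(0.0, np.array(
    [beta[i:] @ K[i:, i:] @ beta[i:] for i in range(len(SV) + 1)])))

def anytime_eval(q, k):
    """Evaluate only the first k SVs and bracket f(q) by Cauchy-Schwarz:
    |<w_rest, phi(q)>| <= ||w_rest|| * sqrt(k(q, q))."""
    partial = (beta[:k] @ rbf(SV[:k], q[None, :])[:, 0] + b) if k else b
    slack = suffix_norm[k] * np.sqrt(rbf(q[None, :], q[None, :])[0, 0])
    return partial - slack, partial + slack

q = rng.normal(size=10)
for k in (0, 25, 100, 200):
    lo, hi = anytime_eval(q, k)
    print("after %3d SVs: f(q) in [%+.4f, %+.4f]" % (k, lo, hi))
```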

Proceedings of KDD-2003. pp. 717-722. Washington, DC. 2003

Experimental Design for Solicitation Campaigns

Uwe Mayer, Armand Sarkissian

Data mining techniques are routinely used by fundraisers to select those prospects from a large pool of candidates who are most likely to make a financial contribution. These techniques often rely on statistical models based on trial performance data.

This trial performance data is typically obtained by soliciting a smaller sample of the possible prospect pool. Collecting this trial data involves a cost; therefore the fundraiser is interested in keeping the trial size small while still collecting enough data to build a reliable statistical model that will be used to evaluate the remainder of the prospects.

We describe an experimental design approach to optimally choose the trial prospects from an existing large pool of prospects. Prospects are clustered to render the problem practically tractable. We modify the standard D-optimality algorithm to prevent repeated selection of the same prospect cluster, since each prospect can be solicited at most once. We assess the benefits of this approach on the KDD-98 data set by comparing the performance of the model based on the optimal trial data set with that of a model based on a randomly selected trial data set of equal size.
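
A minimal sketch of greedy D-optimal selection under the "no repeats" constraint (the candidate centroids and the ridge term are illustrative, and the paper's modification of the standard algorithm may differ in detail): repeatedly add the cluster representative that most increases log det of the information matrix, skipping clusters already chosen.

```python
import numpy as np

def greedy_d_optimal(candidates, n_select, ridge=1e-6):
    """Greedy D-optimal design sketch: repeatedly add the candidate row that yields
    the largest log det(X'X + ridge*I), never picking the same candidate twice
    (each prospect cluster can be solicited at most once)."""
    d = candidates.shape[1]
    M = ridge * np.eye(d)                        # information matrix so far
    chosen = []
    for _ in range(n_select):
        best_logdet, best_i = -np.inf, None
        for i in range(len(candidates)):
            if i in chosen:
                continue
            x = candidates[i]
            _, logdet = np.linalg.slogdet(M + np.outer(x, x))
            if logdet > best_logdet:
                best_logdet, best_i = logdet, i
        chosen.append(best_i)
        M = M + np.outer(candidates[best_i], candidates[best_i])
    return chosen

# toy usage: 3-feature cluster centroids standing in for prospect clusters
rng = np.random.default_rng(0)
centroids = rng.normal(size=(20, 3))
print("selected cluster representatives:", greedy_d_optimal(centroids, 5))
```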