We consider the problem of both supervised and unsupervised classification for multidimensional data that are non-Gaussian and of mixed types (continuous and/or discrete). An important subclass of graphical model techniques called Generalized Linear Statistics (GLS) is used to capture the underlying statistical structure of these complex data.
GLS exploits the properties of exponential family distributions, which are assumed to describe the data components, and constrains latent variables to a lower dimensional parameter subspace.
Based on the latent variable information, classification is performed in the natural parameter subspace with classical statistical techniques. The benefits of decision making in parameter space are illustrated with examples of text categorization on categorical data and of mixed-type data classification.
As a text document preprocessing tool, we present an extension of the conditional mutual information maximization based feature selection algorithm from binary to categorical data.
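The feature selection idea mentioned above can be sketched as follows. This is a minimal illustration of greedy conditional mutual information maximization using empirical counts, which work for any finite categorical alphabet; it is not necessarily the extension presented in the paper. The function names, the dictionary-based feature layout, and the toy data are assumptions made for this example.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Empirical I(X; Y) in nats for two equal-length categorical sequences."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum(c / n * log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def conditional_mutual_information(x, y, z):
    """Empirical I(X; Y | Z) in nats for categorical sequences."""
    n = len(x)
    pxyz = Counter(zip(x, y, z))
    pxz, pyz, pz = Counter(zip(x, z)), Counter(zip(y, z)), Counter(z)
    return sum(c / n * log(c * pz[k] / (pxz[(a, k)] * pyz[(b, k)]))
               for (a, b, k), c in pxyz.items())


def cmim_select(features, labels, k):
    """Greedily pick k feature names (hypothetical layout: name -> values).

    Each candidate is scored by the smallest information it still carries
    about the labels given any single feature that is already selected.
    """
    scores = {name: mutual_information(col, labels)
              for name, col in features.items()}
    selected = []
    while len(selected) < k and scores:
        best = max(scores, key=scores.get)
        selected.append(best)
        del scores[best]
        for name in scores:
            cmi = conditional_mutual_information(
                features[name], labels, features[best])
            scores[name] = min(scores[name], cmi)
    return selected


# Toy data (assumed): the label is determined by u and v together,
# and u_copy is a redundant duplicate of u.
toy_features = {
    "u": [0, 0, 1, 1],
    "u_copy": [0, 0, 1, 1],
    "v": [0, 1, 0, 1],
}
toy_labels = [0, 1, 2, 3]
print(cmim_select(toy_features, toy_labels, 2))
```

In this toy case the duplicate feature is skipped, because it adds no information about the label once `u` has been selected.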
DMAI 2008 11th IEEE International Conf on Comp Info Tech, Khulna, Bangladesh (December 2008)
A Unifying Viewpoint of some Clustering Techniques Using Bregman Divergences and Extensions to Mixed Data Sets
Cecile Levasseur, Brandon Burdge, Ken Kreutz-Delgado, Uwe Mayer
We present a general viewpoint using Bregman divergences and exponential family properties that contains as special cases the following three algorithms: 1) exponential family Principal Component Analysis (exponential PCA), 2) Semi-Parametric exponential family Principal Component Analysis (SP-PCA) and 3) Bregman soft clustering. This framework is equivalent to a mixed data-type hierarchical Bayes graphical model assumption with latent variables constrained to a low-dimensional parameter subspace. We show that within this framework exponential PCA and SP-PCA are similar to the Bregman soft clustering technique with the addition of a linear constraint in the parameter space. We implement the resulting modifications to SP-PCA and Bregman soft clustering for mixed (continuous and/or discrete) data sets, and add a nonparametric estimation of the point-mass probabilities to exponential PCA. Finally, we compare the relative performances of the three algorithms in a clustering setting for mixed data sets.
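To make the role of Bregman divergences concrete, the following is a minimal sketch of one EM-style iteration of Bregman soft clustering on records with one continuous and one count attribute. It assumes a unit-variance Gaussian for the continuous attribute and a Poisson for the count attribute, and it sums the per-attribute divergences, which corresponds to assuming independent attributes within a cluster. All function names and the toy data are hypothetical; this does not reproduce the paper's implementation and omits the linear parameter-space constraint of exponential PCA and SP-PCA.

```python
from math import exp, log


def bregman(phi, grad_phi, x, y):
    """Bregman divergence d_phi(x, y) = phi(x) - phi(y) - phi'(y) (x - y)."""
    return phi(x) - phi(y) - grad_phi(y) * (x - y)


def sq_euclidean(x, mu):
    """phi(t) = t^2 / 2: pairs with a unit-variance Gaussian."""
    return bregman(lambda t: 0.5 * t * t, lambda t: t, x, mu)


def gen_i_divergence(x, mu):
    """phi(t) = t log t - t (with 0 log 0 = 0): pairs with a Poisson; mu > 0."""
    phi = lambda t: t * log(t) - t if t > 0 else 0.0
    return bregman(phi, log, x, mu)


def soft_assign(x, means, weights, divergences):
    """Posterior cluster probabilities, proportional to w_k * exp(-d(x, mu_k))."""
    scores = [
        w * exp(-sum(d(xi, mi) for d, xi, mi in zip(divergences, x, m)))
        for w, m in zip(weights, means)
    ]
    total = sum(scores)
    return [s / total for s in scores]


def em_step(data, means, weights, divergences):
    """One iteration: soft assignments, then weighted-mean updates."""
    resp = [soft_assign(x, means, weights, divergences) for x in data]
    new_means, new_weights = [], []
    for j in range(len(means)):
        mass = sum(r[j] for r in resp)
        new_weights.append(mass / len(data))
        new_means.append(tuple(
            sum(r[j] * x[a] for r, x in zip(resp, data)) / mass
            for a in range(len(data[0]))
        ))
    return new_means, new_weights


# Toy mixed records (assumed): (continuous value, count).
data = [(0.1, 1), (-0.2, 2), (5.0, 9), (5.3, 11)]
means, weights = [(0.0, 1.0), (5.0, 10.0)], [0.5, 0.5]
divergences = [sq_euclidean, gen_i_divergence]
for _ in range(5):
    means, weights = em_step(data, means, weights, divergences)
print(means, weights)
```

The mean update is a plain weighted average regardless of which divergence is used, which is the property that lets a single procedure handle attributes of different types.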
Proceedings of the 2008 IAPR Workshop on Cognitive Information Processing. pp. 126-131. Santorini, Greece. 2008
Generalized Statistical Methods for Unsupervised Minority Class Detection in Mixed Data Sets
Cecile Levasseur, Uwe Mayer, Brandon Burdge, Ken Kreutz-Delgado
Minority class detection is the problem of detecting the occurrence of rare key events differing from the majority of a data set. This paper considers the problem of unsupervised minority class detection for multidimensional data that are highly non-Gaussian, mixed (continuous and/or discrete), noisy, and nonlinearly related, such as occur, for example, in fraud detection on typical financial data.
A statistical modeling approach is proposed which is a subclass of graphical model techniques. It exploits the properties of exponential family distributions and generalizes techniques from classical linear statistics into a framework referred to as Generalized Linear Statistics (GLS). The methodology exploits the split between the data space and the parameter space for exponential family distributions and solves a nonlinear problem by using classical linear statistical tools applied to data that has been mapped into the parameter space.
For data of mixed type, we propose a fraud detection technique that operates in the parameter space and uses low-dimensional information learned with an Iteratively Reweighted Least Squares (IRLS) based approach to GLS. ROC curves for an initial simulation on synthetic data are presented; they give predictions for results on actual financial data sets.
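As background on the IRLS idea named above, here is a minimal sketch of textbook IRLS for a single Bernoulli generalized linear model (logistic regression with one covariate and an intercept). It is not the GLS procedure of the paper, which learns a low-dimensional parameter subspace for mixed data; it only illustrates how a nonlinear likelihood problem is solved by repeated weighted linear least squares in the natural parameter. The function names and the toy data are assumptions.

```python
from math import exp


def sigmoid(t):
    return 1.0 / (1.0 + exp(-t))


def irls_logistic(xs, ys, iters=25, tol=1e-10):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by IRLS.

    Assumes the classes overlap; on perfectly separable data the
    weights tend to zero and the iteration breaks down.
    """
    b0 = b1 = 0.0
    for _ in range(iters):
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for x, y in zip(xs, ys):
            eta = b0 + b1 * x          # natural parameter for this sample
            mu = sigmoid(eta)          # expectation parameter
            w = mu * (1.0 - mu)        # IRLS weight (Bernoulli variance)
            z = eta + (y - mu) / w     # working response
            s_w += w
            s_wx += w * x
            s_wxx += w * x * x
            s_wz += w * z
            s_wxz += w * x * z
        # Solve the 2x2 weighted normal equations in closed form.
        det = s_w * s_wxx - s_wx * s_wx
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


# Toy data (assumed): binary outcomes that overlap along x.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = irls_logistic(xs, ys)
print(b0, b1)
```

Each pass is an ordinary weighted linear regression of the working response on the covariate, so classical linear tools are applied in the parameter space rather than directly to the binary data.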