Grants and Contributions:
Grant or Award spanning more than one fiscal year. (2017-2018 to 2022-2023)
Cluster analysis can be defined as the search for interesting groups within data. When a finite mixture model is used for cluster analysis, it is called model-based clustering. Typically, the development of novel model-based clustering approaches has focused on the Gaussian mixture model. Unfortunately, the assumption that the subpopulations of the observed data are Gaussian is often unrealistic. The research proposed herein will extend the current literature on model-based clustering by developing finite mixtures of non-Gaussian distributions. Specifically, a mixture of contaminated shifted asymmetric Laplace factor analyzers (MCSALFA) will be developed. This model will be well suited to the analysis of high-dimensional and big data that contain spurious, outlying, or noisy observations. Of special interest are data sets in which the number of variables exceeds the number of observations, such as those arising from microarray gene expression analysis.
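For readers less familiar with the terminology, the standard building blocks named above can be written out as follows. This is a sketch using conventional textbook notation, not the MCSALFA density itself, which the proposal does not state; the symbols (G, pi_g, Lambda_g, Psi_g, alpha_g, eta_g) are assumed here for illustration, and the contaminated component is shown in its Gaussian form only.

```latex
% Finite mixture density with G components and mixing proportions \pi_g
f(\mathbf{x} \mid \boldsymbol{\vartheta})
  = \sum_{g=1}^{G} \pi_g \, f_g(\mathbf{x} \mid \boldsymbol{\theta}_g),
  \qquad \pi_g > 0, \quad \sum_{g=1}^{G} \pi_g = 1.

% Factor-analyzer covariance structure for p variables and q \ll p latent factors:
% \boldsymbol{\Lambda}_g is a p \times q loading matrix, \boldsymbol{\Psi}_g is diagonal
\boldsymbol{\Sigma}_g
  = \boldsymbol{\Lambda}_g \boldsymbol{\Lambda}_g^{\top} + \boldsymbol{\Psi}_g.

% Contaminated Gaussian component (illustration only): a proportion \alpha_g of
% "good" points and a proportion 1 - \alpha_g of outlying points whose covariance
% is inflated by \eta_g > 1
f_g(\mathbf{x} \mid \boldsymbol{\theta}_g)
  = \alpha_g \, \phi(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)
  + (1 - \alpha_g) \, \phi(\mathbf{x}; \boldsymbol{\mu}_g, \eta_g \boldsymbol{\Sigma}_g).
```

The factor-analyzer structure is what makes the case of more variables than observations tractable: with q much smaller than p, each component covariance requires far fewer free parameters than an unrestricted p-by-p matrix.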
From a methodological standpoint, the MCSALFA will unify the factor analysis model and the contaminated mixture model. It will utilize a robust parameter estimation scheme, i.e., one that is not sensitive to outlying points, based on a variant of the expectation-maximization (EM) algorithm. It is well known that using the EM algorithm to estimate the parameters of a finite mixture model can be problematic: the likelihood surface typically contains many local maxima, so the algorithm may converge to a suboptimal solution depending on its starting values. As such, one offshoot of the proposed research will be to address this concern by investigating different initialization strategies. Once the model is implemented, the applicant will publish a manuscript documenting its derivation and classification performance. In addition, open-source software will be released for use by researchers around the world.
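The sensitivity of the EM algorithm to its starting values, and one common remedy, can be illustrated with a minimal sketch. The code below is not the proposed MCSALFA estimation scheme: it fits an ordinary univariate Gaussian mixture and assumes a simple multi-start strategy (several random initializations, keeping the fit with the highest log-likelihood). All function names (`em_fit`, `best_of_random_starts`, and so on) are hypothetical and chosen for illustration.

```python
import math
import random


def weighted_densities(x, weights, means, variances):
    """Return pi_g * N(x; mu_g, sigma_g^2) for each component g."""
    return [
        w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
        for w, m, v in zip(weights, means, variances)
    ]


def log_likelihood(data, weights, means, variances):
    """Mixture log-likelihood; a tiny floor guards against log(0) on underflow."""
    return sum(
        math.log(max(sum(weighted_densities(x, weights, means, variances)), 1e-300))
        for x in data
    )


def em_fit(data, init_means, n_iter=100, tol=1e-8):
    """Run EM from the given starting means; return (ll, weights, means, variances)."""
    n = len(data)
    k = len(init_means)
    grand_mean = sum(data) / n
    grand_var = max(sum((x - grand_mean) ** 2 for x in data) / n, 1e-6)
    weights = [1.0 / k] * k
    means = list(init_means)
    variances = [grand_var] * k
    ll = log_likelihood(data, weights, means, variances)
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to each component.
        resp = []
        for x in data:
            dens = weighted_densities(x, weights, means, variances)
            total = max(sum(dens), 1e-300)
            resp.append([d / total for d in dens])
        # M-step: responsibility-weighted updates of weights, means, and variances.
        for g in range(k):
            n_g = max(sum(r[g] for r in resp), 1e-12)
            weights[g] = n_g / n
            means[g] = sum(r[g] * x for r, x in zip(resp, data)) / n_g
            var = sum(r[g] * (x - means[g]) ** 2 for r, x in zip(resp, data)) / n_g
            variances[g] = max(var, 1e-6)  # floor avoids degenerate components
        new_ll = log_likelihood(data, weights, means, variances)
        converged = new_ll - ll < tol
        ll = new_ll
        if converged:
            break
    return ll, weights, means, variances


def best_of_random_starts(data, k=2, n_starts=10, seed=0):
    """Multi-start strategy: initialize means at random data points, keep the best fit."""
    rng = random.Random(seed)
    fits = [em_fit(data, rng.sample(data, k)) for _ in range(n_starts)]
    return max(fits, key=lambda fit: fit[0])


if __name__ == "__main__":
    # Simulated data from two well-separated Gaussian subpopulations.
    rng = random.Random(1)
    data = [rng.gauss(0.0, 1.0) for _ in range(100)]
    data += [rng.gauss(5.0, 1.0) for _ in range(100)]
    ll, weights, means, variances = best_of_random_starts(data)
    print("log-likelihood:", ll)
    print("weights:", weights, "means:", means, "variances:", variances)
```

Random restarts are only one possible initialization strategy; alternatives such as starting from a k-means or hierarchical clustering partition follow the same pattern of supplying different starting values to the same EM routine and comparing the resulting log-likelihoods.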
The aforementioned example of data arising from gene expression microarray analysis is only one of many possible applications. Data rife with spurious observations, such as those arising from socio-economic studies and sensory studies, will also be targeted. The applicant will establish a research program that focuses on the development of non-Gaussian mixture models. The proposed research project will provide a strong foundation for this research program by supplying suitable projects for both undergraduate and graduate students. In addition, because the applications of the proposed research will be far-reaching, opportunities will arise for students to collaborate with researchers from other fields.