SelvarClustMV: Variable selection approach in model-based clustering allowing for missing values
Authors
Cathy Maugis-Rabusseau
Université de Toulouse
Marie-Laure Martin-Magniette
Sandra Pelletier
Abstract
Clustering methods abound, but none has been designed to combine a variable selection procedure with the
handling of missing data. Yet in microarray datasets, genes are described by a growing number of experiments,
and missing data are always present. It is also important to identify the relevant experiments in order to improve
both the gene clustering and the interpretation of the data. A common practice is to remove genes with missing
values or to replace the missing values with estimates. However, this practice is known to have a substantial
impact on the clustering result. We tackle variable selection and missing data in a single statistical framework:
a versatile variable selection model based on multidimensional Gaussian mixtures is proposed, which takes into
account the role each variable plays in the clustering. Moreover, this statistical framework handles missing
values without requiring any data pre-processing. Numerical experiments highlight the gain of our method over
imputation methods, which fail to recover the true variable roles and sometimes lose biological information.