SelvarClustMV: Variable selection approach in model-based clustering allowing for missing values


  • Cathy Maugis-Rabusseau Université de Toulouse
  • Marie-Laure Martin-Magniette
  • Sandra Pelletier


Clustering methods abound, but none was designed to combine a variable selection procedure with the handling of missing data. In microarray datasets, however, genes are described by a growing number of experiments, and missing data are always present. It is also important to detect the relevant experiments in order to improve both the gene clustering and the interpretation of the data. A common practice is to remove genes with missing values or to replace the missing values with estimates. However, this practice is known to have a substantial impact on the clustering result. We tackle variable selection and missing data within a single statistical framework: a versatile variable selection model based on multidimensional Gaussian mixtures is proposed, which takes the role of each variable in the clustering into account. Moreover, this framework handles missing values without requiring any data pre-processing. Numerical experiments highlight the gain of our method over imputation methods, which fail to recover the true variable roles and sometimes lose biological information.
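
To illustrate how a Gaussian mixture can accommodate missing values without imputation, the sketch below evaluates a mixture log-likelihood on the observed coordinates only. It is a minimal illustration and not the SelvarClustMV implementation: it assumes the missing entries are coded as NaN and can be ignored (for example, missing at random), and the function names `observed_log_density` and `mixture_log_likelihood` are hypothetical. Variable selection is not covered here.

```python
import numpy as np


def observed_log_density(x, mean, cov):
    """Log-density of the observed coordinates of x under N(mean, cov).

    Missing coordinates (NaN) are marginalized out. For a Gaussian this
    amounts to keeping only the matching entries of mean and the matching
    rows and columns of cov.
    """
    obs = ~np.isnan(x)
    d = int(obs.sum())
    if d == 0:
        # Nothing observed: the marginal density integrates to 1.
        return 0.0
    diff = x[obs] - mean[obs]
    sub_cov = cov[np.ix_(obs, obs)]
    _, logdet = np.linalg.slogdet(sub_cov)
    maha = diff @ np.linalg.solve(sub_cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)


def mixture_log_likelihood(X, weights, means, covs):
    """Observed-data log-likelihood of a Gaussian mixture (assumed setup)."""
    total = 0.0
    for x in X:
        comp = np.array([
            np.log(w) + observed_log_density(x, m, c)
            for w, m, c in zip(weights, means, covs)
        ])
        # Log-sum-exp over components for numerical stability.
        top = comp.max()
        total += top + np.log(np.exp(comp - top).sum())
    return float(total)


# Toy data: three "genes" measured in two "experiments", with gaps.
X = np.array([[0.1, np.nan], [2.0, 1.9], [np.nan, -0.2]])
weights = [0.5, 0.5]
means = [np.zeros(2), np.full(2, 2.0)]
covs = [np.eye(2), np.eye(2)]
print(mixture_log_likelihood(X, weights, means, covs))
```

Every row contributes to the likelihood through whatever coordinates it has, so no gene is discarded and no value is filled in beforehand.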