Model selection for unsupervised learning

Model selection is a crucial task. Given some data, a learning task, and a set of candidate models for solving that task, one would like a ranking of the candidates that makes it possible to select the best model for the task and the data at hand. The most intuitive selection problem in unsupervised learning, though not an easy one to solve, is choosing the model order: for instance, the number of clusters in clustering, the number of factors in factor analysis, or the number of principal components in PCA.
We approach the model-order selection problem with the minimum transfer cost principle, a method that makes cross-validation applicable to unsupervised learning. In [1], we describe this method and apply it to several unsupervised learning problems: k-means, SVD/PCA, Gaussian mixture models (GMM; here cross-validation is in fact applicable even without our method), correlation clustering, and Boolean matrix factorization (in the context of the role mining problem).
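To make the idea of cross-validating a clustering more concrete, here is a minimal sketch for k-means in plain Python. It is an illustration under stated assumptions, not the implementation from [1]: a solution is learned on a training set, each test object is mapped to its nearest training object (one simple choice of mapping, assumed here), and the test object is then charged the k-means cost of the cluster that its mapped training object belongs to. The function names (kmeans, transfer_cost, select_k) and all parameters are hypothetical and chosen only for this example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Index of the closest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]

def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        labels = assign(points, centroids)
        new_centroids = list(centroids)
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster runs empty
                new_centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)

def transfer_cost(train, test, centroids, labels):
    """Average k-means cost of the test set under the training solution.

    Assumption of this sketch: each test point inherits the cluster of its
    nearest training point instead of being re-assigned to its own nearest
    centroid.
    """
    total = 0.0
    for x in test:
        nn = min(range(len(train)), key=lambda j: sq_dist(x, train[j]))
        total += sq_dist(x, centroids[labels[nn]])
    return total / len(test)

def select_k(train, test, candidates, seed=0, restarts=5):
    """Return (k with the lowest transfer cost, dict of costs per k)."""
    costs = {}
    for k in candidates:
        best = None
        for r in range(restarts):  # keep the restart with the lowest training cost
            centroids, labels = kmeans(train, k, random.Random(seed + r))
            train_cost = sum(sq_dist(p, centroids[l]) for p, l in zip(train, labels))
            if best is None or train_cost < best[0]:
                best = (train_cost, centroids, labels)
        costs[k] = transfer_cost(train, test, best[1], best[2])
    return min(costs, key=costs.get), costs

def make_blobs(rng, centers, n_per_center, sigma=1.0):
    """Hypothetical toy data: isotropic Gaussian blobs around the given centers."""
    return [tuple(rng.gauss(m, sigma) for m in center)
            for center in centers for _ in range(n_per_center)]

if __name__ == "__main__":
    rng = random.Random(42)
    centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    train = make_blobs(rng, centers, 40)
    test = make_blobs(rng, centers, 40)
    best_k, costs = select_k(train, test, range(1, 7))
    print(best_k, costs)
```

The sketch uses a single train/test split for brevity; averaging the transfer cost over several random splits would be the natural next step. How sharply the cost curve identifies the model order depends on the data and on the mapping, so the selected value should be read together with the full cost curve.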
In [2] and [3], we use the framework of approximation set coding, an information-theoretic principle for regularized optimization, for model selection. In [2], we apply it to SVD; in [3], we apply it to GMM and to Boolean matrix factorization via multi-assignment clustering. Paper [3] is the best place to start if you are new to approximation set coding.

Relevant publications:
  1. Mario Frank, Morteza Haghir Chehreghani, and Joachim M. Buhmann. "The Minimum Transfer Cost Principle for Model-Order Selection". ECML PKDD 2011: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases.
     
  2. Mario Frank and Joachim M. Buhmann. "Selecting the rank of truncated SVD by Maximum Approximation Capacity". ISIT 2011: IEEE International Symposium on Information Theory.
     
  3. Joachim M. Buhmann, Morteza Haghir Chehreghani, Mario Frank, and Andreas P. Streich. "Information Theoretic Model Selection for Pattern Analysis". JMLR Workshop and Conference Proceedings 7, 1-8: ICML 2011 Workshop on Unsupervised and Transfer Learning.