TITLE: Probabilistic Hierarchical Clustering with Labeled and Unlabeled Data

AUTHORS: Jan Larsen, Anna Szymkowiak, Lars Kai Hansen
Informatics and Mathematical Modelling, Building 321
Technical University of Denmark, DK-2800 Lyngby, Denmark
emails: asz,jl,lkhansen@imm.dtu.dk
www: http://eivind.imm.dtu.dk


This paper presents hierarchical probabilistic clustering methods for unsupervised and supervised learning in data mining applications, where supervised learning is performed using both labeled and unlabeled examples. The probabilistic clustering is based on the previously suggested Generalizable Gaussian Mixture model and is extended using a modified Expectation Maximization procedure for learning with both unlabeled and labeled examples. The proposed hierarchical scheme is agglomerative and based on probabilistic similarity measures. We compare an L2 dissimilarity measure, an error confusion similarity measure, and an accumulated posterior cluster probability measure. The unsupervised and supervised schemes are successfully tested on artificial data and on e-mail segmentation.
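
To illustrate the L2 dissimilarity mentioned above, here is a minimal sketch under a simplifying assumption: each cluster is represented by a single Gaussian density. It uses the standard closed form for the integral of a product of two Gaussians. The function names are hypothetical, and the normalization used in the paper itself may differ. The first agglomerative step would merge the pair of clusters with the smallest dissimilarity.

```python
import numpy as np

def gauss_overlap(m1, S1, m2, S2):
    """Integral of N(x; m1, S1) * N(x; m2, S2) over x, which equals N(m1; m2, S1 + S2)."""
    d = np.asarray(m1, dtype=float) - np.asarray(m2, dtype=float)
    S = np.asarray(S1, dtype=float) + np.asarray(S2, dtype=float)
    k = d.shape[0]
    quad = d @ np.linalg.solve(S, d)
    norm = np.sqrt((2.0 * np.pi) ** k * np.linalg.det(S))
    return np.exp(-0.5 * quad) / norm

def l2_dissimilarity(m1, S1, m2, S2):
    """Squared L2 distance between two Gaussian densities (unnormalized; an assumption)."""
    return (gauss_overlap(m1, S1, m1, S1)
            + gauss_overlap(m2, S2, m2, S2)
            - 2.0 * gauss_overlap(m1, S1, m2, S2))

def closest_pair(means, covs):
    """Return the index pair (i, j), i < j, with the smallest L2 dissimilarity."""
    best, best_pair = np.inf, None
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            d = l2_dissimilarity(means[i], covs[i], means[j], covs[j])
            if d < best:
                best, best_pair = d, (i, j)
    return best_pair

# Toy example: three 2-D clusters with identity covariance (made-up values).
means = [np.array([0.0, 0.0]), np.array([0.1, 0.0]), np.array([5.0, 5.0])]
covs = [np.eye(2)] * 3
pair = closest_pair(means, covs)  # candidate for the first merge
```

In a full agglomerative scheme the merged cluster would be a mixture rather than a single Gaussian, so the dissimilarity would have to be extended to sums of components; that step is omitted here.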

Invited submission to the International Journal of Knowledge-Based Intelligent Engineering Systems, 2001.