ABSTRACT OF PROFESSOR KASKI'S TALK

 

Learning metrics for exploratory data analysis

Visualization and cluster analysis of multivariate data are usually based on distances between the data samples. The distance measure is often chosen heuristically, for instance by selecting suitable features and then applying a global Euclidean metric. We have developed methods that remove this arbitrariness and aim to measure distances only along important (local) directions. We assume that auxiliary data are available, paired with the primary data, and that changes in the primary data are important or relevant if they cause changes in the auxiliary data. For example, in the analysis of gene expression the auxiliary data can indicate the functional classes of the genes. The distances are learned from a finite data set and can be used, for instance, in clustering and in Self-Organizing Map-based data visualization. The methods have been applied to the analysis of bankruptcy, text documents, and gene expression.
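
The abstract does not say how the local distances are computed. As a rough illustration only, the sketch below assumes one possible formalization: the auxiliary data are class labels c, their conditional distribution p(c | x) is modeled by a softmax-linear classifier, and the squared local distance of a small displacement dx is dx^T J(x) dx, where J(x) is the Fisher information matrix of p(c | x) with respect to x. The parameters W and b and all function names are hypothetical and chosen for the example; they are not taken from the talk.

```python
import numpy as np


def class_posteriors(x, W, b):
    """Softmax model of p(c | x). W and b are hypothetical, assumed already fitted."""
    logits = W @ x + b
    logits = logits - logits.max()  # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()


def fisher_metric(x, W, b):
    """Assumed local metric: J(x) = sum_c p(c|x) g_c g_c^T, with g_c = grad_x log p(c|x)."""
    p = class_posteriors(x, W, b)
    # For a softmax-linear model, grad_x log p(c|x) = W[c] - sum_k p_k W[k].
    grads = W - p @ W  # shape: (n_classes, n_features)
    return (grads * p[:, None]).T @ grads


def local_sq_distance(x, dx, W, b):
    """Squared distance of a small displacement dx at x under the assumed metric."""
    J = fisher_metric(x, W, b)
    return float(dx @ J @ dx)


# Toy setting: two classes whose probabilities depend only on the first feature.
W = np.array([[1.0, 0.0], [-1.0, 0.0]])
b = np.zeros(2)
x = np.zeros(2)

d_relevant = local_sq_distance(x, np.array([0.1, 0.0]), W, b)    # approximately 0.01
d_irrelevant = local_sq_distance(x, np.array([0.0, 0.1]), W, b)  # exactly 0.0
print(d_relevant, d_irrelevant)
```

In this toy setting a step along the first feature changes the class probabilities and therefore has nonzero length, while a step along the second feature leaves them unchanged and has zero length. This mirrors the idea that only changes in the primary data that cause changes in the auxiliary data count as relevant.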