TITLE: Independent Components in Text

AUTHORS: Thomas Kolenda and Lars Kai Hansen

Department of Mathematical Modelling, Building 321
Technical University of Denmark, DK-2800 Lyngby, Denmark
emails: thko,lkhansen@imm.dtu.dk
www: http://eivind.imm.dtu.dk

ABSTRACT:

In this communication we analyze the feasibility of independent component analysis (ICA) for dimensionality reduction and representation of word histograms. The analysis is carried out in a likelihood framework, which allows estimation of the loadings (source signals), the mixing matrix, and the noise level. When the signals are noisy, the estimated sources are non-linear functionals of the observed signals, in contrast to the linear noise-free case. We also discuss the generalizability of the estimated models and show that an empirical test error estimate may be used to optimize model dimensionality, in particular the number of sources. When applied to word histograms, ICA is shown to produce representations that are better aligned with the group structure in the text data than those of latent semantic analysis (LSA).
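To illustrate why noise makes the source estimates non-linear, the following is a minimal sketch of a generic noisy linear mixing model. The symbols x (observed signal), A (mixing matrix), s (sources), \varepsilon (noise), \sigma^2 (noise variance) and p(s_k) (source prior) are assumptions introduced here for illustration, as is the choice of isotropic Gaussian noise and a maximum a posteriori estimate; they are not necessarily the notation or the estimator used in the paper.

```latex
% Assumed generic model: linear mixing plus isotropic Gaussian noise
x = A s + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I)

% Noise-free case with square, invertible A: the sources are linear in x
\hat{s} = A^{-1} x

% Noisy case: the maximum a posteriori estimate trades data fit against the source prior
\hat{s} = \arg\max_{s} \left[ -\frac{1}{2\sigma^2} \lVert x - A s \rVert^2 + \sum_k \log p(s_k) \right]

% Stationarity condition, with \phi_k(s_k) = \frac{d}{d s_k} \log p(s_k)
\frac{1}{\sigma^2} A^{\top} (x - A \hat{s}) + \phi(\hat{s}) = 0
```

For a non-Gaussian source prior, \phi is a non-linear function, so the solution \hat{s} depends non-linearly on x. As \sigma^2 \to 0 with a square, invertible A, the data-fit term dominates and the estimate approaches the linear solution A^{-1} x.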

Submitted to NIPS*99, Denver, November 29 - December 4, 1999