TITLE: Independent Components in Text
AUTHORS: Thomas Kolenda and Lars Kai Hansen
Department of Mathematical Modelling, Building 321
Technical University of Denmark, DK-2800 Lyngby, Denmark
emails: thko,lkhansen@imm.dtu.dk
www: http://eivind.imm.dtu.dk
ABSTRACT:
In this communication we analyze the feasibility of independent
component analysis (ICA) for dimensionality reduction and representation of
word histograms.
The analysis is carried out in a likelihood framework which allows
estimation of the loadings (source signals), the mixing matrix, and the noise level.
When the signals are noisy, the estimated sources
are non-linear functionals of the observed signals, in contrast to
the noise-free case, where they are linear.
We also discuss the generalizability of the estimated models and show
that an empirical test error estimate may be used to optimize
model dimensionality, in particular the optimal number of sources.
When applied to word histograms, ICA is shown to produce
representations that are better aligned with the group structure in the
text data than those of latent semantic analysis (LSA).
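
The contrast between the noisy and the noise-free estimates of the sources can be made concrete with a short sketch. The notation is an assumption, since the abstract fixes no symbols: x is an observed word histogram, A the mixing matrix, s the sources with prior p(s), and the noise is taken to be isotropic Gaussian with variance \sigma^2, which is one common choice and not necessarily the model used in the paper.

```latex
% Assumed notation (not given in the abstract):
%   x observed word histogram, A mixing matrix, s sources with prior p(s),
%   \varepsilon isotropic Gaussian noise with variance \sigma^2.
\begin{align}
  x &= A s + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I) \\
  % Noise-free case, assuming A is square and invertible: a linear map of x.
  \hat{s}_{\text{noise-free}} &= A^{-1} x \\
  % Noisy case: maximum a posteriori estimate of the sources.
  \hat{s}_{\text{noisy}} &= \arg\max_{s} \left[ -\frac{1}{2\sigma^2}
      \lVert x - A s \rVert^2 + \log p(s) \right]
\end{align}
```

Under these assumptions the noisy estimate satisfies A^T (x - A s) / \sigma^2 + \nabla_s \log p(s) = 0, which is linear in x only when p(s) is Gaussian. For the non-Gaussian source priors that ICA relies on, the estimate is therefore a non-linear function of x.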
Submitted to NIPS*99, Denver, November 29 - December 4, 1999