We view text analysis as a classic pattern recognition problem. Features are extracted from the text, the dimensionality of the feature data is reduced if necessary, and finally the data is fed to a classifier that assigns the text to a class. Text in this context can be articles, documents, web pages, etc.

Features
The features used at this point are word occurrences in the text. Co-occurrences of words are rare, so co-occurrence features would be very sparse. Images, on the other hand, could use higher-order features. In practice a term-document matrix is built in which each element records the occurrence of a term in a document. The matrix can optionally be normalized, e.g. to remove the effect of differing text lengths.
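
As a minimal sketch of the term-document matrix described above, the following builds raw term counts and then divides each document's column by its total term count. The whitespace tokenizer, the function names `term_document_matrix` and `normalize_by_length`, and the choice of length normalization are assumptions made for illustration, not necessarily what is used here.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (vocabulary, matrix) with one row per term and one column per document."""
    # Assumption: lowercased whitespace tokenization, chosen only for brevity.
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    # Element [i][j] is the number of times term i occurs in document j.
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


def normalize_by_length(matrix):
    """Divide each column by that document's total term count."""
    n_docs = len(matrix[0]) if matrix else 0
    totals = [sum(row[j] for row in matrix) for j in range(n_docs)]
    return [
        [row[j] / totals[j] if totals[j] else 0.0 for j in range(n_docs)]
        for row in matrix
    ]


# Two toy documents of very different length (made-up example text).
docs = ["the cat sat", "the cat sat on the mat the cat sat on the mat"]
vocab, X = term_document_matrix(docs)
X_norm = normalize_by_length(X)
```

After normalization every column sums to one, so a long document no longer dominates a short one simply because it contains more words.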

Feature Subspace Selection
At the moment we use Latent Semantic Analysis (LSA) to select the most important features. It is based on Principal Component Analysis (PCA), where the variance in the feature space is the measure of importance. The problem with PCA is that it relies on Singular Value Decomposition (SVD), which is very time consuming.
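
The subspace projection can be sketched with NumPy as follows. This is an illustration under stated assumptions: the helper name `lsa_project`, the toy matrix, and the choice of k = 2 are hypothetical, and the sketch applies the SVD directly to the matrix without the mean centering that strict PCA would require.

```python
import numpy as np


def lsa_project(X, k):
    """Project the documents (columns of X) onto the k leading left singular vectors."""
    # Thin SVD: X = U @ diag(s) @ Vt, with singular values in descending order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Document coordinates in the k-dimensional subspace, shape (k, n_docs).
    # This equals U[:, :k].T @ X.
    docs_k = s[:k, None] * Vt[:k, :]
    return U[:, :k], s, docs_k


# Toy term-document matrix (terms x documents); the values are made up.
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
])
U_k, s, docs_k = lsa_project(X, k=2)

# Share of the total squared singular values retained by the first two components.
explained = (s[:2] ** 2).sum() / (s ** 2).sum()
```

The squared singular values play the role of the variance criterion: components with small singular values contribute little and are dropped. The full SVD computed here is exactly the expensive step noted above.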

Classifiers
The classifiers currently used are Independent Component Analysis (ICA) and Gaussian Mixture Models (GMM). The GMM has the flexibility to be either supervised or unsupervised, i.e. it can be estimated from either labeled or unlabeled texts. In both cases the density of the input is estimated as a mixture of Gaussian functions, which enables the texts to be clustered into groups.
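
For the unsupervised case, the density estimation can be sketched as a small expectation-maximization (EM) loop in NumPy. The function name `fit_gmm`, the diagonal covariances, the farthest-point initialization, and the synthetic two-cluster data are all assumptions chosen to keep the example short; they are not a description of the actual implementation.

```python
import numpy as np


def fit_gmm(X, k, n_iter=50):
    """Fit a diagonal-covariance GMM with EM.

    Returns weights (k,), means (k, d), variances (k, d), and the
    responsibilities (n, k) computed in the last E-step.
    """
    n, d = X.shape
    # Deterministic farthest-point initialization of the means (an assumption).
    means = [X[0]]
    for _ in range(1, k):
        d2 = np.min([((X - m) ** 2).sum(axis=1) for m in means], axis=0)
        means.append(X[d2.argmax()])
    means = np.array(means)
    variances = np.tile(X.var(axis=0) + 1e-6, (k, 1))
    weights = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # E-step: log(weight * Gaussian density) for every point and component.
        diff = X[:, None, :] - means[None, :, :]
        log_p = -0.5 * (
            np.log(2 * np.pi * variances)[None] + diff ** 2 / variances[None]
        ).sum(axis=2)
        log_p += np.log(weights)[None, :]
        log_p -= log_p.max(axis=1, keepdims=True)  # for numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate the parameters from the responsibilities.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        diff = X[:, None, :] - means[None, :, :]
        variances = (resp[:, :, None] * diff ** 2).sum(axis=0) / nk[:, None] + 1e-6

    return weights, means, variances, resp


# Synthetic stand-in for reduced document features: two well-separated groups.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])
weights, means, variances, resp = fit_gmm(X, k=2)

# Each text is assigned to the component with the highest responsibility.
labels = resp.argmax(axis=1)
```

In the supervised case the same model can be estimated without EM by fitting the Gaussians of each class directly to the texts carrying that label.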

Future work
We are looking for a faster way of finding a good feature subspace for both labeled and unlabeled text. We also plan to model the classifier as an Artificial Neural Network (ANN) for classifying labeled text, an approach mostly used for document classification. Finally, we plan to extend the data types to images soon.