Our view on text analysis
Text analysis is a classic pattern recognition problem. Features are
extracted from the text, the dimension of the feature data is reduced if
necessary, and finally the data is fed to a classifier, which classifies
the text. Text in this context can be articles, documents, web pages, etc.
Features
The features currently used are word
occurrences in the text, since co-occurrences of words in a text are rare
and the resulting features are thus very sparse. Images, on the other hand,
could use higher-order features. In practice a term-document matrix is built
in which each element records the occurrence of a term in a document. The
matrix can then be normalized, e.g., to remove the effect of differing text
lengths.
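
The term-document matrix described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: whitespace tokenization, lowercasing, and normalization by the total term count of each document are our choices for the example, and the function names are hypothetical.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts term i in document j."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    counts = [Counter(toks) for toks in tokenized]
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


def normalize_columns(matrix):
    """Divide each document column by its total count (one possible length normalization)."""
    n_docs = len(matrix[0]) if matrix else 0
    totals = [sum(row[j] for row in matrix) for j in range(n_docs)]
    return [
        [row[j] / totals[j] if totals[j] else 0.0 for j in range(n_docs)]
        for row in matrix
    ]


docs = ["the cat sat", "the cat sat on the mat", "dogs bark"]
vocab, X = term_document_matrix(docs)
X_norm = normalize_columns(X)
```

After normalization every document column sums to one, so a long and a short text with the same word proportions get the same feature vector.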
Feature Subspace Selection
At the moment we use Latent Semantic Analysis
(LSA) to select the most important features. It is based on Principal
Component Analysis (PCA), where the variance in the feature space is the
measure of importance. The problem with PCA is its use of Singular
Value Decomposition (SVD), which is very time consuming.
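
To make the subspace idea concrete, here is a minimal sketch that finds the leading singular triplets of a small term-document matrix by power iteration with deflation. It is not our production code: the toy matrix, the function names, and the choice of power iteration (instead of a full SVD routine) are assumptions made to keep the example dependency-free.

```python
import math
import random


def transpose(A):
    return [list(col) for col in zip(*A)]


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def norm(x):
    return math.sqrt(sum(v * v for v in x))


def top_singular_triplet(A, iters=200, seed=0):
    """Power iteration on A^T A; returns (sigma, u, v) for the largest singular value."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(len(A[0]))]
    v = [x / norm(v) for x in v]
    At = transpose(A)
    for _ in range(iters):
        w = matvec(At, matvec(A, v))
        n = norm(w)
        if n == 0.0:
            break
        v = [x / n for x in w]
    Av = matvec(A, v)
    sigma = norm(Av)
    u = [x / sigma for x in Av] if sigma > 0.0 else Av
    return sigma, u, v


def leading_triplets(A, k):
    """Return the k leading (sigma, u, v) triplets, removing each one after it is found."""
    residual = [[float(x) for x in row] for row in A]
    triplets = []
    for _ in range(k):
        sigma, u, v = top_singular_triplet(residual)
        triplets.append((sigma, u, v))
        residual = [
            [residual[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))
        ]
    return triplets


# Toy matrix (rows = terms, columns = documents): documents 0 and 1 share
# their terms, document 2 uses different ones.
A = [[1, 1, 0],
     [1, 1, 0],
     [0, 0, 1],
     [0, 0, 1]]
triplets = leading_triplets(A, k=2)
# Coordinates of each document in the 2-dimensional latent subspace.
doc_coords = [[sigma * v[j] for sigma, _, v in triplets] for j in range(len(A[0]))]
```

Documents 0 and 1 end up at the same point in the reduced space, while document 2 lies along the second direction. Each power-iteration step costs two matrix-vector products, which hints at why a full decomposition of a large term-document matrix is expensive.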
Classifiers
The classifiers currently used are Independent
Component Analysis (ICA) and Gaussian Mixture Models (GMM). The GMM has
the flexibility to be either supervised or unsupervised, i.e., estimated
from either labeled or unlabeled texts. In both cases the density of the
input is estimated as a mixture of Gaussian functions, which enables the
texts to be clustered into groups.
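
The unsupervised case can be illustrated with a small expectation-maximization (EM) fit of a one-dimensional Gaussian mixture. This is a sketch only: real inputs would be multi-dimensional feature vectors, and the data values, initial means, and function names below are hypothetical choices for the example.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, init_means, iters=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture to unlabeled data with EM."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            p = [weights[c] * gaussian_pdf(x, means[c], variances[c]) for c in range(k)]
            total = sum(p) or 1e-300  # guard against underflow to zero
            resp.append([pc / total for pc in p])
        # M-step: re-estimate weight, mean and variance of each component.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            if nc == 0.0:
                continue
            weights[c] = nc / len(data)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / nc
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / nc
            variances[c] = max(var, min_var)
    return weights, means, variances


def assign(x, weights, means, variances):
    """Index of the component with the highest weighted density at x."""
    scores = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    return scores.index(max(scores))


# Hypothetical 1-D features forming two well-separated groups.
data = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
weights, means, variances = fit_gmm_1d(data, init_means=[1.0, 4.0])
labels = [assign(x, weights, means, variances) for x in data]
```

In the supervised case the E-step is unnecessary: each component's mean and variance can be estimated directly from the texts carrying the corresponding label, and `assign` then acts as the classifier.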
Future work
We want to find a faster way of obtaining a good feature
subspace for both labeled and unlabeled text. We also want to model the
classifier as an Artificial Neural Network (ANN) to classify labeled text,
an approach mostly used for document classification. We are also planning
to expand the data types to images soon.