Latent Semantic Indexing

Retrieval methods suffer from two well-known language-related problems, synonymy and polysemy [32]. Synonymy means that an object can be referred to in many ways, i.e., people use different words to search for the same object; an example is the pair car and automobile. Polysemy means that a word can have more than one distinct meaning; an example is the word jaguar, which can denote either a well-known make of car or an animal. Latent Semantic Indexing (LSI) [32] dampens the effect of synonymy by applying a Singular Value Decomposition (SVD) to a term-by-document matrix of term frequencies. The dimension of the transformed space is reduced by keeping only the largest singular values, which account for most of the variance of the original space. In this way the SVD extracts the major associative patterns from the document space and ignores the minor ones. Query terms can also be transformed into this subspace, where a query can lie close to documents in which its terms do not appear. The advantages of LSI are that it is fully automatic and requires no linguistic expertise; a positive side effect is that the document vectors become much shorter.
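
The procedure described above can be sketched in a few lines of Python with NumPy. This is only a minimal illustration, not the implementation used in [32]: the toy term-by-document matrix, the choice k = 2, and the helper names lsi_fit, fold_in and cosine are all assumptions made for this example. The query is mapped into the reduced space with the commonly used fold-in formula q_k = inverse(Sigma_k) * transpose(U_k) * q.

```python
import numpy as np

# Hypothetical toy corpus: rows are terms, columns are documents d0, d1, d2.
terms = ["car", "automobile", "engine", "jaguar", "animal"]
A = np.array([
    [2, 0, 0],   # car        (only in d0)
    [0, 2, 0],   # automobile (only in d1)
    [1, 1, 0],   # engine     (shared by d0 and d1)
    [0, 1, 1],   # jaguar     (polysemous: d1 and d2)
    [0, 0, 2],   # animal     (only in d2)
], dtype=float)


def lsi_fit(A, k):
    """Return the rank-k factors U_k, s_k, Vt_k of the SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


def fold_in(q, U_k, s_k):
    """Map a term-space vector q into the k-dimensional LSI space."""
    return (U_k.T @ q) / s_k


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


U_k, s_k, Vt_k = lsi_fit(A, k=2)   # k=2 is an arbitrary choice for the toy data
docs_k = Vt_k.T                    # one row per document in the reduced space

# Query containing only the term "car".
q = np.zeros(len(terms))
q[terms.index("car")] = 1.0
q_k = fold_in(q, U_k, s_k)

sims = [cosine(q_k, d) for d in docs_k]
for j, sim in enumerate(sims):
    print(f"d{j}: {sim:.3f}")
```

In this toy case the query for car is closer to d1, which contains automobile but not car, than to d2, because car and automobile both co-occur with engine. Real collections need a much larger matrix, and usually term weighting, before such effects are meaningful.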

Empirical results for LSI have been good. LSI has also been examined analytically. A comparison of LSI with multidimensional scaling has shown that LSI preserves the document space optimally when the inner product is used as the similarity function [33]. The same article implies that this also applies to other similarity measures. Using a Bayesian regression model, it has been shown that removing the small singular values removes statistically dubious information and also reduces specification errors [34].
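
The inner-product statement can be checked numerically on a small example. The sketch below is not taken from [33]; it relies on the standard Eckart-Young property of the truncated SVD, and the random count matrix, its size, and the choice k = 3 are assumptions made purely for illustration. It compares the matrix of all document-document inner products before and after truncation and confirms that the error is determined entirely by the discarded singular values.

```python
import numpy as np

# Hypothetical 8-term by 6-document count matrix (random, for illustration only).
rng = np.random.default_rng(0)
A = rng.poisson(1.0, size=(8, 6)).astype(float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3                                   # number of singular values kept (assumed)
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]    # rank-k reconstruction of A

gram = A.T @ A          # inner products between all document pairs
gram_k = A_k.T @ A_k    # the same inner products after truncation

# Frobenius error of the inner-product matrix ...
err = np.linalg.norm(gram - gram_k, "fro")
# ... equals the root of the sum of the discarded singular values to the 4th power.
expected = np.sqrt(np.sum(s[k:] ** 4))

# Share of the total squared singular values that the k kept values account for.
retained = np.sum(s[:k] ** 2) / np.sum(s ** 2)

print(f"inner-product error: {err:.4f} (predicted {expected:.4f})")
print(f"retained variance share: {retained:.3f}")
```

Because the error depends only on the singular values that were dropped, no other rank-k representation can reproduce the document inner products with a smaller Frobenius error.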

Apart from being used with the usual vector space similarity measures, LSI has also been incorporated into a neural network model [35].

The authors of [36] claim to have developed a feature reduction algorithm, based on distributional clustering of words, that performs better than LSI, giving lower classification error with fewer dimensions.