| Retrieval methods suffer
from two well-known language-related problems, synonymy and polysemy
[32]. Synonymy means that an object
can be referred to in many ways, i.e., people use different words to search
for the same object. An example is the pair of words car and automobile.
Polysemy is the problem of a word having more than one meaning.
An example is the word jaguar, which could mean a well-known
car brand or an animal. Latent Semantic Indexing (LSI) [32]
dampens the effect of synonymy by applying a Singular Value Decomposition
(SVD) to a term-by-document matrix of term frequencies. The dimension of
the transformed space is reduced by keeping only the largest singular values,
which account for most of the variance of the original space. In this way the
major associative patterns are extracted from the document space and the
small patterns are ignored. Query terms can also be transformed into this
subspace, where a query can lie close to documents in which its terms do
not appear. The advantage of LSI is that it is fully automatic and does not
require language expertise; a positive side effect is that the document
vectors become much shorter.
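The procedure described above can be illustrated with a minimal sketch. The toy corpus, the variable names and the choice of k = 2 are assumptions made for illustration only and are not taken from [32]; documents and the query are both projected onto the leading left singular vectors, which is one common way of mapping them into the reduced subspace.

```python
import numpy as np

# Hypothetical toy corpus (illustrative only): rows are terms, columns are documents.
terms = ["car", "automobile", "engine", "jaguar", "animal"]
A = np.array([
    [1, 0, 0, 1],  # car
    [0, 1, 0, 1],  # automobile
    [1, 1, 0, 1],  # engine
    [0, 0, 1, 0],  # jaguar
    [0, 0, 1, 0],  # animal
], dtype=float)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# SVD of the term-by-document matrix; singular values come back in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values (k = 2 is an assumption for this toy example).
k = 2
U_k = U[:, :k]

# Project every document (column of A) into the k-dimensional subspace.
docs_lsi = (U_k.T @ A).T  # one row per document, k columns

# Build a query containing only the term "car" and project it the same way.
q = np.zeros(len(terms))
q[terms.index("car")] = 1.0
q_lsi = U_k.T @ q

# Similarity of the query to each document, before and after the reduction.
raw_sim = [cosine(q, A[:, j]) for j in range(A.shape[1])]
lsi_sim = [cosine(q_lsi, docs_lsi[j]) for j in range(A.shape[1])]

for j in range(A.shape[1]):
    print(f"doc {j}: raw = {raw_sim[j]:.2f}, lsi = {lsi_sim[j]:.2f}")
```

In this toy corpus the second document contains automobile and engine but not car, so its raw cosine similarity to the query is zero. In the reduced subspace the query and that document point in nearly the same direction, because car and automobile both co-occur with engine. The document vectors also shrink from five components to two.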
Empirical studies of LSI have shown good results. LSI has also been examined analytically. By comparing LSI to multidimensional scaling, it has been shown that LSI preserves the document space optimally when the inner product is used as the similarity function [33]. The same article suggests that this also applies to other similarity measures. Using a Bayesian regression model, it has been shown that removing the small singular values removes statistically dubious information and also reduces specification errors [34]. Apart from being used with the usual vector space similarity measures, LSI has also been incorporated into a neural network model [35]. The authors of [36] claim to have developed a feature reduction algorithm, based on distributional clustering of words, that outperforms LSI with lower classification error and fewer dimensions.
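The inner-product statement attributed to [33] above can be made concrete with a standard consequence of the Eckart-Young theorem. The sketch below is not necessarily the exact formulation used in [33], and the symbols A (term-by-document matrix), U, \Sigma, V (its SVD factors), \sigma_i (singular values) and k (number of retained dimensions) are notation introduced here for illustration.

```latex
A = U \Sigma V^{T}, \qquad A_k = U_k \Sigma_k V_k^{T}

% Matrix of all document-document inner products, before and after truncation:
A^{T} A = V \Sigma^{2} V^{T}, \qquad A_k^{T} A_k = V_k \Sigma_k^{2} V_k^{T}

% A_k^T A_k is the best approximation of rank at most k to A^T A in the Frobenius norm:
\left\| A^{T} A - A_k^{T} A_k \right\|_F
  = \min_{\operatorname{rank}(G) \le k} \left\| A^{T} A - G \right\|_F
  = \sqrt{\sum_{i > k} \sigma_i^{4}}
```

Under these assumptions, no other representation of rank at most k reproduces the full set of document-document inner products with a smaller squared error.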
|