The vector space model
procedure can be divided into three stages. The first stage is document
indexing, where content-bearing terms are extracted from the document text.
The second stage is the weighting of the indexed terms to enhance retrieval
of documents relevant to the user. The last stage ranks the documents with
respect to the query according to a similarity measure.
The vector space model has been criticized for being ad hoc. For a more theoretical analysis of the vector space model see .
Document Indexing
It is obvious that many of the words in a document, such as "the" and "is", do not describe its content. Automatic document indexing removes these non-significant words (function words) from the document vector, so that the document is represented only by content-bearing words. This indexing can be based on term frequency, where terms that have either very high or very low frequency within a document are considered to be function words [18,4,11]. In practice, term frequency has been difficult to implement in automatic indexing. Instead, a stop list holding common words is used to remove high-frequency words (stop words) [11,4], which makes the indexing method language dependent. In general, 40-50% of the total number of words in a document are removed with the help of a stop list.
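The stop-list approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list shown is a tiny hypothetical one (real lists hold several hundred words), the tokenizer simply keeps runs of ASCII letters, and the function name `index_document` is invented for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list used only for illustration.
STOP_WORDS = frozenset({"the", "is", "a", "an", "of", "and", "in", "to", "are", "by"})


def index_document(text, stop_words=STOP_WORDS):
    """Return term counts for the tokens that are not in the stop list."""
    # Lowercase the text and keep runs of ASCII letters as tokens (an assumption).
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)


doc = "The ranking of the documents is determined by the similarity measure"
print(index_document(doc))
```

Because the stop list is a fixed set of words from one language, the sketch also shows why this kind of indexing is language dependent: a different language needs a different list.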
Non-linguistic methods for indexing have
also been implemented. Probabilistic indexing is based on the assumption
that there is some statistical difference between the distributions of
content-bearing words and function words.
Probabilistic indexing ranks the terms in the collection with respect to their
frequency in the whole collection. The function words are modeled by a
Poisson distribution over all documents, since content-bearing terms cannot
be modeled in this way. The Poisson model has been extended to a Bernoulli model.
Recently, an automatic indexing
method that uses serial clustering of words in text has been introduced.
The degree of such clustering
indicates whether a word is content bearing.
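One simple way to see the statistical difference that probabilistic indexing relies on is to compare the number of documents a term actually occurs in with the number a Poisson model would predict. The sketch below is an assumption-laden illustration, not the procedure of the cited work: the function names and the counts are invented, and the Poisson rate is taken to be the term's total number of occurrences divided by the number of documents.

```python
import math


def expected_document_frequency(collection_frequency, num_documents):
    """Documents expected to contain the term if its occurrences were Poisson distributed."""
    rate = collection_frequency / num_documents  # mean occurrences per document
    # Under a Poisson model, P(at least one occurrence) = 1 - exp(-rate).
    return num_documents * (1.0 - math.exp(-rate))


def poisson_overestimate(collection_frequency, document_frequency, num_documents):
    """Ratio of predicted to observed document frequency."""
    expected = expected_document_frequency(collection_frequency, num_documents)
    return expected / document_frequency


# Invented counts for a hypothetical collection of 1000 documents.
N = 1000
print(poisson_overestimate(500, 390, N))  # term spread evenly over the collection
print(poisson_overestimate(500, 60, N))   # term concentrated in a few documents
```

A ratio close to 1 means the term behaves as the Poisson model predicts, which is the behavior attributed to function words. A ratio well above 1 means the occurrences are concentrated in fewer documents than predicted, which is the kind of deviation expected from content-bearing terms.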
Term Weighting
Term weighting has been explained in terms of controlling the exhaustivity and specificity of the search, where exhaustivity is related to recall and specificity to precision. Term weighting for the vector space model has been based entirely on single-term statistics. There are three main factors in term weighting: the term frequency factor, the collection frequency factor, and the length normalization factor. These three factors are multiplied together to give the resulting term weight.
A common weighting scheme for terms within a document is to use the frequency of occurrence, as stated by Luhn and mentioned in the previous section. The term frequency is somewhat descriptive of the content of a document and is generally used as the basis of a weighted document vector. It is also possible to use a binary document vector, but with the vector space model the results have not been as good as those obtained with term frequency.
Various weighting schemes are used to discriminate one document from another. In general this factor is called the collection frequency factor. Most of them, e.g. the inverse document frequency, assume that the importance of a term is inversely proportional to the number of documents the term appears in. Experimentally it has been shown that these document discrimination factors lead to more effective retrieval, i.e., an improvement in precision and recall.
The third possible weighting factor is a document length normalization factor. Long documents usually have a much larger term set than short documents, which makes long documents more likely to be retrieved than short documents.
Different weighting schemes have been investigated,
and the best results, with respect to recall and precision, are obtained by using
term frequency with inverse document frequency and length normalization.
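The combination of the three factors can be sketched as follows. This is a minimal illustration rather than a definitive scheme: it assumes the raw count as the term frequency factor, log(N / n) as the inverse document frequency (N documents in the collection, n of them containing the term), and division by the Euclidean length of the vector as the length normalization. The function name `tfidf_vectors` and the toy documents are invented for the example.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Weight each term by tf * idf, then scale every vector to unit length."""
    n_docs = len(documents)
    term_counts = [Counter(doc) for doc in documents]
    # Document frequency: number of documents each term appears in.
    doc_freq = Counter(term for counts in term_counts for term in counts)
    vectors = []
    for counts in term_counts:
        # Term frequency factor times collection frequency factor (assumed log form).
        weights = {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in counts.items()}
        # Length normalization factor: divide by the Euclidean norm of the vector.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors


# Toy collection of already indexed documents (lists of content-bearing terms).
docs = [
    ["vector", "space", "model", "vector"],
    ["probabilistic", "model"],
    ["vector", "query", "ranking"],
]
for vec in tfidf_vectors(docs):
    print(vec)
```

With this choice of inverse document frequency, a term that occurs in every document receives weight zero, and after normalization a long document no longer gains an advantage simply from having more terms.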
Similarity Coefficients
The similarity in vector space models is determined by using associative coefficients based on the inner product of the document vector and query vector, where word overlap indicates similarity. The inner product is usually normalized. The most popular similarity measure is the cosine coefficient, which is the cosine of the angle between a document vector and the query vector. Other measures include the Jaccard and Dice coefficients.
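The three coefficients differ only in how the inner product is normalized, which a short sketch makes concrete. It assumes sparse vectors stored as dictionaries mapping terms to weights and uses the weighted-vector forms of the Dice and Jaccard coefficients; the function names and the example vectors are hypothetical.

```python
import math


def inner_product(u, v):
    """Sum of weight products over the terms the two vectors share."""
    return sum(w * v[t] for t, w in u.items() if t in v)


def cosine(u, v):
    """Inner product divided by the product of the vector lengths."""
    denom = math.sqrt(inner_product(u, u) * inner_product(v, v))
    return inner_product(u, v) / denom if denom else 0.0


def dice(u, v):
    """Twice the inner product divided by the sum of the squared lengths."""
    denom = inner_product(u, u) + inner_product(v, v)
    return 2 * inner_product(u, v) / denom if denom else 0.0


def jaccard(u, v):
    """Inner product divided by the squared lengths minus the inner product."""
    dot = inner_product(u, v)
    denom = inner_product(u, u) + inner_product(v, v) - dot
    return dot / denom if denom else 0.0


# Hypothetical weighted document and query vectors.
document = {"vector": 2.0, "space": 1.0, "model": 1.0}
query = {"vector": 1.0, "model": 1.0}
print(cosine(document, query), dice(document, query), jaccard(document, query))
```

Only terms shared by the document and the query contribute to the inner product, which is the sense in which word overlap indicates similarity. Ranking a collection then amounts to sorting the documents by the chosen coefficient against the query.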