Unsupervised Document Classification

Document clustering, or unsupervised document classification, has been used to enhance information retrieval. This is based on the clustering hypothesis, which states that documents with similar contents are also relevant to the same query [11]. A fixed collection of text is clustered into groups, or clusters, of documents with similar contents. The similarity between documents is usually measured with the associative coefficients from the vector space model, e.g., the cosine coefficient. Hierarchical clustering algorithms have primarily been used in document clustering. The single link method has mostly been used because it is computationally feasible, while the complete link method appears to be the most effective but is computationally demanding [37].
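
As a minimal sketch of this approach, the snippet below builds term-weighted document vectors, measures similarity with the cosine coefficient, and applies single link (or, by changing one parameter, complete link) agglomerative clustering. It assumes a recent version of scikit-learn and a toy document set; it is not the implementation used in the cited work.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

# Toy collection: two retrieval documents and two neural-network documents.
docs = [
    "information retrieval with boolean queries",
    "vector space model for document retrieval",
    "neural networks for pattern recognition",
    "training neural network classifiers",
]

# Term-weighted document vectors from the vector space model.
vectors = TfidfVectorizer().fit_transform(docs).toarray()

# Single-link agglomerative clustering with the cosine distance
# (1 - cosine coefficient); linkage="complete" gives the complete
# link method instead.
clustering = AgglomerativeClustering(n_clusters=2, metric="cosine", linkage="single")
labels = clustering.fit_predict(vectors)
print(labels)  # e.g. [0 0 1 1]
```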

Methods other than document vector similarity have also been used for clustering. For example, neural models have been implemented for unsupervised document clustering [38].

Long computation time has always been the main obstacle to using document clustering on-line. More recently, fast clustering algorithms have been introduced for browsing through a collection when the user has little information about it and wants to browse for topics [39]. Suffix Tree Clustering is a new clustering method that creates clusters based on phrases shared between documents; it is fast and intended for Web document clustering [40]. Different projection techniques, LSI and truncation, have been investigated to speed up the distance calculations of clustering [41]. An interesting application of clustering is topic clustering, i.e., clustering the documents returned by a specific query using $k$-means clustering [42]; a small sketch of this idea follows below.
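
The sketch below illustrates topic clustering: the documents returned by a query are grouped into topics with $k$-means. scikit-learn and the toy result set are assumptions made for illustration, not details of the cited system.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Documents returned by a hypothetical query, e.g. "jaguar".
results = [
    "jaguar is a large cat found in south america",
    "the jaguar preys on deer and capybara",
    "the new jaguar sports car was unveiled this year",
    "jaguar cars reported strong sales figures",
]

vectors = TfidfVectorizer().fit_transform(results)

# k-means with k = 2 topics; each returned document gets a topic label.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
print(kmeans.labels_)  # e.g. [0 0 1 1]: animal vs. car documents
```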

The effectiveness of five hierarchical clustering algorithms has been examined: single link, complete link, group average, Ward's method, and weighted average [43]. Single link is the only one that compared badly to the others, but the results depend strongly on the data set.
 

Supervised Document Classification

Pattern recognition and machine learning have also been applied to document classification. As before, term frequencies are used as features. A number of classifiers have been used to classify documents. Examples of these classifiers are neural networks [44,45], support vector machines [46], genetic programming [47], Kohonen type self-organizing maps [48], hierarchically organized neural networks built up from a number of independent self-organizing maps [49], fuzzy $c$-means [50], hierarchical Bayesian clustering [51], Bayesian network classifiers [52], and the naive Bayes classifier [53].
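
As a minimal sketch of supervised document classification with term frequency features, the snippet below trains a naive Bayes classifier, one of the classifiers listed above. The labels, toy training set, and use of scikit-learn are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = [
    "stocks fell sharply on the exchange",
    "the central bank raised interest rates",
    "the team won the championship game",
    "the striker scored twice in the final",
]
train_labels = ["finance", "finance", "sports", "sports"]

# Term frequencies as features.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

classifier = MultinomialNB().fit(X_train, train_labels)

X_new = vectorizer.transform(["interest rates and stock prices"])
print(classifier.predict(X_new))  # e.g. ['finance']
```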

Some of these classifiers can be used with unsupervised learning, i.e., unlabeled documents, but the accuracy of a classifier can be enhanced by using a small set of labeled documents [53]. The aim is to use a classifier that needs only a small number of manually classified documents in order to generalize.

The use of semi-supervised machine learning has emerged recently [53,50]. The learning scheme lies somewhere between supervised and unsupervised learning: the class information is learned from the labeled data and the structure of the data from the unlabeled data.
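
One possible realisation of this idea is sketched below: a naive Bayes classifier is trained from a small labeled set and a larger unlabeled set via self-training. scikit-learn's SelfTrainingClassifier is used here purely for illustration; it is not the exact scheme of the cited work.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.semi_supervised import SelfTrainingClassifier

docs = [
    "stocks fell sharply on the exchange",         # labeled: finance (0)
    "the striker scored twice in the final",       # labeled: sports (1)
    "the central bank raised interest rates",      # unlabeled
    "the team won the championship game",          # unlabeled
    "bond markets rallied after the announcement", # unlabeled
]
# -1 marks unlabeled documents; only two documents are manually labeled.
labels = np.array([0, 1, -1, -1, -1])

X = CountVectorizer().fit_transform(docs)

# Self-training: the classifier pseudo-labels the unlabeled documents it
# is confident about and retrains on them.
model = SelfTrainingClassifier(MultinomialNB(), threshold=0.6)
model.fit(X, labels)
print(model.predict(X))
```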

The performance of four document classification methods has been measured: the naive Bayes classifier, the nearest neighbour classifier, decision trees, and a subspace method [54]. The naive Bayes classifier and the subspace method outperformed the others.