Stemming via distribution-based word segregation for classification and retrieval

Bhamidipati, N. L.; Pal, S. K. (2007). Stemming via distribution-based word segregation for classification and retrieval. IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, 37 (2), pp. 350-360. ISSN 1083-4419



A novel corpus-based method for stemmer refinement, which can improve both classification and retrieval, is described. The method models the given words as generated from a multinomial distribution over the topics available in the corpus, and includes a procedure like sequential hypothesis testing that groups together distributionally similar words. The system can refine any stemmer, and its strength can be controlled with parameters that set how much tolerance is allowed when computing the similarity between the distributions of two words. Although obtaining the morphological roots of the given words is not the primary objective, the algorithm does so automatically to some extent. Despite a huge reduction in dictionary size, classification accuracies are seen to improve significantly when the proposed system is applied to some existing stemmers for classifying the 20 Newsgroups and WebKB data. The refinements obtained are also suitable for cross-corpus stemming. Regarding retrieval, the method's superiority is extensively demonstrated with respect to four existing methods.
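The grouping idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the paper's sequential-hypothesis-testing procedure is replaced here by a plain total variation distance threshold, and the choice to regroup words only within the classes produced by a base stemmer is an assumption. The names `refine_stem_classes`, `tolerance`, `STEM_OF`, and `TOPIC_COUNTS`, as well as the toy counts, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def topic_distribution(counts):
    """Normalize per-topic occurrence counts into a multinomial estimate."""
    total = sum(counts)
    if total == 0:
        raise ValueError("word has no occurrences in any topic")
    return [c / total for c in counts]


def total_variation(p, q):
    """Total variation distance between two topic distributions (0 to 1)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def refine_stem_classes(stem_of, topic_counts, tolerance=0.2):
    """Split each base-stemmer class into groups of distributionally similar words.

    stem_of:      word -> stem assigned by some existing stemmer (assumed input)
    topic_counts: word -> list of occurrence counts, one entry per topic
    tolerance:    largest distance at which a word still joins a group
                  (a stand-in for the paper's tolerance parameters)
    """
    by_stem = defaultdict(list)
    for word in sorted(topic_counts):
        by_stem[stem_of[word]].append(word)

    groups = []
    for stem in sorted(by_stem):
        stem_groups = []  # each entry holds the member words and pooled counts
        for word in by_stem[stem]:
            p = topic_distribution(topic_counts[word])
            for group in stem_groups:
                q = topic_distribution(group["counts"])
                if total_variation(p, q) <= tolerance:
                    group["words"].append(word)
                    group["counts"] = [
                        a + b for a, b in zip(group["counts"], topic_counts[word])
                    ]
                    break
            else:
                # No existing group is close enough, so start a new one.
                stem_groups.append(
                    {"words": [word], "counts": list(topic_counts[word])}
                )
        groups.extend(group["words"] for group in stem_groups)
    return groups


# Toy data (hypothetical): four words that a base stemmer maps to one stem,
# with made-up occurrence counts over three topics.
STEM_OF = {w: "gener" for w in ("general", "generally", "generate", "generated")}
TOPIC_COUNTS = {
    "general": [30, 30, 40],
    "generally": [28, 32, 40],
    "generate": [5, 90, 5],
    "generated": [10, 85, 5],
}

refined = refine_stem_classes(STEM_OF, TOPIC_COUNTS, tolerance=0.2)
```

With these toy counts, the single stem class splits into one group for "general" and "generally" and another for "generate" and "generated". Raising `tolerance` toward 1 merges everything back into the base stemmer's class, while lowering it toward 0 leaves every word on its own, which mirrors the abstract's point that the strength of the refinement is controllable.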

Item Type: Article
Source: Copyright of this article belongs to IEEE.
ID Code: 77703
Deposited On: 14 Jan 2012 06:12
Last Modified: 14 Jan 2012 06:12
