HPS: High Precision Stemmer

Authors: Tomáš Brychcín and Miloslav Konopík

HPS is a multilingual stemming tool based on unsupervised training from unlabeled corpora. More technical details can be found in our paper, published in the journal Information Processing & Management (see the Citation section).
Our Java implementation can be found here.

Training

Training a new model with the default parameter settings:
Stemmer stemmer = StemmerBuilder.train("train.txt");
StemmerBuilder.save(stemmer, "model.bin");
Training a new model with custom parameter settings:
//minimum lexical similarity between words with the same stem
float minSimilarity = 0.7f;
//maximum length of suffix that can be stripped off
int maxSuffixLength = 3;
//minimum word occurrence for MMI clustering
int minWordOccurrence = 10;
//minimum word pair occurrence for MMI clustering
int minWordPairOccurrence = 2;

//unsupervised training of new stemming model
Stemmer stemmer = StemmerBuilder.train("train.txt", minWordOccurrence, minWordPairOccurrence, minSimilarity, maxSuffixLength);
//save model
StemmerBuilder.save(stemmer, "model.bin");

Corpus

HPS requires text that is already tokenized (one sentence per line, with word tokens separated by whitespace) and encoded in UTF-8. For example:
Svět se rozloučil s rokem 1998 více méně tradičně . 
Bezpočet přítomných sledoval podle agentury AP s nadšením vzlet 15000 puštěných balónků . 
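
If the raw text is not tokenized yet, a preprocessing step along the following lines may help. This is only a minimal sketch and not part of HPS: the class CorpusLineTokenizer, its methods and the input file name are hypothetical. It assumes the input already has one sentence per line, and it simply separates every punctuation mark or symbol from the adjacent words. A proper tokenizer for a given language will handle abbreviations, decimal numbers and similar cases better.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of HPS): prepares one-sentence-per-line text for training. */
final class CorpusLineTokenizer {

    // A token is either a run of letters, combining marks and digits,
    // or a single non-whitespace symbol such as '.' or ','.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{M}\\p{N}]+|[^\\s\\p{L}\\p{M}\\p{N}]");

    /** Returns the tokens of one sentence joined by single spaces. */
    static String tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(sentence);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return String.join(" ", tokens);
    }

    /** Reads a UTF-8 file line by line and writes the tokenized lines as UTF-8. */
    static void convert(Path rawFile, Path corpusFile) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(rawFile, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(corpusFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String tokenized = tokenize(line);
                if (!tokenized.isEmpty()) { // skip blank lines
                    out.write(tokenized);
                    out.newLine();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: CorpusLineTokenizer <raw file> <tokenized file>");
            return;
        }
        convert(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

With this sketch, tokenize("Hello, world!") returns "Hello , world !", and the output file can then be passed to StemmerBuilder.train in place of "train.txt".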

Stemming

//maximum length of suffix that can be stripped off
int maxSuffixLength = 3;
//load model
Stemmer stemmer = StemmerBuilder.loadStemmer("model.bin", maxSuffixLength);
//stemming
String stem = stemmer.getClass("stolem");
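
The call above stems a single word. To stem whole tokenized lines, a thin wrapper such as the following could be used. It is a sketch under stated assumptions, not part of HPS: the class LineStemmer is a hypothetical name, and the stemming function is passed in as a java.util.function.UnaryOperator so that the example runs without a trained model. The toy function in main only stands in for a real stemmer.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical wrapper (not part of HPS): stems every token of a tokenized line. */
final class LineStemmer {

    private final UnaryOperator<String> stemFunction;
    // Words repeat often in running text, so each distinct word is stemmed only once.
    private final Map<String, String> cache = new HashMap<>();

    LineStemmer(UnaryOperator<String> stemFunction) {
        this.stemFunction = stemFunction;
    }

    /** Returns the stems of all whitespace-separated tokens, joined by single spaces. */
    String stemLine(String tokenizedLine) {
        String trimmed = tokenizedLine.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        for (String token : trimmed.split("\\s+")) {
            String stem = cache.computeIfAbsent(token, stemFunction);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(stem);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Toy stand-in for a trained model: drops the last character of words longer than three characters.
        UnaryOperator<String> toyStemmer =
                word -> word.length() > 3 ? word.substring(0, word.length() - 1) : word;
        LineStemmer lineStemmer = new LineStemmer(toyStemmer);
        System.out.println(lineStemmer.stemLine("tables and chairs")); // prints "table and chair"
    }
}
```

With a loaded HPS model, the toy function would presumably be replaced by a lambda that delegates to the call shown above, for example word -> stemmer.getClass(word).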

Stemming models

These models were trained on corpora of 5,000,000 tokens and were used in the experiments presented in our manuscript:
Other models that have not been thoroughly tested yet:

Licence

Our implementation of HPS is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Citation

Please cite our article if you use any of the available resources.

@article{Brychcin.Konopik:2015,
	title = "HPS: High precision stemmer",
	journal = "Information Processing \& Management",
	volume = "51",
	number = "1",
	pages = "68--91",
	year = "2015",
	note = "",
	issn = "0306-4573",
	doi = "http://dx.doi.org/10.1016/j.ipm.2014.08.006",
	url = "http://www.sciencedirect.com/science/article/pii/S0306457314000843",
	author = "T. Brychc\'{i}n and M. Konop\'{i}k",
}

Last change: 2014-10-29