I didn't understand exactly what you want. Suppose a document has run(10), running(20), runner(2), runners(8), and assume the stemmer reduces all of those words to "run".

With the non-stemmed field you will see:
running(20) run(10) runners(8) runner(2)
With the stemmed field you will see:
run(40)

Do you want to see run as a top term, but also the original words that formed that term?
run(40) => 20 from running, 10 from run, 8 from runners, 2 from runner

Or do you want to see the most frequent terms that passed through the stem filter verbatim (terms that the stemmer didn't change/modify)?

What do you mean by a "badly stemmed" word?

> hi Ahmet,
>
> thanks. when i look at the non_stemmed_text field to get the top terms, i
> will not be getting the useful feature of aggregating many related words
> into one (which is done by stemming).
>
> for ex: if a document has run(10), running(20), runner(2), runners(8) - i
> would like to see a "top term" to be "run" here. i think with the
> non-stemmed solution, i will see run, running, runner, runners as separate
> top terms so if the term "weather" happens to occur 21 times in the
> document, it will replace any version of "run" as the top term.
>
> of course i could go back to the text field for top terms where i will see
> "run", but some of the terms in the text field will be non-english (stemmed
> beyond english, ex: archiv, perman). so how can i tell if a term i see in
> the text field is a "badly stemmed" word or not?
>
> maybe at this point i could use a dictionary? if a term in the text field is
> not in the dictionary, i would try to find a prefix match from the
> non-stemmed field? or maybe there's a better way?
>
> thanks,
> thushara
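
P.S. In case the "run(40) => 20 from running, ..." breakdown is what you are after, here is a rough client-side sketch in Python of how the grouping could work once you have the non-stemmed counts in hand. It is only an illustration: `toy_stem` and its lookup table are hypothetical stand-ins for whatever stemmer your stemmed field actually uses, and the counts are the ones from the example in this thread.

```python
from collections import Counter, defaultdict

# Non-stemmed term counts, taken from the example in this thread.
term_counts = Counter(
    {"running": 20, "run": 10, "runners": 8, "runner": 2, "weather": 21}
)

# Hypothetical stand-in for a real stemmer: a tiny lookup table that maps
# a surface form to its stem. Replace with the stemmer your index uses.
TOY_STEMS = {"running": "run", "runners": "run", "runner": "run"}


def toy_stem(term):
    """Return the stem for a term, or the term itself if it is unknown."""
    return TOY_STEMS.get(term, term)


def group_by_stem(counts, stem):
    """Sum counts per stem and remember which original words contributed."""
    totals = Counter()
    origins = defaultdict(Counter)
    for term, n in counts.items():
        s = stem(term)
        totals[s] += n
        origins[s][term] += n
    return totals, origins


totals, origins = group_by_stem(term_counts, toy_stem)

# Print each stem with its total and the original words behind it.
for s, total in totals.most_common():
    parts = ", ".join(f"{n} from {t}" for t, n in origins[s].most_common())
    print(f"{s}({total}) => {parts}")
```

With these counts "run" comes out on top at 40, ahead of "weather" at 21, and each stem is still shown with the readable original words it came from. That would also let you display the most frequent original word (here "running") in place of an odd-looking stem.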