I don't quite understand what you want.

If a document has run(10), running(20), runner(2), runners(8)
(assuming the stemmer reduces all of those words to run),
with the non-stemmed field you will see:
running(20)
run(10)
runners(8)
runner(2)

With the stemmed field you will see:
run(40)

Do you want to see run as a top term, but also the original words
that formed that term?
run(40) => 20 from running, 10 from run, 8 from runners, 2 from runner
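
If that is the goal, the idea can be sketched outside of Solr like this. It is only an illustration: toy_stem and TOY_STEMS are hypothetical stand-ins for the real stemmer, and the counts are the ones from the example above. It sums the non-stemmed counts per stem and remembers which original words contributed.

```python
from collections import Counter, defaultdict

# Term counts from the non-stemmed field (numbers from the example above).
surface_counts = Counter({"running": 20, "run": 10, "runners": 8, "runner": 2})

# Hypothetical stand-in for the real stemmer: assume all four forms map to "run".
TOY_STEMS = {"running": "run", "run": "run", "runners": "run", "runner": "run"}


def toy_stem(term):
    """Return the assumed stem of a term; unknown terms are left unchanged."""
    return TOY_STEMS.get(term, term)


def group_by_stem(counts, stem):
    """Sum counts per stem and keep the per-word breakdown for each stem."""
    totals = Counter()
    breakdown = defaultdict(Counter)
    for term, n in counts.items():
        s = stem(term)
        totals[s] += n
        breakdown[s][term] += n
    return totals, breakdown


totals, breakdown = group_by_stem(surface_counts, toy_stem)
for s, total in totals.most_common():
    parts = ", ".join(f"{n} from {t}" for t, n in breakdown[s].most_common())
    print(f"{s}({total}) => {parts}")
# prints: run(40) => 20 from running, 10 from run, 8 from runners, 2 from runner
```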

Or do you want to see most frequent terms that passed through stem filter 
verbatim? (terms that stemmer didn't change/modify)
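
For this second reading, a rough sketch would keep only the terms whose stemmed form equals the original word. Again, toy_stem and TOY_STEMS are hypothetical stand-ins for the real stemmer, and "weather" with 21 occurrences is borrowed from your example below.

```python
from collections import Counter

# Counts from the non-stemmed field; "weather" (21) comes from the quoted example.
surface_counts = Counter(
    {"weather": 21, "running": 20, "run": 10, "runners": 8, "runner": 2}
)

# Hypothetical stand-in for the real stemmer: only these forms are assumed to change.
TOY_STEMS = {"running": "run", "runners": "run", "runner": "run"}


def toy_stem(term):
    """Return the assumed stem of a term; unknown terms are left unchanged."""
    return TOY_STEMS.get(term, term)


# Keep only terms the stemmer leaves untouched (stem == original word).
unchanged = Counter(
    {term: n for term, n in surface_counts.items() if toy_stem(term) == term}
)
print(unchanged.most_common())
# prints: [('weather', 21), ('run', 10)]
```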

What do you mean by a "badly stemmed" word?


> hi Ahmet,
> 
> thanks. when i look at the non_stemmed_text field to get the top terms, i
> will not be getting the useful feature of aggregating many related words
> into one (which is done by stemming).
> 
> for ex: if a document has run(10), running(20), runner(2), runners(8) - i
> would like to see a "top term" to be "run" here. i think with the
> non-stemmed solution, i will see run, running, runner, runners as separate
> top terms so if the term "weather" happens to occur 21 times in the
> document, it will replace any version of "run" as the top term.
> 
> of course i could go back to the text field for top terms where i will see
> "run", but some of the terms in the text field will be non-english (stemmed
> beyond english, ex: archiv, perman). so how can i tell if a term i see in
> the text field is a "badly stemmed" word or not?
> 
> maybe at this point i could use a dictionary? if a term in the text field is
> not in the dictionary, i would try to find a prefix match from the
> non-stemmed field? or maybe there's a better way?
> 
> thanks,
> thushara


      
