It seems like what's desired is not so much a stemmer as what you might call a "canonicalizer", which would translate each source word not into its "stem" but into its "most canonical form". Critically, the latter, by definition, is always a legitimate word, e.g. "run". What's more, it's always the "most appropriate word" or "most general word", or some such.
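To make the idea concrete, here is a minimal sketch of such a canonicalizer, assuming a small hand-built lookup table in place of a real dictionary. The names `CANONICAL`, `canonicalize`, and `top_terms` are hypothetical and not part of Solr or Lucene; the counts are the ones from the example in the quoted message below.

```python
from collections import Counter

# Hypothetical, hand-built lookup table: surface form -> canonical word.
# A real system would need a far larger dictionary, and some words
# would still be ambiguous.
CANONICAL = {
    "running": "run",
    "runner": "run",
    "runners": "run",
}


def canonicalize(word):
    """Return the canonical form of a word, or the word itself if unknown."""
    w = word.lower()
    return CANONICAL.get(w, w)


def top_terms(term_counts, n=10):
    """Merge counts of words sharing a canonical form; return the n most frequent."""
    merged = Counter()
    for word, count in term_counts.items():
        merged[canonicalize(word)] += count
    return merged.most_common(n)


# Non-stemmed term counts from the example in the thread.
counts = {"run": 10, "running": 20, "runner": 2, "runners": 8, "weather": 21}
print(top_terms(counts))  # [('run', 40), ('weather', 21)]
```

With the variants merged, "run" (40) outranks "weather" (21), and every reported term is still a legitimate word, unlike stems such as "archiv".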
I'm not sure you could implement this except through a massive dictionary. And you'd have trouble because some words would probably be ambiguous between whether they should canonicalize this way or that.

On Fri, Jan 23, 2009 at 11:53 AM, Thushara Wijeratna <thu...@gmail.com> wrote:

> hi Ahmet,
>
> thanks. when i look at the non_stemmed_text field to get the top terms, i
> will not be getting the useful feature of aggregating many related words
> into one (which is done by stemming).
>
> for ex: if a document has run(10), running(20), runner(2), runners(8) - i
> would like to see a a "top term" to be "run" here. i think with the
> non-stemmed solution, i will see run, running, runner, runners as separate
> top terms so if the term "weather" happens to occur 21 times in the
> document, it will replace any version of "run" as the top term.
>
> of course i could go back to the text field for top terms where i will see
> "run", but some of the terms in the text field will be non-english (stemmed
> beyond english, ex: archiv, perman). so how can i tell if a term i see in
> the text field is a "badly stemmed" word or not?
>
> maybe at this point i could use a dictionary? if a term in the text field is
> not in the dictionary, i would try to find a prefix match from the
> non-stemmed field? or maybe there's a better way?
>
> thanks,
> thushara
>
> On Fri, Jan 23, 2009 at 11:37 AM, AHMET ARSLAN <iori...@yahoo.com> wrote:
>
> > I think best way to get non-stemmed top terms is to index the field using a
> > fieldType that does not employes any stem filter. For example:
> >
> > <fieldType name="non_stemmed_text" class="solr.TextField">
> >   <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
> > </fieldType>
> >
> > By using copyField you can store two (or more) versions of a field. Stemmed
> > and non-stemmed.
> >
> > Just a new field:
> > <field name="text" type="non_stemmed_text" indexed="true" stored="true" />
> >
> > And a copy field:
> > <copyField source="your_original_field" dest="text" />
> >
> > Schema Browser (Field: text) will give you top terms.
> >
> > > Is it possible to retrieve the original words once solr (Porter algorithm)
> > > stems them?
> > > I need to index a bunch of data, store it in solr, and get back a list of
> > > most frequent terms out of solr. and i want to see the non-stemmed version
> > > of this data.
> > >
> > > so basically, i want to enhance this:
> > > http://localhost:8983/solr/admin/schema.jsp to see the "top terms" in
> > > non-stemmed form.
> > >
> > > thanks,
> > > thushara