It seems like what's desired is not so much a stemmer as what you might call
a "canonicalizer", which would translate each source word not into its
"stem" but into its "most canonical form". Critically, the latter, by
definition, is always a legitimate word, e.g. "run". What's more, it's
always the "most appropriate word" or "most general word", or some such.

I'm not sure you could implement this except through a massive dictionary.
Even then you'd have trouble, because some words would probably be
ambiguous as to which form they should canonicalize to.
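
To make the dictionary idea concrete, here is a rough sketch in Python. It is not anything Solr provides: the CANONICAL table and the canonicalize and top_terms functions are names made up for illustration, and a real table would be far larger. It uses the term counts from the quoted message below and sums them under the canonical word.

```python
from collections import Counter

# Hypothetical hand-built dictionary: surface form -> canonical word.
# Mapping "runner" to "run" is a judgment call; this is exactly the kind
# of ambiguous entry a real dictionary would have to decide on.
CANONICAL = {
    "running": "run",
    "runner": "run",
    "runners": "run",
}


def canonicalize(word):
    """Return the canonical form, or the lowercased word itself if unknown."""
    w = word.lower()
    return CANONICAL.get(w, w)


def top_terms(term_counts, n=10):
    """Sum raw (non-stemmed) term counts under each canonical word."""
    totals = Counter()
    for term, count in term_counts.items():
        totals[canonicalize(term)] += count
    return totals.most_common(n)


counts = {"run": 10, "running": 20, "runner": 2, "runners": 8, "weather": 21}
print(top_terms(counts, 2))  # [('run', 40), ('weather', 21)]
```

Because every output is either a dictionary value or the original word, the result is always a real word, unlike a Porter stem such as "archiv".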

On Fri, Jan 23, 2009 at 11:53 AM, Thushara Wijeratna <thu...@gmail.com> wrote:

> hi Ahmet,
>
> thanks. when i look at the non_stemmed_text field to get the top terms, i
> will not be getting the useful feature of aggregating many related words
> into one (which is done by stemming).
>
> for ex: if a document has run(10), running(20), runner(2), runners(8) - i
> would like to see the "top term" be "run" here. i think with the
> non-stemmed solution, i will see run, running, runner, runners as separate
> top terms so if the term "weather" happens to occur 21 times in the
> document, it will replace any version of "run" as the top term.
>
> of course i could go back to the text field for top terms where i will see
> "run", but some of the terms in the text field will be non-english (stemmed
> beyond english, ex: archiv, perman). so how can i tell if a term i see in
> the text field is a "badly stemmed" word or not?
>
> maybe at this point i could use a dictionary? if a term in the text field
> is not in the dictionary, i would try to find a prefix match from the
> non-stemmed field? or maybe there's a better way?
>
> thanks,
> thushara
>
> On Fri, Jan 23, 2009 at 11:37 AM, AHMET ARSLAN <iori...@yahoo.com> wrote:
>
> > I think the best way to get non-stemmed top terms is to index the field
> > using a fieldType that does not employ any stem filter. For example:
> >
> > <fieldType name="non_stemmed_text" class="solr.TextField">
> >   <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
> > </fieldType>
> >
> > By using copyField you can store two (or more) versions of a field:
> > stemmed and non-stemmed.
> >
> > Just add a new field:
> > <field name="text" type="non_stemmed_text" indexed="true" stored="true" />
> >
> > And a copy field:
> > <copyField source="your_original_field" dest="text" />
> >
> > Schema Browser (Field: text) will give you top terms.
> >
> > > Is it possible to retrieve the original words once solr (Porter
> > > algorithm) stems them?
> > > I need to index a bunch of data, store it in solr, and get back a list
> > > of most frequent terms out of solr. and i want to see the non-stemmed
> > > version of this data.
> > >
> > > so basically, i want to enhance this:
> > > http://localhost:8983/solr/admin/schema.jsp to see the "top terms" in
> > > non-stemmed form.
> > >
> > > thanks,
> > > thushara
