RE: Stripping Punctuation in a fieldType

David Seltzer Fri, 15 Jan 2010 11:08:15 -0800

Hi Erick,

Thanks for your thoughtful reply!

> It's actually quite rare for simple tokenizers like these to be
> satisfactory unless it's a field you can guarantee is indexed/searched
> in a very controlled manner, say part numbers or words from a list. In
> your example above, none of the three variants would get a hit if the
> user searched for "nation". Is that what you want?

Yes, this is what I want. The reason for this behavior is that the
output of SOLR needs to closely match the search results provided by a
different legacy system. Our users have rigidly defined queries. A user
who was interested in "nation's" is required either to search for
"nations" or "nation*".

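For what it's worth, the target behavior described above can be sketched
in a few lines of Python. This is only a simulation of the intended
analysis chain (delete punctuation, split on whitespace, lowercase), not
Solr or Lucene code. The function name analyze and the use of ASCII
string.punctuation as "the punctuation set" are assumptions made for
illustration.

```python
import string

# Assumption: "punctuation" means ASCII punctuation. A real setup would list
# whichever characters the legacy system ignores.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def analyze(text):
    """Delete punctuation, split on whitespace, lowercase each token."""
    return [token.lower() for token in text.translate(_STRIP_PUNCTUATION).split()]


indexed = analyze("Nation's nations nations, Obama.")
print(indexed)  # ['nations', 'nations', 'nations', 'obama']

# An exact query for "nation" misses, while the prefix query "nation*" hits.
print("nation" in indexed)                               # False
print(any(tok.startswith("nation") for tok in indexed))  # True
```

Deleting punctuation outright also turns "3.14" into "314" and "e-mail"
into "email", which may or may not match what the legacy system does.
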
> But no, Standard* don't have any stemming built in. And
> what do you mean by "language specific functionality"?
> They do NOT fold accents for instance if that's what
> you're getting at.

I asked because I'm not confident I understand what's going on under
the hood inside these tokenizers. Do they work the same on
right-to-left languages (such as Arabic) as they do on left-to-right
languages? (My assumption regarding the WhitespaceTokenizer is that it
would be very language/direction neutral.)

> Could you explain a bit about *why* you want this behavior?

In short, we have to support multiple languages and match the behavior of
an existing non-Solr system.

-Dave

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Friday, January 15, 2010 1:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Stripping Punctuation in a fieldType

If you haven't seen it, this page is invaluable for this kind of
question:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LetterTokenizerFactory

LetterTokenizerFactory may well be your friend here, followed by
LowerCaseFilterFactory. One problem is that it would
split "nation's" up into "nation" and "s", so searching on "nations"
wouldn't get a hit.
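
A rough Python sketch of that splitting behavior, in case it helps. It
only approximates a letter tokenizer followed by lowercasing (a token is
a maximal run of letters, using Python's str.isalpha as the letter test),
and it is not the Lucene implementation. The helper name letter_tokenize
is made up for this example.

```python
from itertools import groupby


def letter_tokenize(text):
    """Split text into maximal runs of letters and lowercase each run."""
    return [
        "".join(run).lower()
        for is_letter, run in groupby(text, key=str.isalpha)
        if is_letter
    ]


print(letter_tokenize("nation's"))  # ['nation', 's']
print(letter_tokenize("Obama."))    # ['obama']

# The indexed tokens for "nation's" do not include "nations".
print("nations" in letter_tokenize("nation's"))  # False
```
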

But you have equally ugly stuff with WhitespaceTokenizerFactory
as you're finding out.

It's actually quite rare for simple tokenizers like these to be
satisfactory unless it's a field you can guarantee is indexed/searched
in a very controlled manner, say part numbers or words from a list. In
your example above, none of the three variants would get a hit if the
user searched for "nation". Is that what you want?

But no, Standard* don't have any stemming built in. And
what do you mean by "language specific functionality"?
They do NOT fold accents for instance if that's what
you're getting at.

Could you explain a bit about *why* you want this behavior?

HTH
Erick

On Fri, Jan 15, 2010 at 1:17 PM, David Seltzer <dselt...@tveyes.com>
wrote:

> I'm hesitant to change Tokenizers at the moment because what we have
> is working so nicely - or so I thought.
>
> What I'm looking for is case-insensitive search for words and numbers
> without any of the stemming features turned on. The new requirement is
> that we take punctuation out of the mix.
>
> Right now when I search for "Obama" I'm not getting any hits on
> "Obama."
>
> So I'm basically looking to strip punctuation. The consequence would
> be that "nation's", "nations" and "nations," would all be represented
> the same way.
>
> Would the StandardTokenizerFactory accomplish this?
> Does it have any language specific functionality?
> Does it do anything with stemming?
>
> Thanks for everyone's input!
>
> -Dave
>
>
>
> -----Original Message-----
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: Friday, January 15, 2010 12:42 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Stripping Punctuation in a fieldType
>
> > I'm trying to find the best way to set up a fieldType that
> > strips punctuation.
>
> Use solr.StandardTokenizerFactory, which strips punctuation.
>
> Or, if you do not care about alphanumeric or numeric queries, use
> solr.LowerCaseTokenizerFactory, which uses LetterTokenizer.
>
> > I think the right way to do this is using a CharacterFilter
> > of some type, but I can't seem to find any examples of how
> > to set this up in a schema.xml file.
>
> If you want to use solr.MappingCharFilterFactory you need to write all
> punctuation characters to a text file manually, e.g. "," => ""
>
>
>
>
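
The mapping file mentioned in the quoted message does not have to be
typed by hand. Below is a small Python sketch that prints one rule per
character in the "source" => "target" form shown in that message. Two
assumptions to check against your Solr version: that ASCII
string.punctuation is the character set you want removed, and that the
mapping file parser accepts \uXXXX escapes (used here to sidestep quoting
problems with characters such as " and \). The function name
mapping_lines is made up for this example.

```python
import string


def mapping_lines(chars=string.punctuation):
    """Yield one rule per character that maps it to the empty string."""
    for ch in chars:
        # Writes the character as a \uXXXX escape (assumed to be supported).
        yield '"\\u{:04X}" => ""'.format(ord(ch))


# Print the rules; redirect stdout to a file to produce the mapping file.
for line in mapping_lines():
    print(line)
```

The resulting file would then be named in the mapping attribute of a
solr.MappingCharFilterFactory charFilter inside the fieldType's analyzer.
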
