Re: Stripping Punctuation in a fieldType

Robert Muir Fri, 15 Jan 2010 11:35:29 -0800

Also, if you are really concerned about different languages and can
use Solr 1.5, take a look at Unicode Collation.


You can simply add

<filter class="solr.CollationKeyFilterFactory"
        language=""
        strength="primary"
    />

after your tokenizer to ignore case, accents, and punctuation in a
reasonable way for all languages.

http://wiki.apache.org/solr/UnicodeCollation
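
For context, here is a minimal sketch of where that filter could sit in
schema.xml. The fieldType name and the choice of
WhitespaceTokenizerFactory are assumptions for illustration only; they
are not taken from the config discussed in this thread.

```xml
<!-- Sketch only: the name "text_collated" and the whitespace tokenizer are assumptions -->
<fieldType name="text_collated" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- empty language selects the root locale; primary strength is the loosest comparison level -->
    <filter class="solr.CollationKeyFilterFactory"
            language=""
            strength="primary"
    />
  </analyzer>
</fieldType>
```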

On Fri, Jan 15, 2010 at 2:31 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> Ah, ok, your approach makes sense. Mostly I was trying
> to ensure that you weren't flying blind.
>
> Perhaps you would find some joy with
> PatternReplaceCharFilterFactory, replacing
> all non-alphanum with empty string?
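>
> A rough sketch of that idea, with the assumptions flagged: the
> fieldType name is made up, the pattern deliberately keeps whitespace
> so the tokenizer can still split words, and the attribute names
> should be checked against the factory's documentation for your build.
>
> ```xml
> <!-- Sketch only: "text_nopunct" is a hypothetical name -->
> <fieldType name="text_nopunct" class="solr.TextField">
>   <analyzer>
>     <!-- drop every character that is not a letter, a digit, or whitespace -->
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>                 pattern="[^\p{L}\p{N}\s]"
>                 replacement=""/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> ```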
>
> HTH
> Erick
>
> On Fri, Jan 15, 2010 at 2:07 PM, David Seltzer <dselt...@tveyes.com> wrote:
>
>> Hi Erick,
>>
>> Thanks for your thoughtful reply!
>>
>> > It's actually quite rare for simple tokenizers like these to be
>> > satisfactory unless it's a field you can guarantee is
>> > indexed/searched in a very
>> > controlled manner, say part numbers or words from a list. In your
>> > example above, none of the three variants would get a hit if the
>> > user searched for "nation". Is that what you want?
>>
>> Yes, this is what I want. The reason for this behavior is that the
>> output of SOLR needs to closely match the search results provided by a
>> different legacy system. Our users have rigidly defined queries. A user
>> who was interested in "nation's" is required either to search for
>> "nations" or "nation*".
>>
>> > But no, Standard* don't have any stemming built in. And
>> > what do you mean by "language specific functionality"?
>> > They do NOT fold accents for instance if that's what
>> > you're getting at.
>>
>> I asked that because I'm not confident I know what's going on
>> under the hood inside these tokenizers. Do they work the same on
>> right-to-left languages (such as Arabic) as they do on left-to-right
>> languages? (My assumption regarding the WhitespaceTokenizer is that it
>> would be very language/direction neutral.)
>>
>> > Could you explain a bit about *why* you want this behavior?
>> In short we have to support multiple languages and match the behavior of
>> an existing non-solr system.
>>
>> -Dave
>>
>> -----Original Message-----
>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> Sent: Friday, January 15, 2010 1:42 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Stripping Punctuation in a fieldType
>>
>> If you haven't seen it, this page is invaluable for this kind of
>> question:
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LetterTokenizerFactory
>>
>> LetterTokenizerFactory may well be your friend here, followed by
>> LowerCaseFilterFactory. One problem is that it would
>> split "nation's" up into "nation" and "s", so searching on "nations"
>> wouldn't get a hit.
>>
>> But you have equally ugly stuff with WhitespaceTokenizerFactory,
>> as you're finding out.
>>
>> It's actually quite rare for simple tokenizers like these to be
>> satisfactory unless it's a field you can guarantee is
>> indexed/searched in a very
>> controlled manner, say part numbers or words from a list. In your
>> example above, none of the three variants would get a hit if the
>> user searched for "nation". Is that what you want?
>>
>> But no, Standard* don't have any stemming built in. And
>> what do you mean by "language specific functionality"?
>> They do NOT fold accents for instance if that's what
>> you're getting at.
>>
>> Could you explain a bit about *why* you want this behavior?
>>
>> HTH
>> Erick
>>
>> On Fri, Jan 15, 2010 at 1:17 PM, David Seltzer <dselt...@tveyes.com>
>> wrote:
>>
>> > I'm hesitant to change Tokenizers at the moment because what we have
>> > is working so nicely - or so I thought.
>> >
>> > What I'm looking for is case-insensitive search for words and numbers
>> > without any of the stemming features turned on. The new requirement is
>> > that we take punctuation out of the mix.
>> >
>> > Right now when I search for "Obama" I'm not getting any hits on
>> > "Obama."
>> >
>> > So I'm basically looking to strip punctuation. The consequence would
>> > be that "nation's", "nations" and "nations," would all be
>> > represented the same way.
>> >
>> > Would the StandardTokenizerFactory accomplish this?
>> > Does it have any language specific functionality?
>> > Does it do anything with stemming?
>> >
>> > Thanks for everyone's input!
>> >
>> > -Dave
>> >
>> >
>> >
>> > -----Original Message-----
>> > From: Ahmet Arslan [mailto:iori...@yahoo.com]
>> > Sent: Friday, January 15, 2010 12:42 PM
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: Stripping Punctuation in a fieldType
>> >
>> > > I'm trying to find the best way to set up a fieldType that
>> > > strips punctuation.
>> >
>> > Use solr.StandardTokenizerFactory, which strips punctuation.
>> >
>> > Or, if you do not care about alphanumeric or numeric queries, use
>> > solr.LowerCaseTokenizerFactory, which uses LetterTokenizer.
>> >
>> > > I think the right way to do this is using a CharacterFilter of
>> > > some type, but I can't seem to find any examples of how to set
>> > > this up in a schema.xml file.
>> >
>> > If you want to use solr.MappingCharFilterFactory, you need to list all
>> > punctuation characters in a text file manually, e.g. "," => ""
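>> >
>> > As a hedged illustration of that setup (the mapping file name is
>> > hypothetical and the rule list is only a small sample, not a
>> > complete set of punctuation):
>> >
>> > ```xml
>> > <!-- schema.xml: "mapping-punctuation.txt" is a made-up file name -->
>> > <charFilter class="solr.MappingCharFilterFactory"
>> >             mapping="mapping-punctuation.txt"/>
>> >
>> > <!-- mapping-punctuation.txt would hold one rule per line, for example:
>> >      "," => ""
>> >      "." => ""
>> >      "!" => ""
>> >      "?" => ""
>> > -->
>> > ```
>> >
>> > The charFilter element goes inside the analyzer, ahead of the
>> > tokenizer.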
>> >
>> >
>> >
>> >
>>
>



-- 
Robert Muir
rcm...@gmail.com
