Also, if you are really concerned about different languages and can use Solr 1.5, take a look at Unicode Collation.
You can simply add <filter class="solr.CollationKeyFilterFactory" language="" strength="primary" /> after your tokenizer and ignore case, accents, and punctuation in a reasonable way for all languages.

http://wiki.apache.org/solr/UnicodeCollation

On Fri, Jan 15, 2010 at 2:31 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Ah, ok, your approach makes sense. Mostly I was trying
> to ensure that you weren't flying blind.
>
> Perhaps you would find some joy with
> PatternReplaceCharFilterFactory, replacing
> all non-alphanum with empty string?
>
> HTH
> Erick
>
> On Fri, Jan 15, 2010 at 2:07 PM, David Seltzer <dselt...@tveyes.com> wrote:
>
>> Hi Erick,
>>
>> Thanks for your thoughtful reply!
>>
>> > It's actually quite rare for simple tokenizers like these to be
>> > satisfactory unless it's a field you can guarantee is
>> > indexed/searched in a very controlled manner, say part numbers or
>> > words from a list. In your example above, none of the three
>> > variants would get a hit if the user searched for "nation". Is that
>> > what you want?
>>
>> Yes, this is what I want. The reason for this behavior is that the
>> output of SOLR needs to closely match the search results provided by a
>> different legacy system. Our users have rigidly defined queries. A user
>> who was interested in "nation's" is required either to search for
>> "nations" or "nation*".
>>
>> > But no, Standard* don't have any stemming built in. And
>> > what do you mean by "language specific functionality"?
>> > They do NOT fold accents for instance if that's what
>> > you're getting at.
>>
>> I asked that because I'm not super comfortable I know what's going on
>> under the hood inside these tokenizers. Do they work the same on
>> RightToLeft languages (such as Arabic) as they do in LeftToRight
>> languages? (My assumption regarding the WhitespaceTokenizer is that it
>> would be very language/direction neutral)
>>
>> > Could you explain a bit about *why* you want this behavior?
>>
>> In short we have to support multiple languages and match the behavior of
>> an existing non-solr system.
>>
>> -Dave
>>
>> -----Original Message-----
>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> Sent: Friday, January 15, 2010 1:42 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Stripping Punctuation in a fieldType
>>
>> If you haven't seen it, this page is invaluable for this kind of
>> question:
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LetterTokenizerFactory
>>
>> LetterTokenizerFactory may well be your friend here, followed by
>> LowerCaseFilterFactory. There is a problem that it would
>> split "nation's" up into "nation" and "s", so searching on "nations"
>> wouldn't get a hit.
>>
>> But you have equally ugly stuff with WhitespaceTokenizerFactory
>> as you're finding out.
>>
>> It's actually quite rare for simple tokenizers like these to be
>> satisfactory unless it's a field you can guarantee is indexed/searched
>> in a very controlled manner, say part numbers or words from a list.
>> In your example above, none of the three variants would get a hit if
>> the user searched for "nation". Is that what you want?
>>
>> But no, Standard* don't have any stemming built in. And
>> what do you mean by "language specific functionality"?
>> They do NOT fold accents for instance if that's what
>> you're getting at.
>>
>> Could you explain a bit about *why* you want this behavior?
>>
>> HTH
>> Erick
>>
>> On Fri, Jan 15, 2010 at 1:17 PM, David Seltzer <dselt...@tveyes.com> wrote:
>>
>> > I'm hesitant to change Tokenizers at the moment because what we have
>> > is working so nicely - or so I thought.
>> >
>> > What I'm looking for is case-insensitive search for words and numbers
>> > without any of the stemming features turned on. The new requirement is
>> > that we take punctuation out of the mix.
>> >
>> > Right now when I search for "Obama" I'm not getting any hits on
>> > "Obama."
>> >
>> > So I'm basically looking to strip punctuation. The consequence would
>> > be that "nation's", "nations" and "nations," would all be represented
>> > the same way.
>> >
>> > Would the StandardTokenizerFactory accomplish this?
>> > Does it have any language specific functionality?
>> > Does it do anything with stemming?
>> >
>> > Thanks for everyone's input!
>> >
>> > -Dave
>> >
>> > -----Original Message-----
>> > From: Ahmet Arslan [mailto:iori...@yahoo.com]
>> > Sent: Friday, January 15, 2010 12:42 PM
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: Stripping Punctuation in a fieldType
>> >
>> > > I'm trying to find the best way to set up a fieldType that
>> > > strips punctuation.
>> >
>> > Use solr.StandardTokenizerFactory, which strips punctuation.
>> >
>> > Or if you do not care about alphanumeric or numeric queries use
>> > solr.LowerCaseTokenizerFactory that uses LetterTokenizer.
>> >
>> > > I think the right way to do this is using a CharacterFilter
>> > > of some type, but I can't seem to find any examples of how
>> > > to set this up in a schema.xml file.
>> >
>> > If you want to use solr.MappingCharFilterFactory you need to write all
>> > punctuation characters to a text file manually. e.g. "," => ""

--
Robert Muir
rcm...@gmail.com
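
For reference, the collation suggestion at the top of this message might look roughly like this in schema.xml. This is only a sketch: the fieldType name "text_collated" is made up for illustration, and WhitespaceTokenizerFactory is assumed because it is the tokenizer discussed earlier in the thread. The filter line itself is the one quoted above.

```xml
<!-- Sketch only. "text_collated" is a hypothetical name, and the
     tokenizer choice is an assumption; swap in whatever tokenizer
     the field already uses. -->
<fieldType name="text_collated" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- language="" is the root locale; strength="primary" is the
         loosest comparison level, which is what lets case and
         accent differences collapse -->
    <filter class="solr.CollationKeyFilterFactory" language="" strength="primary"/>
  </analyzer>
</fieldType>
```

With a single <analyzer> element the same chain runs at index and query time, so both sides are reduced to the same collation keys. How this interacts with wildcard queries such as "nation*" is not covered in the thread and would need testing against the legacy system's results.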