Hi Erick,

Thanks for your thoughtful reply!

> It's actually quite rare for simple tokenizers like these to be satisfactory
> unless it's a field you can guarantee is indexed/searched in a very
> controlled manner, say part numbers or words from a list. In your
> example above, none of the three variants would get a hit if the
> user searched for "nation". Is that what you want?

Yes, this is what I want. The reason for this behavior is that the output of Solr needs to closely match the search results provided by a different legacy system. Our users have rigidly defined queries. A user who is interested in "nation's" is required to search for either "nations" or "nation*".

> But no, Standard* don't have any stemming built in. And
> what do you mean by "language specific functionality"?
> They do NOT fold accents for instance if that's what
> you're getting at.

I asked because I'm not confident I know what's going on under the hood inside these tokenizers. Do they work the same on right-to-left languages (such as Arabic) as they do on left-to-right languages? (My assumption regarding the WhitespaceTokenizer is that it would be very language/direction neutral.)

> Could you explain a bit about *why* you want this behavior?

In short, we have to support multiple languages and match the behavior of an existing non-Solr system.

-Dave

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, January 15, 2010 1:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Stripping Punctuation in a fieldType

If you haven't seen it, this page is invaluable for this kind of question:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LetterTokenizerFactory

LetterTokenizerFactory may well be your friend here, followed by LowerCaseFilterFactory. The problem is that it would split "nation's" into "nation" and "s", so searching on "nations" wouldn't get a hit.
But you have equally ugly stuff with WhitespaceTokenizerFactory, as you're finding out.

It's actually quite rare for simple tokenizers like these to be satisfactory unless it's a field you can guarantee is indexed/searched in a very controlled manner, say part numbers or words from a list. In your example above, none of the three variants would get a hit if the user searched for "nation". Is that what you want?

But no, Standard* don't have any stemming built in. And what do you mean by "language specific functionality"? They do NOT fold accents, for instance, if that's what you're getting at.

Could you explain a bit about *why* you want this behavior?

HTH
Erick

On Fri, Jan 15, 2010 at 1:17 PM, David Seltzer <dselt...@tveyes.com> wrote:

> I'm hesitant to change Tokenizers at the moment because what we have is
> working so nicely - or so I thought.
>
> What I'm looking for is case-insensitive search for words and numbers
> without any of the stemming features turned on. The new requirement is
> that we take punctuation out of the mix.
>
> Right now when I search for "Obama" I'm not getting any hits on "Obama."
>
> So I'm basically looking to strip punctuation. The consequence would be
> that "nation's", "nations" and "nations," would all be represented the
> same way.
>
> Would the StandardTokenizerFactory accomplish this?
> Does it have any language specific functionality?
> Does it do anything with stemming?
>
> Thanks for everyone's input!
>
> -Dave
>
> -----Original Message-----
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: Friday, January 15, 2010 12:42 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Stripping Punctuation in a fieldType
>
> > I'm trying to find the best way to set up a fieldType that
> > strips punctuation.
>
> Use solr.StandardTokenizerFactory, which strips punctuation.
>
> Or, if you do not care about alphanumeric or numeric queries, use
> solr.LowerCaseTokenizerFactory, which uses LetterTokenizer.
>
> > I think the right way to do this is using a
> > CharacterFilter
> > of some type, but I can't seem to find any examples of how
> > to set this
> > up in a schema.xml file.
>
> If you want to use solr.MappingCharFilterFactory, you need to write all
> punctuation characters to a text file manually, e.g. "," => ""
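
For reference, the charFilter approach described above could be wired into schema.xml roughly as follows. This is only a minimal sketch under stated assumptions, not a tested configuration: the fieldType name "text_nopunct" and the mapping file name "mapping-punctuation.txt" are made up for illustration, and the three mapping rules shown are a sample that would need to be extended to cover every punctuation character of interest.

```xml
<!--
  Hypothetical mapping file (conf/mapping-punctuation.txt), one rule per line,
  each mapping a punctuation character to the empty string:

    "," => ""
    "." => ""
    "'" => ""
-->
<fieldType name="text_nopunct" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Remove the mapped punctuation characters before tokenizing. -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-punctuation.txt"/>
    <!-- Split on whitespace only. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Make matching case-insensitive. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With rules like these, "nation's", "nations" and "nations," should all be indexed as "nations", and "Obama." as "obama". No stemming filter is configured, so a search for "nation" would still not match, which is the behavior requested earlier in the thread.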