There are a lot of reasons, the performance hit being a notable one, but also because I feel that using a regex on something this basic amounts to a lazy hack. I'm typically against regular expressions in XML, and I'm vehemently opposed to them in cases where not using them should otherwise be quite trivial. Regarding LowerCaseFilter, etc.:
My question is: why should LowerCaseFilter be the means by which that work is done? I fully agree with keeping things DRY, but I'm not quite sure I agree with how that mantra is being employed. For instance, the two tokenizer statements:

<tokenizer class="solr.WhiteSpaceTokenizer" downCase="true"/>
<tokenizer class="solr.LowerCaseLetterTokenizer"/>

can be written to utilize the same codebase, which makes things DRY and *may* even be a bit more performant for less trivial transformations. If nothing else, I think a "CharacterTokenizer" would be a good way to go:

<tokenizer class="solr.CharacterTokenizer" downCase="true"
           tokenizeSpecialCharacters="true" tokenizeWhiteSpace="true"
           tokenizedCharacterClasses="wd"/>

All that said :) I don't promote myself as an expert, and I'm happy to be shown the light / slapped across the head.

Scott

On Tue, Sep 14, 2010 at 3:10 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:

> How about patching the LetterTokenizer to be capable of tokenizing how you
> want, which can then be combined with a LowerCaseFilter (or not) as
> desired? Or indeed creating a new tokenizer to do exactly what you want
> (but one that doesn't combine an embedded LowerCaseFilter in there, too!),
> instead of patching the LowerCaseTokenizer, which is of dubious value.
> Just brainstorming.
>
> Another way to tokenize based on non-whitespace/alpha/numeric
> character content might be using the existing PatternTokenizerFactory with
> a suitable regexp, as you mention. Which of course could do what the
> LetterTokenizer does too, but presumably not as efficiently. Is that what
> gives you an uncomfortable feeling? If it performs badly enough to matter,
> that's why you'd need a custom tokenizer; other than that, I'm not sure
> anything's undesirable about the PatternTokenizer.
>
> Jonathan
>
> Scott Gonyea wrote:
>
>> I'd agree with your point entirely. My attacking LowerCaseTokenizer was a
>> result of not wanting to create yet more classes.
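As a concrete sketch of the PatternTokenizerFactory approach Jonathan mentions, a field type along these lines splits on runs of anything that isn't a letter or digit (the fieldType name and the exact regex are my own illustrative choices, not from the thread):

```xml
<!-- Sketch only: splits on runs of non-letter, non-digit characters,
     then lower-cases. Regex and fieldType name are assumptions. -->
<fieldType name="text_split_nonalnum" class="solr.TextField">
  <analyzer>
    <!-- with no "group" attribute the pattern acts as a delimiter (split mode) -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\p{L}\p{N}]+"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

As the thread notes, this is the flexible-but-probably-slower option compared to a dedicated tokenizer.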
>>
>> That said, rightfully dumping LowerCaseTokenizer would probably have me
>> creating my own Tokenizer.
>>
>> I could very well be thinking about this wrong... but what if I wanted to
>> create tokens based on non-whitespace/alpha/numeric character content?
>>
>> It looks like I could perhaps use the PatternTokenizer, but that didn't
>> leave me with a comfortable feeling when I first looked into it.
>>
>> Scott
>>
>> On Tue, Sep 14, 2010 at 2:48 PM, Robert Muir <rcm...@gmail.com> wrote:
>>
>>> Jonathan, you bring up an excellent point.
>>>
>>> I think it's worth our time to actually benchmark this LowerCaseTokenizer
>>> versus LetterTokenizer + LowerCaseFilter.
>>>
>>> This tokenizer is quite old, and although I can understand there is no
>>> doubt it's technically faster than LetterTokenizer + LowerCaseFilter even
>>> today (as it can go through the char[] only a single time), I have my
>>> doubts that this brings any value these days...
>>>
>>> On Tue, Sep 14, 2010 at 5:23 PM, Jonathan Rochkind <rochk...@jhu.edu>
>>> wrote:
>>>
>>>> Why would you want to do that, instead of just using another tokenizer
>>>> and a lowercasefilter? It's more confusing, less DRY code to leave them
>>>> separate; the LowerCaseTokenizerFactory combines them anyway because
>>>> someone decided it was such a common use case that it was worth it for
>>>> the demonstrated performance advantage. (At least I hope that's what
>>>> happened; otherwise there's no excuse for it!)
>>>>
>>>> Do you know you get a worthwhile performance benefit for what you're
>>>> doing? If not, why do it?
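To make the single-pass behavior Robert describes concrete, here is a rough plain-Java sketch (not Lucene's actual implementation, just an illustration of the semantics): divide text at non-letters and lower-case each token inside the same loop, rather than tokenizing first and filtering second.

```java
import java.util.ArrayList;
import java.util.List;

public class LowerCaseTokenizeSketch {
    // Divide text at non-letters and lower-case inline, in one pass over
    // the characters: the behavior attributed to LowerCaseTokenizer.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c)); // lower-case as we go
            } else if (current.length() > 0) {
                tokens.add(current.toString()); // non-letter ends the token
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Note that digits are dropped, because they divide text too.
        System.out.println(tokenize("Foo-Bar 42baz")); // prints [foo, bar, baz]
    }
}
```

Running LetterTokenizer first and LowerCaseFilter second would produce the same tokens from two passes instead of one, which is exactly the trade-off worth benchmarking.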
>>>>
>>>> Jonathan
>>>>
>>>> Scott Gonyea wrote:
>>>>
>>>>> I went for a different route:
>>>>>
>>>>> https://issues.apache.org/jira/browse/LUCENE-2644
>>>>>
>>>>> Scott
>>>>>
>>>>> On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir <rcm...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea <sc...@aitrus.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm tweaking my schema, and the LowerCaseTokenizerFactory doesn't
>>>>>>> create tokens based solely on lower-casing characters. Is there a
>>>>>>> way to tell it NOT to drop non-characters? It's amazingly
>>>>>>> frustrating that the TokenizerFactory and the FilterFactory have two
>>>>>>> entirely different modes of behavior. If I wanted it to tokenize
>>>>>>> based on non-lower-case characters... wouldn't I use, say,
>>>>>>> LetterTokenizerFactory and tack on the LowerCaseFilterFactory? Or
>>>>>>> any number of combinations that would otherwise achieve that
>>>>>>> specific end result?
>>>>>>
>>>>>> I don't think you should use LowerCaseTokenizerFactory if you don't
>>>>>> want to divide text on non-letters; it's intended to do just that.
>>>>>>
>>>>>> From the javadocs:
>>>>>>
>>>>>> LowerCaseTokenizer performs the function of LetterTokenizer and
>>>>>> LowerCaseFilter together. It divides text at non-letters and converts
>>>>>> them to lower case. While it is functionally equivalent to the
>>>>>> combination of LetterTokenizer and LowerCaseFilter, there is a
>>>>>> performance advantage to doing the two tasks at once, hence this
>>>>>> (redundant) implementation.
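For anyone following along, the two configurations the javadoc calls functionally equivalent would look roughly like this in schema.xml (the fieldType names here are my own, for illustration):

```xml
<!-- Combined, single-pass form -->
<fieldType name="text_lc_combined" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- Functionally equivalent two-step form -->
<fieldType name="text_lc_split" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.LetterTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Both divide text at non-letters and lower-case the result; the first just does it in one pass.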
>>>>>>
>>>>>>> So... Is there a way for me to tell it to NOT split based on
>>>>>>> non-characters?
>>>>>>
>>>>>> Use a different tokenizer that doesn't split on non-characters,
>>>>>> followed by a LowerCaseFilter.
>>>>>>
>>>>>> --
>>>>>> Robert Muir
>>>>>> rcm...@gmail.com
>>>
>>> --
>>> Robert Muir
>>> rcm...@gmail.com
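Robert's suggestion, expressed as a schema.xml sketch (fieldType name assumed): a tokenizer that splits only on whitespace, so punctuation and digits survive into tokens, followed by the lower-case filter.

```xml
<fieldType name="text_ws_lower" class="solr.TextField">
  <analyzer>
    <!-- splits on whitespace only; keeps digits and punctuation in tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```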