There are a lot of reasons, the performance hit being a notable one, but also because I feel that using a regex on something this basic amounts to a lazy hack. I'm typically against regular expressions in XML, and I'm vehemently opposed to them in cases where not using them should otherwise be quite trivial. Regarding LowerCaseFilter, etc.:
My question is: why should LowerCaseFilter be the means by which that work is done? I fully agree with keeping things DRY, but I'm not quite sure I agree with how that mantra is being employed. For instance, the two tokenizer statements:

<tokenizer class="solr.WhiteSpaceTokenizer" downCase="true"/>
<tokenizer class="solr.LowerCaseLetterTokenizer"/>

can be written to utilize the same codebase, which makes things DRY and *may* even be a bit more performant for less trivial transformations. If nothing else, I think a "CharacterTokenizer" would be a good way to go:

<tokenizer class="solr.CharacterTokenizer" downCase="true"
           tokenizeSpecialCharacters="true" tokenizeWhiteSpace="true"
           tokenizedCharacterClasses="wd"/>

All that said :) I don't promote myself as an expert, and I'm happy to be shown the light / slapped across the head.

Scott

On Tue, Sep 14, 2010 at 3:10 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:

> How about patching the LetterTokenizer to be capable of tokenizing how you
> want, which can then be combined with a LowerCaseFilter (or not) as
> desired? Or indeed creating a new tokenizer to do exactly what you want
> (but one that doesn't combine an embedded LowerCaseFilter in there, too!),
> instead of patching the LowerCaseTokenizer, which is of dubious value.
> Just brainstorming.
>
> Another way to tokenize based on non-whitespace/alpha/numeric
> character content might be using the existing PatternTokenizerFactory with
> a suitable regexp, as you mention. Which of course could do what the
> LetterTokenizer does too, but presumably not as efficiently. Is that what
> gives you an uncomfortable feeling? If it performs badly enough to matter,
> that's why you'd need a custom tokenizer; other than that, I'm not sure
> anything's undesirable about the PatternTokenizer.
>
> Jonathan
>
> Scott Gonyea wrote:
>
>> I'd agree with your point entirely. My attacking LowerCaseTokenizer was a
>> result of not wanting to create yet more classes.
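As a concrete sketch of the PatternTokenizerFactory approach Jonathan mentions, a field type along these lines splits on runs of anything that isn't a letter or digit (the fieldType name and the exact regex are my own illustrative choices, not from the thread):

```xml
<!-- Sketch only: splits on runs of non-letter, non-digit characters,
     then lower-cases. Regex and fieldType name are assumptions. -->
<fieldType name="text_split_nonalnum" class="solr.TextField">
  <analyzer>
    <!-- with no "group" attribute the pattern acts as a delimiter (split mode) -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\p{L}\p{N}]+"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

As the thread notes, this is the flexible-but-probably-slower option compared to a dedicated tokenizer.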
>>
>> That said, rightfully dumping LowerCaseTokenizer would probably have me
>> creating my own Tokenizer.
>>
>> I could very well be thinking about this wrong... but what if I wanted to
>> create tokens based on non-whitespace/alpha/numeric character content?
>>
>> It looks like I could perhaps use the PatternTokenizer, but that didn't
>> leave me with a comfortable feeling when I first looked into it.
>>
>> Scott
>>
>> On Tue, Sep 14, 2010 at 2:48 PM, Robert Muir <rcm...@gmail.com> wrote:
>>
>>> Jonathan, you bring up an excellent point.
>>>
>>> I think it's worth our time to actually benchmark this LowerCaseTokenizer
>>> versus LetterTokenizer + LowerCaseFilter.
>>>
>>> This tokenizer is quite old, and although I can understand there is no
>>> doubt it's technically faster than LetterTokenizer + LowerCaseFilter even
>>> today (as it can go through the char[] only a single time), I have my
>>> doubts that this brings any value these days...
>>>
>>> On Tue, Sep 14, 2010 at 5:23 PM, Jonathan Rochkind <rochk...@jhu.edu>
>>> wrote:
>>>
>>>> Why would you want to do that, instead of just using another tokenizer
>>>> and a lowercasefilter? It's more confusing, less DRY code to leave them
>>>> separate; the LowerCaseTokenizerFactory combines them anyway because
>>>> someone decided it was such a common use case that it was worth it for
>>>> the demonstrated performance advantage. (At least I hope that's what
>>>> happened; otherwise there's no excuse for it!)
>>>>
>>>> Do you know you get a worthwhile performance benefit for what you're
>>>> doing? If not, why do it?
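To make the single-pass behavior Robert describes concrete, here is a rough plain-Java sketch (not Lucene's actual implementation, just an illustration of the semantics): divide text at non-letters and lower-case each token inside the same loop, rather than tokenizing first and filtering second.

```java
import java.util.ArrayList;
import java.util.List;

public class LowerCaseTokenizeSketch {
    // Divide text at non-letters and lower-case inline, in one pass over
    // the characters: the behavior attributed to LowerCaseTokenizer.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c)); // lower-case as we go
            } else if (current.length() > 0) {
                tokens.add(current.toString()); // non-letter ends the token
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Note that digits are dropped, because they divide text too.
        System.out.println(tokenize("Foo-Bar 42baz")); // prints [foo, bar, baz]
    }
}
```

Running LetterTokenizer first and LowerCaseFilter second would produce the same tokens from two passes instead of one, which is exactly the trade-off worth benchmarking.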
>>>>
>>>> Jonathan
>>>>
>>>> Scott Gonyea wrote:
>>>>
>>>>> I went for a different route:
>>>>>
>>>>> https://issues.apache.org/jira/browse/LUCENE-2644
>>>>>
>>>>> Scott
>>>>>
>>>>> On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir <rcm...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea <sc...@aitrus.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm tweaking my schema, and the LowerCaseTokenizerFactory doesn't
>>>>>>> create tokens based solely on lower-casing characters. Is there a
>>>>>>> way to tell it NOT to drop non-characters? It's amazingly
>>>>>>> frustrating that the TokenizerFactory and the FilterFactory have two
>>>>>>> entirely different modes of behavior. If I wanted it to tokenize
>>>>>>> based on non-lower-case characters... wouldn't I use, say,
>>>>>>> LetterTokenizerFactory and tack on the LowerCaseFilterFactory? Or
>>>>>>> any number of combinations that would otherwise achieve that
>>>>>>> specific end result?
>>>>>>
>>>>>> I don't think you should use LowerCaseTokenizerFactory if you don't
>>>>>> want to divide text on non-letters; it's intended to do just that.
>>>>>>
>>>>>> From the javadocs:
>>>>>>
>>>>>> LowerCaseTokenizer performs the function of LetterTokenizer and
>>>>>> LowerCaseFilter together. It divides text at non-letters and converts
>>>>>> them to lower case. While it is functionally equivalent to the
>>>>>> combination of LetterTokenizer and LowerCaseFilter, there is a
>>>>>> performance advantage to doing the two tasks at once, hence this
>>>>>> (redundant) implementation.
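For anyone following along, the two configurations the javadoc calls functionally equivalent would look roughly like this in schema.xml (the fieldType names here are my own, for illustration):

```xml
<!-- Combined, single-pass form -->
<fieldType name="text_lc_combined" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- Functionally equivalent two-step form -->
<fieldType name="text_lc_split" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.LetterTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Both divide text at non-letters and lower-case the result; the first just does it in one pass.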
>>>>>>
>>>>>>> So... Is there a way for me to tell it to NOT split based on
>>>>>>> non-characters?
>>>>>>
>>>>>> Use a different tokenizer that doesn't split on non-characters,
>>>>>> followed by a LowerCaseFilter.
>>>>>>
>>>>>> --
>>>>>> Robert Muir
>>>>>> rcm...@gmail.com
>>>
>>> --
>>> Robert Muir
>>> rcm...@gmail.com
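Robert's suggestion, expressed as a schema.xml sketch (fieldType name assumed): a tokenizer that splits only on whitespace, so punctuation and digits survive into tokens, followed by the lower-case filter.

```xml
<fieldType name="text_ws_lower" class="solr.TextField">
  <analyzer>
    <!-- splits on whitespace only; keeps digits and punctuation in tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```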