On 11/16/2012 12:52 PM, Shawn Heisey wrote:
On 11/16/2012 12:36 PM, Jack Krupansky wrote:
> Generally, you don't need the preserveOriginal attribute for WDF.
> Generate both the word parts and the concatenated terms, and queries
> should work fine without the original. The separated terms will be
> indexed as a sequence, and the split/separated terms will generate a
> phrase query that matches the indexed sequence. And if you index the
> concatenated terms, that can be queried as well.
> With that issue out of the way, is there a remaining issue here?
You're right, that's handled by catenateWords. I do need
preserveOriginal for other things, though; I don't think it matters
for this discussion. I may consider removing it at a later stage, but
right now our assessment is that we need it.
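
To make that concrete, here is a rough sketch of the sort of fieldType
we're talking about. The attribute values are illustrative, not copied
from our actual schema:

  <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- ICUTokenizer splits on whitespace, punctuation, and script changes -->
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <!-- WDF as Jack describes it: emit the word parts plus the catenated
           term; preserveOriginal kept here because we rely on it elsewhere -->
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1"
              generateNumberParts="1"
              catenateWords="1"
              preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>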
The immediate problem is that when ICUTokenizer is done with an input
of "Word1-Word2", I am left with two tokens, Word1 and Word2. The
punctuation in the middle is gone. Even if WDF is the very next thing
in the analysis chain, there's nothing for it to do; the fact that
Word1 and Word2 were connected by punctuation is entirely lost.
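
Here is a rough sketch of what I believe happens to the input
"Word1-Word2" in the two chains (my understanding, not actual output
from the analysis page):

  Input: Word1-Word2

  ICUTokenizer        -> Word1 | Word2          (hyphen already gone)
  + WDF               -> Word1 | Word2          (nothing left for WDF to do)

  WhitespaceTokenizer -> Word1-Word2            (hyphen survives)
  + WDF               -> Word1-Word2 | Word1 | Word2 | Word1Word2
                         (preserveOriginal, word parts, catenateWords)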
Ideally I would like to see a "splitOnPunctuation" option on most of
the available tokenizers, but if a filter were available that did just
one piece of ICUTokenizer's functionality (splitting tokens on script
changes), I would have a solution in combination with WhitespaceTokenizer.
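
If such a filter existed, the chain I have in mind would look roughly
like this (ScriptChangeSplitFilterFactory is a made-up name for the
hypothetical filter; it does not exist):

  <analyzer>
    <!-- WhitespaceTokenizer leaves "Word1-Word2" as a single token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- hypothetical filter doing only the script-change splitting
         that ICUTokenizer currently does internally -->
    <filter class="solr.ScriptChangeSplitFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            catenateWords="1"
            preserveOriginal="1"/>
  </analyzer>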
I have been looking at the source code related to ICUTokenizer, trying
to get a handle on how it works. Based on what I've learned so far, I'm
not sure it can be made to leave punctuation alone (not split on it) in
the way that I need. If someone knows it well enough to comment, I
would love to know for sure.
Thanks,
Shawn