On 11-Dec-07, at 11:51 AM, Ken Krugler wrote:

> Hi all,
>
> I've got a pattern in a document (call it "xy") that I want to turn into two tokens - "xy" and "y".
>
> One approach I could use is PatternTokenizer to extract "xy", and then a custom filter that returns "xy" and then "y" on the next call (caches the next result).
>
> Or I could extend PatternTokenizer to return multiple tokens per match, though figuring out how to specify that in the schema seems harder.
>
> Is there another approach that wouldn't require any custom code?

Not that I can think of. Perhaps the natural way of extending PatternTokenizer to return subtokens is to use the grouping in the regular expression: that is, specify "x(y)" to return both "xy" and "y". I assume that Java has a non-capturing group operator (it's (?:...) in Python), so the basic grouping functionality would not be lost.

Python does this for re.split, which I find nice:

>>> re.split('a(b)c', 'oneabctwoabcthree')
['one', 'b', 'two', 'b', 'three']
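
To make the grouping idea concrete, here is a rough Python sketch of what "one token per match, plus one per capturing group" could look like. It is only an illustration of the semantics, not PatternTokenizer code; the helper name tokens_with_groups is made up for this example.

```python
import re

def tokens_with_groups(pattern, text):
    """Yield the full match, then each non-empty capturing group, per match.

    Hypothetical helper for illustration; not part of any Solr/Lucene API.
    """
    for m in re.finditer(pattern, text):
        yield m.group(0)          # the whole match, e.g. "xy"
        for g in m.groups():      # capturing groups only, e.g. "y"
            if g:                 # skip groups that were unmatched or empty
                yield g

# A capturing group produces an extra subtoken for every match.
print(list(tokens_with_groups(r'x(y)', 'a xy b xy')))    # ['xy', 'y', 'xy', 'y']

# A non-capturing group (?:...) still groups, but adds no subtoken.
print(list(tokens_with_groups(r'x(?:y)', 'a xy b xy')))  # ['xy', 'xy']
```

With this convention, anyone who needs grouping purely for the regex (alternation, quantifiers) would write (?:...) and get the current one-token-per-match behaviour.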


