On 11-Dec-07, at 11:51 AM, Ken Krugler wrote:
> Hi all,
> I've got a pattern in a document (call it "xy") that I want to turn
> into two tokens - "xy" and "y".
> One approach I could use is PatternTokenizer to extract "xy", and
> then a custom filter that returns "xy" and then "y" on the next
> call (caches the next result).
> Or I could extend PatternTokenizer to return multiple tokens per
> match, though figuring out how to specify that in the schema seems
> harder.
> Is there another approach that wouldn't require any custom code?

Not that I can think of. Perhaps the natural way of extending
PatternTokenizer to return subtokens is to use the grouping of the
regular expression. That is, specify "x(y)" to return both "xy" and
"y". Java has a non-capturing group operator, (?:...), the same as
Python's, so plain grouping would still be available when no subtoken
is wanted.
Python does this for re.split, which I find nice:
>>> re.split('a(b)c', 'oneabctwoabcthree')
['one', 'b', 'two', 'b', 'three']
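
To make the idea concrete, here is a rough sketch in Python of the behaviour I have in mind: emit the whole match as one token, then each capturing group as a further token. The name tokens_with_groups is made up for illustration; it is not part of Solr or PatternTokenizer, and a real implementation would live in Java.

```python
import re

def tokens_with_groups(pattern, text):
    """Yield each full match, then each capturing group inside it.

    Hypothetical helper, for illustration only.
    """
    for match in re.finditer(pattern, text):
        yield match.group(0)          # the whole match, e.g. "xy"
        for sub in match.groups():    # capturing groups only
            if sub:                   # skip groups that did not participate
                yield sub

# A capturing group produces the extra subtoken.
print(list(tokens_with_groups(r"x(y)", "one xy two xy")))
# ['xy', 'y', 'xy', 'y']

# A non-capturing group (?:...) keeps the grouping but adds no subtoken.
print(list(tokens_with_groups(r"x(?:y)", "one xy two xy")))
# ['xy', 'xy']
```

With this convention the schema would only need the pattern itself; whether a group is written as (...) or (?:...) decides whether it yields a subtoken.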