Re: Pattern that generates two tokens per match

Mike Klaas Tue, 11 Dec 2007 13:07:42 -0800

On 11-Dec-07, at 11:51 AM, Ken Krugler wrote:

Hi all,
I've got a pattern in a document (call it "xy") that I want to turn into two tokens - "xy" and "y".
One approach I could use is PatternTokenizer to extract "xy", and then a custom filter that returns "xy" and then "y" on the next call (caches the next result).
Or I could extend PatternTokenizer to return multiple tokens per match, though figuring out how to specify that in the schema seems harder.
Is there another approach that wouldn't require any custom code?

Not that I can think of. Perhaps the natural way of extending PatternTokenizer to return subtokens is to use the grouping of the regular expression. That is, specify "x(y)" to return both. I assume that Java has a non-selecting (non-capturing) re group operator (it's (?:...) in Python), so the basic grouping functionality would not be lost.
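A minimal sketch of that idea, written in Python rather than Java and not tied to any actual PatternTokenizer code: each match emits the whole match first, then the text of every capturing group. The helper name tokens_from_groups is hypothetical and exists only for this illustration.

```python
import re

def tokens_from_groups(pattern, text):
    """Yield the whole match, then each capturing group that took part in it.

    Hypothetical helper for illustration only; it is not part of Solr or Lucene.
    """
    for m in re.finditer(pattern, text):
        yield m.group(0)           # the full match, e.g. "xy"
        for sub in m.groups():     # capturing groups only; (?:...) is skipped
            if sub is not None:    # a group that did not participate is None
                yield sub

# "x(y)" yields both the match and the subtoken.
print(list(tokens_from_groups(r"x(y)", "ab xy cd xy")))    # ['xy', 'y', 'xy', 'y']

# With a non-capturing group, plain grouping adds no extra tokens.
print(list(tokens_from_groups(r"x(?:y)", "ab xy cd xy")))  # ['xy', 'xy']
```

Under this scheme the pattern alone decides which subtokens come out, so nothing beyond the regular expression would need to be specified in the schema.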


Python does this for re.split, which I find nice:

>>> re.split('a(b)c', 'oneabctwoabcthree')
 ['one', 'b', 'two', 'b', 'three']
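For contrast, the same split with a non-capturing group drops the subtoken, which is why ordinary grouping stays usable. This is only a small illustration of Python's re.split behaviour, not of anything in Solr.

```python
import re

# (?:b) groups without capturing, so re.split returns only the pieces
# between the separators and no 'b' entries.
parts = re.split(r"a(?:b)c", "oneabctwoabcthree")
print(parts)  # ['one', 'two', 'three']
```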
