Hi all,

I'm trying to use PatternTokenizer and I'm not getting the results I
expect, and I'm not sure where the failure lies. What I'm trying to do is
split my input on whitespace, except where the whitespace is preceded
by a hyphen character. To do this I'm using a negative lookbehind
assertion in the pattern, e.g. "(?<!-)\s+".

Expected behavior:
"foo bar" -> ["foo","bar"] - OK
"foo \n bar" -> ["foo","bar"] - OK
"foo- bar" -> ["foo- bar"] - OK
"foo-\nbar" -> ["foo-\nbar"] - OK
"foo- \n bar" -> ["foo- \n bar"] - FAILS

Here's a test case that demonstrates the failure:

        public void testPattern() throws Exception {
            Map<String,String> args = new HashMap<String, String>();
            args.put( PatternTokenizerFactory.GROUP, "-1" );
            args.put( PatternTokenizerFactory.PATTERN, "(?<!-)\\s+" );
            Reader reader = new StringReader(
                "blah \n foo bar- baz\nfoo-\nbar- baz foo- \n bar" );
            PatternTokenizerFactory tokFactory = new PatternTokenizerFactory();
            tokFactory.init( args );
            TokenStream stream = tokFactory.create( reader );
            assertTokenStreamContents( stream, new String[] {
                "blah", "foo", "bar- baz", "foo-\nbar- baz", "foo- \n bar" } );
        }

This fails with the following output:
"org.junit.ComparisonFailure: term 4 expected:<foo- [\n bar]> but was:<foo- []>"

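To narrow this down, here is a minimal sketch that runs the same pattern
through plain java.util.regex, with no Lucene/Solr classes involved. The
class name LookbehindSplitDemo and the split helper are made up for
illustration, and I'm assuming that Pattern.split is a fair stand-in for
what the tokenizer does with group -1.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Lucene or Solr.
public class LookbehindSplitDemo {
    // Same pattern as in the test case: whitespace not preceded by a hyphen.
    private static final Pattern SPLIT = Pattern.compile("(?<!-)\\s+");

    // Splits the input on every match of the pattern.
    static String[] split(String input) {
        return SPLIT.split(input);
    }

    public static void main(String[] args) {
        String[] inputs = {
            "foo bar", "foo \n bar", "foo- bar", "foo-\nbar", "foo- \n bar"
        };
        for (String input : inputs) {
            // Print the pieces for each of the cases listed above.
            System.out.println(Arrays.toString(split(input)));
        }
    }
}
```

With plain regex the last input also comes back as "foo- " and "bar". The
lookbehind only rejects a match that starts directly after the hyphen, so
the engine is free to start the match one character later, at the "\n",
which is preceded by a space. That would suggest the behavior comes from
the pattern itself rather than from the tokenizer.
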
Am I doing something wrong? Incorrect expectations? Or could this be a bug?

Thanks,
--jay
