All,
  I don't know if this change was intended, but it feels like a bug to me...

TokenFilterFactory[] filters = new TokenFilterFactory[2];
filters[0] = new LowerCaseFilterFactory(Collections.EMPTY_MAP);
filters[1] = new ASCIIFoldingFilterFactory(Collections.EMPTY_MAP);
TokenizerChain chain = new TokenizerChain(
        new MockTokenizerFactory(Collections.EMPTY_MAP), filters);
System.out.println("NORMALIZE: "
        + chain.normalize("text0", "f\u00F6\u00F6Ba").utf8ToString());
System.out.println("NORMALIZE with multiterm: "
        + chain.getMultiTermAnalyzer().normalize("text0", "f\u00F6\u00F6Ba").utf8ToString());

output:
NORMALIZE: fooba
NORMALIZE with multiterm: fööBa

If this is a bug and not the desired behavior, the source of the
problem is that the Analyzer returned by TokenizerChain's
getMultiTermAnalyzer() does not override #normalize(String fieldName,
TokenStream ts). It therefore falls back to the default no-op
normalization, which means the multi-term analyzer returned by
TokenizerChain doesn't actually normalize multi-terms!
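To make the failure mode concrete, here's a minimal self-contained sketch of the same delegation bug, using hypothetical stand-in classes rather than the real Lucene/Solr Analyzer and TokenizerChain: a wrapper that forwards analysis but never overrides normalize() silently inherits the base class's no-op default, exactly like the output above.

```java
import java.text.Normalizer;

// Stand-in for Analyzer: normalize() defaults to the identity.
class BaseAnalyzer {
    String normalize(String fieldName, String text) {
        return text; // default: no normalization applied
    }
}

// Stand-in for TokenizerChain: its own normalize() lowercases and
// strips diacritics (a rough ASCII-folding substitute).
class Chain extends BaseAnalyzer {
    @Override
    String normalize(String fieldName, String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "") // drop combining marks
                .toLowerCase();
    }

    // Analogous to getMultiTermAnalyzer(): the returned wrapper has no
    // normalize() override, so the base no-op is used instead.
    BaseAnalyzer getMultiTermAnalyzer() {
        return new BaseAnalyzer() { /* missing normalize() override */ };
    }
}

public class NormalizeBugDemo {
    public static void main(String[] args) {
        Chain chain = new Chain();
        System.out.println("NORMALIZE: "
                + chain.normalize("text0", "f\u00F6\u00F6Ba"));
        System.out.println("NORMALIZE with multiterm: "
                + chain.getMultiTermAnalyzer().normalize("text0", "f\u00F6\u00F6Ba"));
        // prints "fooba", then the untouched "fööBa"
    }
}
```

The fix would presumably be for the anonymous Analyzer inside getMultiTermAnalyzer() to override normalize() as well, delegating to the multi-term-aware filter components.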

If this is a bug, I'll open a ticket.
