Michael McCandless created LUCENE-9457:
------------------------------------------
Summary: Why is Kuromoji tokenization throughput bimodal?
Key: LUCENE-9457
URL: https://issues.apache.org/jira/browse/LUCENE-9457
Project: Lucene - Core
Issue Type: Improvement
Reporter: Michael McCandless

After the recent accidental regression of Japanese (Kuromoji) tokenization throughput due to exciting FST optimizations, we [added new nightly Lucene benchmarks|https://github.com/mikemccand/luceneutil/issues/64] to measure tokenization throughput for {{JapaneseTokenizer}}: [https://home.apache.org/~mikemccand/lucenebench/analyzers.html]

It has been running for ~5-6 weeks now, but for some reason the results look bimodal: "normally" throughput is ~0.45 M tokens/sec, but for two data points it dropped to ~0.33 M tokens/sec, which is odd.

Maybe it is HotSpot compilation noise? But it would be good to get to the root cause and fix it if possible. HotSpot noise that randomly steals ~27% of your tokenization throughput is no good!!

Or does anyone have any other ideas of what could be bimodal in Kuromoji? I don't think [this performance test|https://github.com/mikemccand/luceneutil/blob/master/src/main/perf/TestAnalyzerPerf.java] has any randomness in it...

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
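As an aside, to make the "bimodal" observation concrete: one cheap way to confirm the nightly samples really fall into two clusters (rather than a continuous spread) is a simple 1D two-means split. The sketch below is a hypothetical, stdlib-only helper, not Lucene or luceneutil code; the class name, method, and the synthetic sample values are illustrative assumptions.

```java
import java.util.Arrays;

// Hypothetical helper (not part of Lucene): split nightly throughput
// samples (in M tokens/sec) into two clusters with a 1D two-means pass,
// so the two modes (~0.33 and ~0.45 in this issue) can be quantified.
public class BimodalCheck {

  // Returns {lowMean, highMean} after a few Lloyd iterations,
  // seeded with the min and max samples.
  public static double[] twoMeans(double[] samples) {
    double[] s = samples.clone();
    Arrays.sort(s);
    double lo = s[0], hi = s[s.length - 1];
    for (int iter = 0; iter < 20; iter++) {
      double loSum = 0, hiSum = 0;
      int loN = 0, hiN = 0;
      for (double v : s) {
        // Assign each sample to the nearer of the two current means.
        if (Math.abs(v - lo) <= Math.abs(v - hi)) { loSum += v; loN++; }
        else { hiSum += v; hiN++; }
      }
      double newLo = loN > 0 ? loSum / loN : lo;
      double newHi = hiN > 0 ? hiSum / hiN : hi;
      if (newLo == lo && newHi == hi) break;  // converged
      lo = newLo;
      hi = newHi;
    }
    return new double[] {lo, hi};
  }

  public static void main(String[] args) {
    // Synthetic data resembling the reported pattern: most nights near
    // 0.45 M tokens/sec, with two dips near 0.33 (illustrative values).
    double[] nightly = {0.45, 0.46, 0.44, 0.45, 0.33, 0.45, 0.34, 0.46};
    double[] modes = twoMeans(nightly);
    System.out.printf("low mode ~ %.3f, high mode ~ %.3f%n", modes[0], modes[1]);
  }
}
```

If the two recovered means stay well separated as more nightly points accumulate (rather than collapsing toward one value), that supports a genuinely bimodal cause such as JIT compilation landing in two different steady states, rather than ordinary run-to-run jitter.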