Michael McCandless created LUCENE-9457:
------------------------------------------
Summary: Why is Kuromoji tokenization throughput bimodal?
Key: LUCENE-9457
URL: https://issues.apache.org/jira/browse/LUCENE-9457
Project: Lucene - Core
Issue Type: Improvement
Reporter: Michael McCandless

After the recent accidental regression of Japanese (Kuromoji) tokenization throughput due to exciting FST optimizations, we [added new nightly Lucene benchmarks|https://github.com/mikemccand/luceneutil/issues/64] to measure tokenization throughput for {{JapaneseTokenizer}}: [https://home.apache.org/~mikemccand/lucenebench/analyzers.html]

It has been running for ~5-6 weeks now, but for some reason the results look bimodal: "normally" throughput is ~0.45 M tokens/sec, but for two data points it dropped to ~0.33 M tokens/sec, which is odd.

Maybe it is HotSpot compilation noise? But it would be good to get to the root cause and fix it if possible. HotSpot noise that randomly steals ~27% of your tokenization throughput is no good!!

Or does anyone have any other ideas of what could be bimodal in Kuromoji? I don't think [this performance test|https://github.com/mikemccand/luceneutil/blob/master/src/main/perf/TestAnalyzerPerf.java] has any randomness in it...

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
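As an aside, to make the "bimodal" observation concrete: one cheap way to confirm the nightly samples really fall into two clusters (rather than a continuous spread) is a simple 1D two-means split. The sketch below is a hypothetical, stdlib-only helper, not Lucene or luceneutil code; the class name, method, and the synthetic sample values are illustrative assumptions.

```java
import java.util.Arrays;

// Hypothetical helper (not part of Lucene): split nightly throughput
// samples (in M tokens/sec) into two clusters with a 1D two-means pass,
// so the two modes (~0.33 and ~0.45 in this issue) can be quantified.
public class BimodalCheck {

  // Returns {lowMean, highMean} after a few Lloyd iterations,
  // seeded with the min and max samples.
  public static double[] twoMeans(double[] samples) {
    double[] s = samples.clone();
    Arrays.sort(s);
    double lo = s[0], hi = s[s.length - 1];
    for (int iter = 0; iter < 20; iter++) {
      double loSum = 0, hiSum = 0;
      int loN = 0, hiN = 0;
      for (double v : s) {
        // Assign each sample to the nearer of the two current means.
        if (Math.abs(v - lo) <= Math.abs(v - hi)) { loSum += v; loN++; }
        else { hiSum += v; hiN++; }
      }
      double newLo = loN > 0 ? loSum / loN : lo;
      double newHi = hiN > 0 ? hiSum / hiN : hi;
      if (newLo == lo && newHi == hi) break;  // converged
      lo = newLo;
      hi = newHi;
    }
    return new double[] {lo, hi};
  }

  public static void main(String[] args) {
    // Synthetic data resembling the reported pattern: most nights near
    // 0.45 M tokens/sec, with two dips near 0.33 (illustrative values).
    double[] nightly = {0.45, 0.46, 0.44, 0.45, 0.33, 0.45, 0.34, 0.46};
    double[] modes = twoMeans(nightly);
    System.out.printf("low mode ~ %.3f, high mode ~ %.3f%n", modes[0], modes[1]);
  }
}
```

If the two recovered means stay well separated as more nightly points accumulate (rather than collapsing toward one value), that supports a genuinely bimodal cause such as JIT compilation landing in two different steady states, rather than ordinary run-to-run jitter.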