Hi Micke,

There is some good research in here - have you had a chance to create some issues in Jira from this?
On Fri, Feb 23, 2018 at 6:28 AM, Michael Burman <mibur...@redhat.com> wrote:

> Hi,
>
> I was referring to this article by Shipilev (there are a few small issues overlooked in the url you pasted):
>
> https://shipilev.net/blog/2014/nanotrusting-nanotime/
>
> And his lovely recommendation on it: "System.nanoTime is as bad as String.intern now: you can use it, but use it wisely." And Cassandra uses it quite a lot, in the write path at least. There isn't necessarily a better option in Java for it, but for that reason we shouldn't push it everywhere in the code "for fun".
>
> - Micke
>
> On 02/22/2018 06:08 PM, Jeremiah D Jordan wrote:
>>
>> Re: nanoTime vs currentTimeMillis, there is a good blog post here about the timing of both and how your choice of Linux clock source can drastically affect the speed of the calls, also showing that in general on Linux there is no perf improvement of one over the other:
>> http://pzemtsov.github.io/2017/07/23/the-slow-currenttimemillis.html
>>
>>> On Feb 22, 2018, at 11:01 AM, Blake Eggleston <beggles...@apple.com> wrote:
>>>
>>> Hi Micke,
>>>
>>> This is really cool, thanks for taking the time to investigate this. I believe the metrics around memtable insert time come in handy in identifying high partition contention in the memtable. I know I've been involved in a situation over the past year where we got actionable info from this metric. Reducing resolution to milliseconds is probably a no-go, since most things in this path should complete in less than a millisecond.
>>>
>>> Revisiting the use of the codahale metrics in the hot path like this definitely seems like a good idea though. I don't think it's been something we've talked about a lot, and it definitely looks like we could benefit from using something more specialized here. I think it's worth doing, especially since there won't be any major changes to how we do threading in 4.0. It's probably also worth opening a JIRA and investigating the calls to nanoTime. We at least need microsecond resolution here, and there could be something we haven't thought of? It's worth a look at least.
>>>
>>> Thanks,
>>>
>>> Blake
>>>
>>> On 2/22/18, 6:10 AM, "Michael Burman" <mibur...@redhat.com> wrote:
>>>
>>> Hi,
>>>
>>> I wanted to get some input from the mailing list before making a JIRA and potential fixes. I'll touch on the performance more in the latter part, but there's one important question regarding where the write latency metric is recorded. Currently we measure the writeLatency (and the metric write sampler..) in ColumnFamilyStore.apply(), and this is also the metric we then replicate to Keyspace metrics etc.
>>>
>>> This is an odd place for writeLatency. Not only is it in a hot path of Memtable modifications, it also does not measure the real write latency, since it completely ignores the CommitLog latency in that same process. Is the intention really to measure Memtable-modification latency only, or the actual write latencies?
>>>
>>> Then the real issue: this single metric is a cause of huge overhead in Memtable processing. There are several metrics / events in the CFS apply method, including the metric sampler, storageHook reportWrite, colUpdateTimeDeltaHistogram and metric.writeLatency. These are not free at all when it comes to the processing. I made a small JMH benchmark here, which I'll be referring to:
>>>
>>> https://gist.github.com/burmanm/b5b284bc9f1d410b1d635f6d3dac3ade
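For reference, a minimal JMH sketch of that kind of comparison could look like the following. This is a hypothetical illustration, not the contents of the gist above; the class and benchmark names are invented, and it assumes JMH and the codahale/Dropwizard metrics library are on the classpath. It pits the bare clock calls against a single codahale Timer update, which is roughly the per-mutation cost the writeLatency metric adds:

    // Hypothetical sketch only; not the contents of the gist above.
    // Compares bare clock calls with a codahale Timer update, roughly what
    // the CFS write path pays per mutation for the writeLatency metric.
    import java.util.concurrent.TimeUnit;

    import com.codahale.metrics.Timer;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Thread)
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.SECONDS)
    public class WriteLatencyMetricBench
    {
        private final Timer writeLatency = new Timer();

        @Benchmark
        public long nanoTimeOnly()
        {
            // two nanoTime() calls, as in a start/stop measurement around a write
            return System.nanoTime() - System.nanoTime();
        }

        @Benchmark
        public long currentTimeMillisOnly()
        {
            return System.currentTimeMillis() - System.currentTimeMillis();
        }

        @Benchmark
        public void timerUpdate()
        {
            // clock calls plus the codahale histogram update done in the hot path
            long start = System.nanoTime();
            writeLatency.update(System.nanoTime() - start, TimeUnit.NANOSECONDS);
        }
    }

The interesting numbers are the deltas between the clock-only benchmarks and timerUpdate, and how they shift when the Linux clocksource is changed, per the blog posts linked above.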
>>> The most offending of all these metrics is the writeLatency metric. What it does is update the latency in codahale's timer, doing a histogram update and then also going through all the parent metrics, which update the keyspace writeLatency and the globalWriteLatency. When measuring the performance of Memtable.put with a parameter of 1 partition (to reduce the ConcurrentSkipListMap search speed impact; that's a separate issue and takes a little longer to solve, although I've started to prototype something..), on my machine I see 1.3M ops/s with the metric enabled, and when it is disabled the performance climbs to 4M ops/s. So the overhead of this single metric is ~2/3 of total performance. That's insane. My perf stats indicate that the CPU is starved as it can't get enough data in.
>>>
>>> Removing the replication from TableMetrics to the Keyspace & global latencies at write time (and doing it when the metrics are requested instead) improves the performance to 2.1M ops/s on my machine. It's an improvement, but it's still a huge amount. Even when we pressure the ConcurrentSkipListMap with 100 000 partitions in one active Memtable, the performance drops by about 40% due to this metric, so it's never free.
>>>
>>> I did not find any discussion about replacing the metric processing with something faster, so has this been considered before? At least for these performance-sensitive ones. The other issue is obviously the use of System.nanoTime(), which by itself is very slow (two System.nanoTime() calls eat another ~1M ops/s from the performance).
>>>
>>> My personal quick fix would be to move writeLatency to Keyspace.apply, change the write-time aggregates to read-time processing (metrics are read far less often than we write data), and maybe even switch nanoTime -> currentTimeMillis (even given its relative lack of precision). That is, if these metrics make any sense at all at the CFS level? Maybe these should be measured from the network processing time (including all the deserializations and such)? Especially if at some point the smarter threading / eventlooping changes go forward (in which case requests might sleep in some "queue" for a while).
>>>
>>> - Micke
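To make the "change write-time aggregates to read-time processing" idea concrete, here is a hypothetical sketch; the class and method names are invented and this is not Cassandra's actual TableMetrics code. The hot path updates only the per-table timer, and the keyspace-level view is derived lazily, when the metric is read, by folding the per-table snapshots together:

    // Hypothetical sketch of read-time aggregation; not Cassandra's TableMetrics code.
    // The write path touches only the per-table Timer; keyspace-level numbers are
    // computed on demand when someone reads the metric.
    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;
    import java.util.concurrent.TimeUnit;

    import com.codahale.metrics.Gauge;
    import com.codahale.metrics.Snapshot;
    import com.codahale.metrics.Timer;

    public class ReadTimeAggregatedWriteLatency
    {
        private final List<Timer> tableWriteLatencies = new CopyOnWriteArrayList<>();

        public Timer registerTable()
        {
            Timer timer = new Timer();
            tableWriteLatencies.add(timer);
            return timer;
        }

        // Hot path: one histogram update, no fan-out to keyspace/global parents.
        public void recordWrite(Timer tableTimer, long startNanos)
        {
            tableTimer.update(System.nanoTime() - startNanos, TimeUnit.NANOSECONDS);
        }

        // Cold path: approximate keyspace-wide mean write latency in microseconds,
        // computed from the per-table reservoir snapshots only when requested.
        public Gauge<Double> keyspaceMeanWriteLatencyMicros()
        {
            return () -> {
                long totalCount = 0;
                double weightedMeanNanos = 0;
                for (Timer timer : tableWriteLatencies)
                {
                    long count = timer.getCount();
                    Snapshot snapshot = timer.getSnapshot();
                    totalCount += count;
                    weightedMeanNanos += snapshot.getMean() * count;
                }
                return totalCount == 0 ? 0.0 : (weightedMeanNanos / totalCount) / 1000.0;
            };
        }
    }

The trade-off is that the keyspace-level value becomes an approximation built from sampled reservoir snapshots at read time rather than an exact write-time aggregate, which seems acceptable for latencies that are already sampled, and it removes the two extra parent Timer updates from every mutation.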