[ https://issues.apache.org/jira/browse/LOG4J2-2606?focusedWorklogId=246563&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-246563 ]
ASF GitHub Bot logged work on LOG4J2-2606: ------------------------------------------ Author: ASF GitHub Bot Created on: 22/May/19 03:43 Start Date: 22/May/19 03:43 Worklog Time Spent: 10m Work Description: carterkozak commented on issue #273: LOG4J2-2606: Substantially improve async logging performance under heavy load URL: https://github.com/apache/logging-log4j2/pull/273#issuecomment-494640721 > Can we experiment a bit before moving ahead? @carterkozak You reported this happened in your production environment. Can I ask how you've dealt with that? Is your organization waiting for a quick Log4j fix, or do you have some other interim solution? Absolutely, I'd much rather get this done correctly than quickly. For the time being we've implemented a custom `AsyncQueueFullPolicy` which uses `EventRoute.SYNCHRONOUS` instead of `ENQUEUE`. Event order isn't terribly problematic in our case because because our events use a custom json format, and flow through an automated pipeline where they're indexed and presented in order based on timestamps. > I'm also surprised that all the WaitStrategies were reported to show the same behaviour when the ringbuffer is full and under load. The proposed change (synchronizing on a mutex object) seems to do the same as the BlockingWaitStrategy provided by the Disruptor. So I would expect at least the BlockingWaitStrategy to show some improvement... The BlockingWaitStrategy in disruptor master does appear much closer to what I've written than the version in 3.4.2, however if I backport that version of the file to this branch and run benchmarks (values in queuFullPolicy=ENQUEUE_UNSYNCHRONIZED), the java process CPU utilization exceeds 1000% (across 6 cores + hyperthreading) and throughput is lower than the timeout strategy. (disclaimer: benchmarks run on a laptop, cpu throttling and other processes produce noise) ``` Benchmark (queueFullPolicy) Mode Cnt Score Error Units ConcurrentAsyncLoggerToFileBenchmark.concurrentLoggingThreads ENQUEUE thrpt 3 1269003.278 ± 858705.609 ops/s ConcurrentAsyncLoggerToFileBenchmark.concurrentLoggingThreads ENQUEUE_UNSYNCHRONIZED thrpt 3 296842.480 ± 1060494.701 ops/s ConcurrentAsyncLoggerToFileBenchmark.concurrentLoggingThreads SYNCHRONOUS thrpt 3 1095597.447 ± 767282.544 ops/s ConcurrentAsyncLoggerToFileBenchmark.singleLoggingThread ENQUEUE thrpt 3 1231449.118 ± 1874760.323 ops/s ConcurrentAsyncLoggerToFileBenchmark.singleLoggingThread ENQUEUE_UNSYNCHRONIZED thrpt 3 1329153.582 ± 2409041.282 ops/s ConcurrentAsyncLoggerToFileBenchmark.singleLoggingThread SYNCHRONOUS thrpt 3 1336944.414 ± 636546.820 ops/s ``` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking ------------------- Worklog Id: (was: 246563) Time Spent: 1h (was: 50m) > Asynchronous logging when the queue is full results heavy CPU load > ------------------------------------------------------------------ > > Key: LOG4J2-2606 > URL: https://issues.apache.org/jira/browse/LOG4J2-2606 > Project: Log4j 2 > Issue Type: Bug > Affects Versions: 2.11.2 > Reporter: Carter Kozak > Assignee: Carter Kozak > Priority: Major > Time Spent: 1h > Remaining Estimate: 0h > > I've encountered unexpected performance characteristics when the asynchronous > logging queue is full, resulting in the application becoming unavailable for > an extended period of time. > I think EventRoute.ENQUEUE is using a spin lock (or similar construct). When > many threads attempt to log at a higher rate than we can drain the queue, > logging throughput drops significantly while CPU utilization skyrockets > causing instability across the system. > I can reproduce this in a benchmark (I will clean it up and push when I have > a moment) where I get the following results: > Setup: > Root logger has a random access file appender with immediateFlush disabled > AsyncLoggerContextSelector enabled for completely asynchronous logging > 1 thread logging, any queue full policy: ~1,400k events/second, 150% CPU > 32 threads logging, default policy (EventRoute.ENQUEUE, default policy): > 300k-400k events/second. 930% CPU load > 32 threads logging, custom policy using EventRoute.SYNCHRONOUS: 1,200k > events/second. 350% CPU load > Using the synchronous policy results in ~1/3 the CPU load and 3-4x throughput > when the system is loaded to the point that the logging queue is filled. -- This message was sent by Atlassian JIRA (v7.6.3#76005)