Interesting results! I haven't experienced any problems with logging myself, but I've also never benchmarked it.
I don't have much experience with memory pools, but if you're still getting blockages on a call to new/malloc you may be able to get around it by using the memory pool and then calling the constructor in-place with placement new (I have never done this before, but I know it is possible to do) [1][2]; there are some rough, untested sketches of this and of the changes described below at the bottom of this mail. Or there may be some way to do it using a custom allocator, which is something I have never done either.

-Robert Middleton

[1]: https://stackoverflow.com/questions/519808/call-a-constructor-on-a-already-allocated-memory
[2]: https://stackoverflow.com/questions/222557/what-uses-are-there-for-placement-new

On Thu, Aug 16, 2018 at 11:58 AM, Matt Sicker <boa...@gmail.com> wrote:
> I don't know much about the state of log4cxx architecture, but nearly all your points (other than the lock in stringstream) are points we optimize for in log4j2, at least. Even the stringstream optimization sounds similar to StringBuffer versus StringBuilder in Java. As for the queue used in async logging, I'm not sure what guarantees you get in C++ memory models, but I'm curious whether the disruptor queue pattern made its way over to Boost?
>
> On Thu, 16 Aug 2018 at 10:37, Denys Smolainiuk <denys.smolian...@harmonicinc.com> wrote:
>
>> Hello All,
>>
>> I'd like to share some experience, as well as some patches, with regard to using log4cxx in a timing-critical application. First, a few words about our requirements: it's a service which must generate network packets with up to a hundred microseconds of precision. Thus, it's very important to have predictable code timing. One can argue that log4cxx is not very well suited for such applications, but surprisingly it works pretty well after some light tuning.
>>
>> So, what were the issues?
>> Basically, from the library user's point of view they all looked the same: all of a sudden, logging done with the LOG4CXX_DEBUG() macro could take an unexpectedly long time to complete. For example, the same trace which takes several μs 99% of the time would sometimes take hundreds of microseconds or even a few milliseconds. After further investigation this was traced down to a few root causes:
>>
>> 1. The async logger (which we have been using, of course) has an internal queue to pass log entries to the background disk-writer thread. This queue is mutex-protected, which might seem fine unless you think a little bit more about it. First of all, someone calling LOG4CXX_DEBUG() to simply put something into the log might not expect to be blocked waiting for a mutex at all. Second, although measures were taken to minimize the time the disk thread holds that lock, OS schedulers often work in such a way that a thread which blocks on a mutex gets de-scheduled. With a normal OS scheduler quantum that means the logging thread can be preempted for milliseconds.
>>
>> 2. There are mutexes protecting the internal state of both loggers and appenders. This means that two separate threads calling LOG4CXX_DEBUG() can block each other. Even if they are using different loggers, they still block on the appender! This has the same consequences for execution timing and performance as described above.
>>
>> 3. The std::stringstream class constructor has some internal locks of its own. Unfortunately, each MessageBuffer has its own instance of this class, and also unfortunately, MessageBuffer is created inside the LOG4CXX_DEBUG() macro.
>> There is an optimization that avoids creating the stringstream when logging simple strings, but as soon as your log statement contains a single '<<' operator, the stringstream is created.
>>
>> 4. Dynamic memory allocations. Unfortunately there are still quite a few of them, even though a memory pool is used in some other places. Thus, hidden calls to new and malloc introduce unpredictable delays.
>>
>> So, what did we do to mitigate these problems?
>>
>> 1. The natural solution for this issue was to use an atomic queue. There are a few of them available, but we made use of boost::lockfree::queue, as it can serve as a drop-in replacement, allowing us to keep all of the present functionality.
>>
>> 2. After looking further into the code, it appeared that two concurrent calls to LOG4CXX_DEBUG() from different threads are not harmful, because the internal structures of the logger and appender are not modified there. The only thing that really requires protection is concurrency between logging and configuring. Thus, we came to a solution: read-write locks, where logging calls act as readers and configuration/exiting calls are writers. With this approach, multiple threads calling LOG4CXX_DEBUG() became free of any contention.
>>
>> 3. This problem also has a simple solution: make one static std::stringstream object per thread using the thread_local keyword. Unfortunately, we found one drawback: thread_local memory is not released if the thread is neither detached nor joined. As there is some code which does neither of these, we made the static stringstream an XML configuration option. Also, there could be an issue with using multiple MessageBuffer instances from within a single thread, but LOG4CXX_DEBUG() does not do that.
>>
>> 4. At this time we didn't do anything to address the dynamic memory allocation issue.
>>
>> So, if you want to give our patch a try, it is attached. It's based on log4cxx 0.10.0-12 as found in Debian. There are new SHARED_MUTEX and LOCK_R/LOCK_W macros defined in mutex.h and synchronized.h for easy switching between RW and conventional mutexes for benchmarking. There is also a test application which spawns two threads doing logging with a small sleep between iterations. It prints to stdout if a trace statement took more than 500 microseconds. It might look familiar to you, because it's based on one of the examples from StackOverflow.
>>
>> In our testing the modified log4cxx has about 20% better performance overall, but more importantly for us it has far fewer cases where a log statement takes excessive time to complete. The second part is only true for CPUs with >2 cores, where all threads can physically run in parallel. Unfortunately, we still see rare cases of prolonged logging calls; we have traced those down to the dynamic memory allocation issue.
>>
>> Any thoughts?
>>
>> Denys Smolianiuk
>>
>> Senior SW Engineer
>> Harmonic Inc.
>>
>
> --
> Matt Sicker <boa...@gmail.com>
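
P.S. A few rough sketches to make the ideas above concrete. These are untested illustrations built on made-up stand-ins, not patches against the actual log4cxx code.

For point 1 in Denys's mail, a mutex-free hand-off between logging threads and the disk-writer thread with boost::lockfree::queue could look roughly like this. Event, producer() and consumer() are simplified stand-ins for the real appender internals; the compile-time capacity<> policy is what keeps push/pop from allocating:

#include <boost/lockfree/queue.hpp>
#include <boost/lockfree/policies.hpp>
#include <atomic>
#include <thread>
#include <cstdio>

// Simplified stand-in for whatever the async appender actually enqueues.
struct Event { int id; };

// Compile-time capacity makes the queue fixed-size, so push/pop never allocate.
boost::lockfree::queue<Event*, boost::lockfree::capacity<1024>> g_queue;
std::atomic<bool> g_done{false};

void producer() {                                  // plays the role of a thread calling LOG4CXX_DEBUG()
    for (int i = 0; i < 1000; ++i) {
        Event* ev = new Event{i};                  // in a real setup this would come from a pool
        while (!g_queue.push(ev)) { }              // queue full: spin (or drop / back off)
    }
}

void consumer() {                                  // plays the role of the background disk-writer thread
    Event* ev = nullptr;
    while (!g_done.load() || !g_queue.empty()) {
        while (g_queue.pop(ev)) {
            std::printf("event %d\n", ev->id);     // "write to disk"
            delete ev;
        }
        std::this_thread::yield();
    }
}

int main() {
    std::thread c(consumer);
    std::thread p(producer);
    p.join();
    g_done.store(true);
    c.join();
    return 0;
}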
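
For point 2, the reader/writer split maps naturally onto std::shared_mutex (C++17; the actual patch defines its own SHARED_MUTEX and LOCK_R/LOCK_W macros, since log4cxx 0.10.0 predates that). A toy logger where log() calls never contend with each other, only with reconfiguration:

#include <shared_mutex>
#include <mutex>
#include <string>
#include <iostream>

// Toy logger: many threads log (they only read the configuration),
// while a configuration call needs exclusive access.
class Logger {
public:
    void log(const std::string& msg) {
        std::shared_lock<std::shared_mutex> lock(mutex_);   // reader: any number may hold this at once
        if (enabled_) {
            std::cout << prefix_ << msg << '\n';
        }
    }

    void configure(std::string prefix, bool enabled) {
        std::unique_lock<std::shared_mutex> lock(mutex_);   // writer: exclusive, briefly blocks loggers
        prefix_  = std::move(prefix);
        enabled_ = enabled;
    }

private:
    std::shared_mutex mutex_;
    std::string prefix_ = "[app] ";
    bool enabled_ = true;
};

int main() {
    Logger logger;
    logger.configure("[tuned] ", true);
    logger.log("hello");   // concurrent log() calls only contend with configure(), not with each other
    return 0;
}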
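
For point 3, the per-thread stream is just the thread_local keyword plus an explicit reset between messages. format_message() is an invented helper, not a log4cxx API; the reuse of one stream is also why holding several such buffers at once from the same thread would be a problem:

#include <sstream>
#include <string>
#include <iostream>

// One stream per thread, constructed once on first use in that thread.
// This avoids paying the std::stringstream constructor on every log statement.
std::string format_message(int value, const char* tag) {
    thread_local std::ostringstream stream;

    stream.str("");      // drop contents left over from the previous message
    stream.clear();      // reset error/eof flags

    stream << tag << ": value=" << value;
    return stream.str();
}

int main() {
    std::cout << format_message(42, "demo") << '\n';
    std::cout << format_message(43, "demo") << '\n';   // reuses the same per-thread stream
    return 0;
}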
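
And for the memory-pool idea from the top of this mail: take raw bytes from the pool, run the constructor in that memory with placement new, and call the destructor explicitly when done [1][2]. FixedPool and LogEvent are invented for the example; the only point being made is the new (raw) T(...) / t->~T() pair:

#include <cstddef>
#include <cstdio>
#include <new>        // placement new

// Invented fixed-size pool, standing in for whatever pool the application already has.
class FixedPool {
public:
    explicit FixedPool(std::size_t bytes) : buf_(new char[bytes]), size_(bytes), used_(0) {}
    ~FixedPool() { delete[] buf_; }

    // Hands out max-aligned blocks; never touches malloc after construction.
    void* allocate(std::size_t bytes) {
        const std::size_t align = alignof(std::max_align_t);
        bytes = (bytes + align - 1) / align * align;
        if (used_ + bytes > size_) return nullptr;   // pool exhausted
        void* p = buf_ + used_;
        used_ += bytes;
        return p;
    }

private:
    char*       buf_;
    std::size_t size_;
    std::size_t used_;
};

// Invented event type with no heap-allocating members, so constructing it costs no malloc.
struct LogEvent {
    char message[64];
    int  level;
    LogEvent(const char* msg, int lvl) : level(lvl) {
        std::snprintf(message, sizeof(message), "%s", msg);
    }
};

int main() {
    FixedPool pool(4096);

    void* raw = pool.allocate(sizeof(LogEvent));
    LogEvent* ev = new (raw) LogEvent("hello", 0);   // placement new: constructor runs in pool memory

    std::printf("%s (level %d)\n", ev->message, ev->level);

    // Placement-new'd objects are destroyed explicitly; the pool only reclaims raw bytes.
    ev->~LogEvent();
    return 0;
}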