All these sound like very nice improvements!

On Fri, Aug 17, 2018 at 4:13 AM, Robert Middleton <osfan6...@gmail.com> wrote:

Interesting results! I haven't experienced any problems with logging
myself, but I've also never benchmarked it.

I don't have much experience with memory pools, but if you're still
getting blocked on a call to new/malloc, you may be able to get around
it by allocating from the memory pool and then calling the constructor
in place (I have never done this myself, but I know it is possible to
do) [1][2]. Alternatively, there may be some way to do it using a
custom allocator, though that is also something I have never done.

-Robert Middleton

[1]: https://stackoverflow.com/questions/519808/call-a-constructor-on-a-already-allocated-memory
[2]: https://stackoverflow.com/questions/222557/what-uses-are-there-for-placement-new
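A minimal sketch of that placement-new idea, assuming the raw storage has
already been obtained from a pool (the LogEvent type here is illustrative,
not the actual log4cxx event class):

    #include <new>       // placement new
    #include <string>
    #include <utility>   // std::move

    // Illustrative event type -- not the real log4cxx LoggingEvent.
    struct LogEvent {
        std::string message;
        explicit LogEvent(std::string msg) : message(std::move(msg)) {}
    };

    int main() {
        // Pretend this storage came from a memory pool, so no call to
        // new/malloc happens on the logging path.
        alignas(LogEvent) unsigned char slot[sizeof(LogEvent)];

        // Placement new: run the constructor in the pre-allocated storage.
        LogEvent* ev = new (slot) LogEvent("hello");

        // Objects built this way must be destroyed explicitly; the pool
        // then reclaims the raw bytes.
        ev->~LogEvent();
    }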
On Thu, Aug 16, 2018 at 11:58 AM, Matt Sicker <boa...@gmail.com> wrote:

I don't know much about the state of the log4cxx architecture, but nearly
all of your points (other than the lock in stringstream) are points we
optimize for in log4j2, at least. Even the stringstream optimization
sounds similar to StringBuffer versus StringBuilder in Java. As for the
queue used in async logging, I'm not sure what guarantees you get in C++
memory models, but I'm curious whether the disruptor queue pattern made
its way over to Boost?

On Thu, 16 Aug 2018 at 10:37, Denys Smolianiuk <denys.smolian...@harmonicinc.com> wrote:

Hello All,

I'd like to share some experience, as well as some patches, with regard
to using log4cxx in a timing-critical application. First, a few words
about our requirements: it's a service which must generate network
packets with precision of up to a hundred microseconds. Thus, it's very
important to have predictable code timing. One could argue that log4cxx
is not well suited for such applications, but surprisingly it works
pretty well after some light tuning.

So, what were the issues?
From the library user's point of view they all looked the same: all of a
sudden, a logging call made with the LOG4CXX_DEBUG() macro could take
unexpectedly long to complete. For example, the same trace that takes a
few microseconds 99% of the time would sometimes take hundreds of
microseconds, or even a few milliseconds. Further investigation traced
this down to a few root causes:

1. The async logger (which we have been using, of course) has an internal
queue for passing log entries to the background disk-writer thread. This
queue is mutex-protected, which might seem fine until you think about it
a little more. First, someone calling LOG4CXX_DEBUG() simply to put
something into the log may not expect to block on a mutex at all. Second,
although measures were taken to minimize the time the disk thread holds
that lock, OS schedulers often de-schedule a thread that blocks on a
mutex. With a typical scheduler quantum, that means the logging thread
can be preempted for milliseconds.

2. There are mutexes protecting the internal state of both loggers and
appenders. This means that two separate threads calling LOG4CXX_DEBUG()
can block each other; even if they use different loggers, they will block
on the appender! This has the same consequences for execution timing and
performance as described above.

3. The std::stringstream constructor takes some internal locks of its
own. Unfortunately, each MessageBuffer has its own instance of this
class, and MessageBuffer is created inside the LOG4CXX_DEBUG() macro.
There is an optimization that avoids creating the stringstream when
logging plain strings, but as soon as your log statement contains a
single '<<' operator, it is created.

4. Dynamic memory allocations. Unfortunately, there are still quite a few
of them, even though a memory pool is used in some other places. These
hidden calls to new and malloc introduce unpredictable delays.

So, what did we do to mitigate these problems? (Minimal sketches of the
fixes for points 1-3 follow the list.)

1. The natural solution for the queue was an atomic queue. A few of them
are available; we used boost::lockfree::queue, as it can serve as a
drop-in replacement, allowing us to keep all existing functionality.

2. After looking further into the code, it turned out that two concurrent
LOG4CXX_DEBUG() calls from different threads are harmless, because they
do not modify the internal structures of the logger or appender. What
really needs protection is the concurrency between logging and
configuration. Thus we arrived at a solution: read-write locks, where
logging calls act as readers and configuration/shutdown calls act as
writers. With this approach, multiple threads calling LOG4CXX_DEBUG()
became free of any contention.

3. This problem also has a simple solution: one static std::stringstream
object per thread, via the thread_local keyword. Unfortunately, we found
one drawback: thread_local memory is not released if a thread is neither
detached nor joined, and since some code does neither, we made the static
stringstream an XML configuration option. There could also be an issue
with using multiple MessageBuffer instances from within a single thread,
but LOG4CXX_DEBUG() does not do that.

4. At this time we did not do anything to address the dynamic memory
allocation issue.
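A sketch of the lock-free queue from point 1. boost::lockfree::queue
requires trivially copyable elements, so events are queued by pointer
here; the LogEvent type and thread functions are illustrative, not the
patch's actual code:

    #include <boost/lockfree/queue.hpp>
    #include <string>
    #include <thread>

    // Illustrative payload; the real patch queues log4cxx's own entries.
    struct LogEvent {
        std::string text;
    };

    // 1024 internal nodes are preallocated up front; elements must be
    // trivially copyable, hence the raw pointer.
    boost::lockfree::queue<LogEvent*> events{1024};

    void loggingThread() {
        // push() never blocks on a mutex, so the caller cannot be
        // de-scheduled while waiting for the disk-writer thread.
        events.push(new LogEvent{"some message"});
    }

    void diskWriterThread() {
        LogEvent* ev = nullptr;
        while (events.pop(ev)) {  // drain whatever has been queued so far
            // ... format and write ev->text to disk ...
            delete ev;
        }
    }

    int main() {
        std::thread producer(loggingThread);
        std::thread consumer(diskWriterThread);
        producer.join();
        consumer.join();
    }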
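A sketch of the reader-writer locking from point 2. The patch defines its
own SHARED_MUTEX and LOCK_R/LOCK_W macros; std::shared_mutex (C++17) is
used below purely to illustrate the idea:

    #include <mutex>
    #include <shared_mutex>

    // Illustrative appender: logging only reads internal state, so it
    // takes a shared (reader) lock; reconfiguration takes an exclusive
    // (writer) lock.
    class Appender {
    public:
        void append(const char* message) {
            // LOCK_R: any number of logging threads can hold this at
            // once, so concurrent LOG4CXX_DEBUG() calls do not contend.
            std::shared_lock<std::shared_mutex> readLock(mutex_);
            // ... format `message` and deliver it ...
        }

        void reconfigure() {
            // LOCK_W: configuration excludes all concurrent logging calls.
            std::unique_lock<std::shared_mutex> writeLock(mutex_);
            // ... mutate appender state ...
        }

    private:
        mutable std::shared_mutex mutex_;
    };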
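And a sketch of the per-thread stream from point 3; formatMessage() is a
hypothetical helper, not the log4cxx API. Note the drawback mentioned
above: thread_local storage is only reclaimed for threads that are joined
or detached, which is why the patch makes this behavior configurable:

    #include <sstream>
    #include <string>

    std::string formatMessage(int value) {
        // Constructed once per thread (paying the stringstream's internal
        // locking once), then reused by every later call on that thread.
        thread_local std::ostringstream buf;
        buf.str("");   // drop previous contents, keep the allocated buffer
        buf.clear();   // reset any error flags
        buf << "value=" << value;
        return buf.str();
    }

    int main() {
        formatMessage(42);  // first call constructs the stream
        formatMessage(43);  // later calls reuse it
    }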
So, if you want to give our patch a try, it is attached. It is based on
log4cxx 0.10.0-12 as found in Debian. There are new SHARED_MUTEX and
LOCK_R/LOCK_W macros defined in mutex.h and synchronized.h for easy
switching between RW and conventional mutexes for benchmarking. There is
also a test application which spawns two threads that log with a small
sleep between iterations; it prints to stdout whenever a trace statement
takes more than 500 microseconds. It might look familiar to you, because
it is based on one of the examples from Stack Overflow.

In our testing, the modified log4cxx has about 20% better overall
performance but, more importantly for us, far fewer cases where a log
statement takes excessive time to complete. The second part only holds
for CPUs with more than 2 cores, where all threads can physically run in
parallel. We still see rare cases of prolonged logging calls; we have
traced those down to the dynamic memory allocation issue.

Any thoughts?

Denys Smolianiuk

Senior SW Engineer
Harmonic Inc.
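For reference, a minimal harness with the same shape as the test
application described above; logStatement() stands in for a real
LOG4CXX_DEBUG() call, and the iteration count and sleep are illustrative:

    #include <chrono>
    #include <cstdio>
    #include <thread>

    // Stand-in for LOG4CXX_DEBUG(logger, "iteration " << i).
    void logStatement(int i) {
        volatile int sink = i;  // keep the call from being optimized away
        (void)sink;
    }

    // Each worker logs in a loop with a small sleep and reports any
    // single call that takes longer than 500 microseconds.
    void worker() {
        using clock = std::chrono::steady_clock;
        for (int i = 0; i < 100000; ++i) {
            auto start = clock::now();
            logStatement(i);
            auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                          clock::now() - start).count();
            if (us > 500)
                std::printf("slow log call: %lld us\n",
                            static_cast<long long>(us));
            std::this_thread::sleep_for(std::chrono::microseconds(50));
        }
    }

    int main() {
        std::thread t1(worker);
        std::thread t2(worker);
        t1.join();
        t2.join();
    }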