Interesting results!  I haven't experienced any problems with logging
myself, but I've also never done any benchmarking of it.

I don't have much experience with memory pools, but if you're still
getting blocked on calls to new/malloc you may be able to get around
it by allocating from the memory pool and then calling the constructor
in-place with placement new (I have never done this before, but I know
it is possible to do) [1][2].  Or there may be some way to do it using
a custom allocator; that's something that I have never done either.
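
To make the idea concrete, here is a rough, untested sketch of
constructing an object in-place in pool-provided storage; the LogEvent
struct and the stack buffer standing in for the pool are made up for
the example:

    #include <new>       // placement new
    #include <cstdio>

    struct LogEvent {
        int         level;
        const char* msg;
        LogEvent(int l, const char* m) : level(l), msg(m) {}
    };

    int main()
    {
        // Stand-in for raw memory handed out by a pool; any suitably
        // aligned storage behaves the same way.
        alignas(LogEvent) unsigned char slot[sizeof(LogEvent)];

        LogEvent* ev = new (slot) LogEvent(7, "hello"); // construct in-place,
                                                        // no new/malloc call
        std::printf("%d %s\n", ev->level, ev->msg);
        ev->~LogEvent();  // destroy explicitly; the storage itself just
                          // goes back to the pool untouched
        return 0;
    }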

-Robert Middleton

[1]: https://stackoverflow.com/questions/519808/call-a-constructor-on-a-already-allocated-memory
[2]: https://stackoverflow.com/questions/222557/what-uses-are-there-for-placement-new

On Thu, Aug 16, 2018 at 11:58 AM, Matt Sicker <boa...@gmail.com> wrote:
> I don't know much about the state of the log4cxx architecture, but nearly all
> of your points (other than the lock in stringstream) are points we optimize
> for in log4j2, at least. Even the stringstream optimization sounds similar
> to StringBuffer versus StringBuilder in Java. As for the queue used in
> async logging, I'm not sure what guarantees you get from the C++ memory
> model, but I'm curious whether the disruptor queue pattern has made its way
> over to Boost?
>
> On Thu, 16 Aug 2018 at 10:37, Denys Smolainiuk <
> denys.smolian...@harmonicinc.com> wrote:
>
>> Hello All,
>>
>> I'd like to share some experience, as well as some patches, with regard to
>> using log4cxx in a timing-critical application. First, a few words about our
>> requirements: it's a service which must generate network packets with a
>> precision of up to a hundred microseconds. Thus, it's very important to have
>> predictable code timing. One can argue that log4cxx is not very well suited
>> for such applications, but surprisingly it works pretty well after some
>> light tuning.
>>
>> So, what were the issues?
>> Basically, from the library user's point of view they all looked the same:
>> all of a sudden, a log statement issued with the LOG4CXX_DEBUG() macro could
>> take an unexpectedly long time to complete. For example, the same trace
>> which takes several μs 99% of the time would sometimes take hundreds of
>> microseconds or even a few milliseconds. After further investigation this
>> was traced down to a few root causes:
>>
>> 1. The async logger (which we have been using, of course) has an internal
>> queue to pass log entries to the background disk-writer thread. This queue
>> is mutex-protected, which might seem fine unless you think a little bit more
>> about it. First of all, someone calling LOG4CXX_DEBUG() to simply put
>> something into the log might not expect to be blocked waiting for a mutex at
>> all. The second point is that, although measures were taken to minimize the
>> time the disk-writer thread holds that lock, OS schedulers often work in
>> such a way that a thread which blocks on a mutex gets de-scheduled. With a
>> normal OS-scheduler quantum that means the logging thread can be preempted
>> for milliseconds.
>>
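>> To illustrate the shape of the problem, a simplified sketch (not the actual
>> log4cxx code) of a mutex-protected producer path:
>>
>>     #include <deque>
>>     #include <mutex>
>>     #include <string>
>>
>>     std::mutex              queue_mutex;
>>     std::deque<std::string> pending;
>>
>>     // Called from the logging thread: if the disk-writer thread holds the
>>     // mutex and gets preempted, the logging thread blocks here and is
>>     // de-scheduled, in the worst case for a full scheduler quantum.
>>     void enqueue(const std::string& entry)
>>     {
>>         std::lock_guard<std::mutex> lock(queue_mutex);
>>         pending.push_back(entry);
>>     }
>>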
>> 2. There are mutexes protecting the internal state of both loggers and
>> appenders. This means that two separate threads calling LOG4CXX_DEBUG() can
>> block each other. Even if they are using different loggers, they will block
>> on the appender! This has the same consequences for execution timing and
>> performance as described above.
>>
>> 3. The std::stringstream constructor has some internal locks of its own.
>> Unfortunately each MessageBuffer has its own instance of this class, and,
>> also unfortunately, a MessageBuffer is created inside the LOG4CXX_DEBUG()
>> macro. There is an optimization that avoids creating the stringstream when
>> logging simple strings, but as soon as your log statement contains a single
>> '<<' operator it gets created.
>>
>> 4. Dynamic memory allocations. Unfortunately there are still quite a few of
>> them, even though a memory pool is used in some other places. These hidden
>> calls to new and malloc introduce unpredictable delays.
>>
>> So, what did we do to mitigate these problems?
>>
>> 1. The natural solution for this issue was to use a lock-free queue. There
>> are a few of them available; we made use of boost::lockfree::queue as it can
>> serve as a drop-in replacement, allowing us to keep all of the existing
>> functionality.
>>
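>> Roughly the shape of the change (simplified; the queue holds pointers
>> because boost::lockfree::queue requires elements with trivial
>> copy/destruction, and the Entry type here is just a placeholder):
>>
>>     #include <boost/lockfree/queue.hpp>
>>
>>     struct Entry;   // the real code keeps the existing log4cxx event type
>>
>>     // Fixed-capacity, lock-free multi-producer/multi-consumer queue:
>>     // push() and pop() never take a mutex, so a logging thread can no
>>     // longer be de-scheduled while waiting on the disk-writer thread.
>>     boost::lockfree::queue<Entry*> q(1024);
>>
>>     void produce(Entry* e) { while (!q.push(e)) { /* queue full */ } }
>>
>>     void consume()
>>     {
>>         Entry* e = 0;
>>         while (q.pop(e)) { /* format and write to disk */ }
>>     }
>>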
>> 2. After looking further into the code, it turned out that two concurrent
>> calls to LOG4CXX_DEBUG() from different threads are not harmful, because the
>> internal structures of the logger and appender are not modified there. The
>> only thing that really requires protection is the concurrency between
>> logging and configuring. Thus, we came to a solution: read-write locks,
>> where logging calls act as readers and configuration/shutdown calls act as
>> writers. With such an approach, multiple threads calling LOG4CXX_DEBUG()
>> became free of any contention.
>>
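>> In std terms the pattern looks roughly like this (the patch itself does it
>> via the SHARED_MUTEX/LOCK_R/LOCK_W macros mentioned below rather than C++17
>> primitives):
>>
>>     #include <shared_mutex>
>>
>>     std::shared_mutex config_mutex;   // guards logger/appender state
>>
>>     void log_event(/* ... */)
>>     {
>>         // Readers: any number of logging threads may hold this at once.
>>         std::shared_lock<std::shared_mutex> r(config_mutex);
>>         // format and append; internal state is only read here
>>     }
>>
>>     void reconfigure(/* ... */)
>>     {
>>         // Writer: exclusive access while changing appenders/levels or
>>         // shutting down.
>>         std::unique_lock<std::shared_mutex> w(config_mutex);
>>     }
>>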
>> 3. This problem also has a simple solution: make one static
>> std::stringstream object per thread using the thread_local keyword.
>> Unfortunately we found one drawback: thread_local memory is not released if
>> the thread is neither detached nor joined. As there is some code which does
>> neither of these, we made the static stringstream an xml-file configuration
>> option. There could also be an issue with using multiple MessageBuffer
>> instances from within a single thread, but LOG4CXX_DEBUG() does not do that.
>>
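>> The idea in isolation (simplified; in the patch this sits behind the xml
>> configuration option and behind the existing MessageBuffer interface):
>>
>>     #include <sstream>
>>     #include <string>
>>
>>     // One stream per thread, constructed once and reused, so the
>>     // stringstream constructor cost (and its internal locking) is paid
>>     // only on the first log statement of each thread.
>>     std::string format(int value)
>>     {
>>         thread_local std::ostringstream oss;
>>         oss.str("");    // reset contents between uses
>>         oss.clear();    // reset any error flags
>>         oss << "value=" << value;
>>         return oss.str();
>>     }
>>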
>> 4. At this time we didn't do anything to address the dynamic memory
>> allocation issue.
>>
>> So, if you want to give our patch a try, it is attached. It's based on
>> log4cxx 0.10.0-12 as found in Debian. There are new SHARED_MUTEX and
>> LOCK_R/LOCK_W macros defined in mutex.h and synchronized.h for easy
>> switching between RW and conventional mutexes for benchmarking. Also, there
>> is a test application which spawns two threads doing logging with a small
>> sleep between iterations. It prints to stdout if a trace statement took
>> more than 500 microseconds. It might look familiar to you because it's
>> based on one of the examples from StackOverflow.
>>
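>> The measurement loop in each thread is essentially this shape (iteration
>> count and sleep duration here are placeholders, and the actual log4cxx call
>> is commented out to keep the sketch self-contained):
>>
>>     #include <chrono>
>>     #include <iostream>
>>     #include <thread>
>>
>>     void worker(/* log4cxx::LoggerPtr logger */)
>>     {
>>         using clock = std::chrono::steady_clock;
>>         for (int i = 0; i < 100000; ++i) {
>>             auto t0 = clock::now();
>>             // LOG4CXX_DEBUG(logger, "iteration " << i);
>>             auto us = std::chrono::duration_cast<std::chrono::microseconds>(
>>                           clock::now() - t0).count();
>>             if (us > 500)
>>                 std::cout << "slow log statement: " << us << " us\n";
>>             std::this_thread::sleep_for(std::chrono::microseconds(50));
>>         }
>>     }
>>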
>> In our testing the modified log4cxx has about 20% better performance
>> overall, but, more importantly for us, it has far fewer cases where a log
>> statement takes an excessive time to complete. The second part is only true
>> for CPUs with >2 cores, where all threads can physically run in parallel.
>> Unfortunately we still see rare cases of prolonged logging calls; we have
>> traced those down to the dynamic memory allocation issue.
>>
>> Any thoughts?
>>
>> Denys Smolianiuk
>>
>> Senior SW Engineer
>> Harmonic Inc.
>>
>>
>
> --
> Matt Sicker <boa...@gmail.com>
