Hello Remi, sorry for the late reply. If this is still an issue today, can you please file this upstream at https://github.com/rsyslog/librelp/ as a pull request. Changes like this should be reviewed and applied by upstream.
Thanks, Michael On Thu, 27 Nov 2014 18:39:15 +0100 Remi Palancher <r...@rezib.org> wrote: > Package: librelp0 > Version: 1.0.0-1 > Severity: important > Tags: upstream patch > > Dear Maintainer, > > While using rsyslog with its relp plugin linked to librelp0, I encountered a > bug where my central rsyslog server entered into an endless loop and it then > basically stopped processing the logs. The rsyslogd process was fully using > one > core on the system. > > To help understand this behaviour, I ran rsyslog in foreground with debug > flag: > > # rsyslogd -c5 -n -d > > The resulting output can be found in attached file > rsyslog_debug_endless_loop_truncated.log.xz. I largely truncated the file to > significantly reduce its size. Feel free to tell me if you need to have a look > to original file 6MB though. You may notice in the output this repeating line > : > "tcpSend returns 0". > > After a deeper analysis I figured out that it was due to one particular > "client" > node whose TCP socket was in a weird state. Every send() that were performed > by > librelp0 on this socket returned -1 and set errno to EAGAIN, endlessly. By > endlessly, I mean approximately 20 days. > > Here the backtrace according to GDB on the breakpoint tcp.c:488: > > #0 relpTcpSend (pThis=pThis@entry=0x69e1c0, pBuf=<optimized out>, > pLenBuf=pLenBuf@entry=0x7ffff475ead8) at tcp.c:488 > #1 0x00007ffff5f71482 in relpSendbufSend (pThis=0x69bbb0, > pTcp=pTcp@entry=0x69e1c0) at sendbuf.c:131 > #2 0x00007ffff5f71274 in relpSendqSend (pThis=pThis@entry=0x6992d0, > pTcp=pTcp@entry=0x69e1c0) at sendq.c:249 > #3 0x00007ffff5f7133a in relpSendqAddBuf (pThis=0x6992d0, pBuf=<optimized > out>, pTcp=0x69e1c0) at sendq.c:170 > #4 0x00007ffff5f6f2a5 in relpSessSendResponse (pThis=pThis@entry=0x69d210, > txnr=<optimized out>, pData=<optimized out>, lenData=<optimized out>) at > relpsess.c:248 > #5 0x00007ffff5f725c9 in relpSCInit (pFrame=pFrame@entry=0x699510, > pSess=pSess@entry=0x69d210) at copen.c:141 > #6 0x00007ffff5f6e863 in relpEngineDispatchFrame (pThis=<optimized out>, > pSess=pSess@entry=0x69d210, pFrame=pFrame@entry=0x699510) at relp.c:474 > #7 0x00007ffff5f7005f in relpFrameProcessOctetRcvd > (ppThis=ppThis@entry=0x69d230, c=10 '\n', pSess=pSess@entry=0x69d210) at > relpframe.c:207 > #8 0x00007ffff5f6ec07 in relpSessRcvData (pThis=0x69d210) at relpsess.c:224 > #9 0x00007ffff5f6e709 in relpEngineRun (pThis=0x6745a0) at relp.c:405 > #10 0x000000000043ae06 in ?? () > #11 0x00007ffff79b0b50 in start_thread () from > /lib/x86_64-linux-gnu/libpthread.so.0 > #12 0x00007ffff72eee6d in clone () from /lib/x86_64-linux-gnu/libc.so.6 > #13 0x0000000000000000 in ?? () > > Function relpSendqSend() has a while loop that loops until there is no > data to send anymore. But with this weird socket state, the loop becomes > endless. When used by rsyslog, this endless loop results in a DOS on the > central rsyslog server. > > The attached patch abort-send-loop-after-max-tries.diff is a proposal to fix > this bug. It counts the number of tries, and if it reaches 100 tries, > relpSendqSend() returns RELP_RET_PARTIAL_WRITE. Note that this return value > is already "managed" by relpSendqAddBuf() in some way. > > It would probably be better to tell rsyslog about this error, other than a > simple debug message, so that it could then appear in rsyslog own logs and > warn the sysadmins of this weird socket/node. But I'm not familiar with > librepl0 and rsyslog code base so I haven't taken time to get deeper into > this. >
signature.asc
Description: OpenPGP digital signature