Hello Remi,

sorry for the late reply.
If this is still an issue today, can you please file this upstream at
https://github.com/rsyslog/librelp/
as a pull request. Changes like this should be reviewed and applied by
upstream.

Thanks,
Michael


On Thu, 27 Nov 2014 18:39:15 +0100 Remi Palancher <r...@rezib.org> wrote:
> Package: librelp0
> Version: 1.0.0-1
> Severity: important
> Tags: upstream patch
> 
> Dear Maintainer,
> 
> While using rsyslog with its relp plugin linked to librelp0, I encountered a
> bug where my central rsyslog server entered into an endless loop and it then
> basically stopped processing the logs. The rsyslogd process was fully using 
> one
> core on the system.
> 
> To help understand this behaviour, I ran rsyslog in foreground with debug 
> flag:
> 
> # rsyslogd -c5 -n -d
> 
> The resulting output can be found in attached file
> rsyslog_debug_endless_loop_truncated.log.xz. I largely truncated the file to
> significantly reduce its size. Feel free to tell me if you need to have a look
> to original file 6MB though. You may notice in the output this repeating line 
> :
> "tcpSend returns 0".
> 
> After a deeper analysis I figured out that it was due to one particular 
> "client"
> node whose TCP socket was in a weird state. Every send() that were performed 
> by
> librelp0 on this socket returned -1 and set errno to EAGAIN, endlessly. By
> endlessly, I mean approximately 20 days.
> 
> Here the backtrace according to GDB on the breakpoint tcp.c:488:
> 
> #0  relpTcpSend (pThis=pThis@entry=0x69e1c0, pBuf=<optimized out>, 
> pLenBuf=pLenBuf@entry=0x7ffff475ead8) at tcp.c:488
> #1  0x00007ffff5f71482 in relpSendbufSend (pThis=0x69bbb0, 
> pTcp=pTcp@entry=0x69e1c0) at sendbuf.c:131
> #2  0x00007ffff5f71274 in relpSendqSend (pThis=pThis@entry=0x6992d0, 
> pTcp=pTcp@entry=0x69e1c0) at sendq.c:249
> #3  0x00007ffff5f7133a in relpSendqAddBuf (pThis=0x6992d0, pBuf=<optimized 
> out>, pTcp=0x69e1c0) at sendq.c:170
> #4  0x00007ffff5f6f2a5 in relpSessSendResponse (pThis=pThis@entry=0x69d210, 
> txnr=<optimized out>, pData=<optimized out>, lenData=<optimized out>) at 
> relpsess.c:248
> #5  0x00007ffff5f725c9 in relpSCInit (pFrame=pFrame@entry=0x699510, 
> pSess=pSess@entry=0x69d210) at copen.c:141
> #6  0x00007ffff5f6e863 in relpEngineDispatchFrame (pThis=<optimized out>, 
> pSess=pSess@entry=0x69d210, pFrame=pFrame@entry=0x699510) at relp.c:474
> #7  0x00007ffff5f7005f in relpFrameProcessOctetRcvd 
> (ppThis=ppThis@entry=0x69d230, c=10 '\n', pSess=pSess@entry=0x69d210) at 
> relpframe.c:207
> #8  0x00007ffff5f6ec07 in relpSessRcvData (pThis=0x69d210) at relpsess.c:224
> #9  0x00007ffff5f6e709 in relpEngineRun (pThis=0x6745a0) at relp.c:405
> #10 0x000000000043ae06 in ?? ()
> #11 0x00007ffff79b0b50 in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #12 0x00007ffff72eee6d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #13 0x0000000000000000 in ?? ()
> 
> Function relpSendqSend() has a while loop that loops until there is no
> data to send anymore. But with this weird socket state, the loop becomes
> endless. When used by rsyslog, this endless loop results in a DOS on the
> central rsyslog server.
> 
> The attached patch abort-send-loop-after-max-tries.diff is a proposal to fix
> this bug. It counts the number of tries, and if it reaches 100 tries, 
> relpSendqSend() returns RELP_RET_PARTIAL_WRITE. Note that this return value
> is already "managed" by relpSendqAddBuf() in some way.
> 
> It would probably be better to tell rsyslog about this error, other than a
> simple debug message, so that it could then appear in rsyslog own logs and
> warn the sysadmins of this weird socket/node. But I'm not familiar with
> librepl0 and rsyslog code base so I haven't taken time to get deeper into
> this.
> 

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to