On Thu, Aug 23, 2018 at 10:05:30AM +0200, Pietro Stäheli wrote:
> Hi,
>
> openBGPd is running at an internet exchange, two openBSD route servers
> (rs3 on openBSD 6.3 and rs4 on openBSD 6.2, both virtual machines on
> different hypervisors in different locations) connect with peering
> customers.
>
> We've experienced crashes in openBGPd twice in the past two weeks. Both
> times with the same error message: "fatal in RDE: Uh, oh a politician in
> the decision process". These error messages are logged on both route
> servers right before they crash within seconds of each other.
>
> The route servers had been running quite reliably for a long time before
> the recent incidents. The daemon can then be restarted without an issue.
> CPU usage prior to the crash is minimal (<5%).
>
> In the minutes before the crash we're seeing error messages like the
> following in daemon.log:
>
> bgpd[23099]: no such peer: id=4294967037
>
>
> Sample of logs just before the crash:
>
> Aug 22 15:38:58 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 91.206.52.170
> AS6939: update 81.163.124.0/24 via 91.206.52.170
> Aug 22 15:38:58 rs3 bgpd[23099]: no such peer: id=4294967037
> Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::11
> AS31424: withdraw 2a01:6a8::/32
> Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf
> AS33891: withdraw 2a01:6a8::/32
> Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::aa
> AS6939: withdraw 2804:364c::/33
> Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::aa
> AS6939: withdraw 2804:364c:8000::/33
> Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::11
> AS31424: update 2a01:6a8::/32 via 2001:7f8:24::11
> Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::aa
> AS6939: update 2804:364c::/33 via 2001:7f8:24::aa
> Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::aa
> AS6939: update 2804:364c:8000::/33 via 2001:7f8:24::aa
> Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf
> AS33891: update 2a01:6a8::/32 via 2001:7f8:24::bf
> Aug 22 15:39:00 rs3 bgpd[23099]: Connection attempt from neighbor
> 91.206.52.139 while session is in state Idle
> Aug 22 15:39:01 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 91.206.52.96
> AS31042: update 185.64.172.0/24 via 91.206.52.96
> Aug 22 15:39:01 rs3 bgpd[43763]: fatal in RDE: Uh, oh a politician in
> the decision process
> Aug 22 15:39:01 rs3 bgpd[99961]: peer closed imsg connection
> Aug 22 15:39:01 rs3 bgpd[99961]: main: Lost connection to RDE
> Aug 22 15:39:01 rs3 bgpd[23099]: peer closed imsg connection
> Aug 22 15:39:01 rs3 bgpd[23099]: SE: Lost connection to parent
>
>
> Logs just before the "no such peer" messages appear:
>
> Aug 22 15:36:43 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 91.206.52.54
> AS34554: update 80.75.112.0/20 via 91.206.52.54
> Aug 22 15:36:43 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::36
> AS34554: update 2a01:6a8::/32 via 2001:7f8:24::36
> Aug 22 15:36:44 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf
> AS33891: update 2a0d:8d80::/32 via 2001:7f8:24::bf
> Aug 22 15:36:47 rs3 bgpd[23099]: neighbor 91.206.52.96: graceful restart
> of IPv4 unicast, keeping routes
> Aug 22 15:36:47 rs3 bgpd[23099]: neighbor 91.206.52.96: state change
> Established -> Idle, reason: Connection closed
> Aug 22 15:36:47 rs3 bgpd[23099]: neighbor 91.206.52.96: removed
> Aug 22 15:36:49 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::11
> AS31424: withdraw 2a01:6a8::/32
> Aug 22 15:36:49 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf
> AS33891: withdraw 2a01:6a8::/32
> Aug 22 15:36:49 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::11
> AS31424: update 2a01:6a8::/32 via 2001:7f8:24::11
> Aug 22 15:36:49 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf
> AS33891: update 2a01:6a8::/32 via 2001:7f8:24::bf
> Aug 22 15:36:54 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf
> AS33891: update 2a0d:8d80::/32 via 2001:7f8:24::bf
> Aug 22 15:36:55 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 91.206.52.170
> AS6939: update 197.249.160.0/19 via 91.206.52.170
> Aug 22 15:36:55 rs3 bgpd[23099]: no such peer: id=4294967037
>
>
>
> I haven't found much about the error message apart from this mailing
> list thread: https://www.mail-archive.com/[email protected]/msg04565.html
>
> The thread suggests that invoking bgpctl may cause the failure. 'bgpctl
> show' is invoked every few minutes through our monitoring system to
> check on the status of peer connections.
>
> Can anybody point me to a possible cause or troubleshooting regarding
> this issue? Could a misconfigured/broken peer be the cause? Has anybody
> dealt with a similar problem?
>
> I can provide bgpd.conf and full logs of both incidents if necessary.
>
Are you using templates (i.e. neighbors with a netmask in the config)?
The peer id seems to suggest that...
If so, can you add 'announce restart no' to the template and recheck?

I think the issue is that a clone of a previous neighbor is created on
reconnect, and then the stale routes (kept from the graceful restart) and
the new routes of this clone are identical. I need to look into this a bit
more (just returned from vacation).

-- 
:wq Claudio
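
For illustration, a template entry with graceful restart disabled might
look like the sketch below. The prefix 192.0.2.0/24 and the descr string
are placeholders, not taken from the original poster's bgpd.conf; only
'announce restart no' comes from the suggestion above, and all other
neighbor settings are left out.

```
# Hypothetical template: matches any peer connecting from this prefix.
neighbor 192.0.2.0/24 {
	descr "ix-peers-template"	# placeholder description
	announce restart no		# do not announce graceful restart
}
```

The configuration can be syntax-checked with 'bgpd -n' before it is
loaded.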

