Hello,

On Wed, Apr 20, 2022 at 03:43:06PM +0200, Alexander Bluhm wrote:
> On Sat, Apr 09, 2022 at 01:51:05AM +0200, Alexandr Nedvedicky wrote:
> > updated diff is below.
>
> I am not sure what Hrvoje actually did test and what not.  My
> impression was, that he got a panic with the previous version of
> this diff, but the machine was stable with the code in current.
>
> But maybe I got it wrong and we need this code to run pfsync with
> IPsec in parallel.
That's correct. Hrvoje was testing several diffs stacked on each other:

	a diff which enables parallel forwarding
	a diff which fixes tunnel descriptor block (tdb) handling for IPsec
	a diff which fixes pfsync

> In general it looks good, a few comments inline.
>
> > --------8<---------------8<---------------8<------------------8<--------
> > diff --git a/sys/net/if_pfsync.c b/sys/net/if_pfsync.c
> > index cb0f3fbdf52..536c3f9cb70 100644
> > --- a/sys/net/if_pfsync.c
> > +++ b/sys/net/if_pfsync.c
> > @@ -181,6 +181,7 @@ void	pfsync_q_del(struct pf_state *);
> >
> >  struct pfsync_upd_req_item {
> >  	TAILQ_ENTRY(pfsync_upd_req_item)	ur_entry;
> > +	TAILQ_ENTRY(pfsync_upd_req_item)	ur_snap;
>
> Do we really need two list entries?  My understanding is that the
> element is either in the sc_upd_req_list or in sn_upd_req_list.  We
> could use the same entry for both.  But of course with two entries
> it is easier to see what is going on.
>

    I prefer to go with two list entries until I better understand the
    impact of parallel execution. Note that the code in current uses a
    single list entry for both the queues and the snapshots. Hrvoje
    reported a panic where a state entry got updated and
    pfsync_update_state() got called while the same state entry was
    still present on a snapshot queue. pfsync_update_state() on cpu0
    simply inserted the state into an update queue, setting a trap for
    cpu1, which was just processing the snapshots.

    Perhaps we will be able to get back to a single list entry, but
    then we will have to extend the current logic in
    pfsync_update_state() to also check whether the entry still waits
    to be dispatched.

</snip>

> > @@ -1583,14 +1586,36 @@ pfsync_grab_snapshot(struct pfsync_snapshot *sn,
> > struct pfsync_softc *sc)
> >
> >  	for (q = 0; q < PFSYNC_S_COUNT; q++) {
> >  		TAILQ_INIT(&sn->sn_qs[q]);
> > -		TAILQ_CONCAT(&sn->sn_qs[q], &sc->sc_qs[q], sync_list);
> > +
> > +		while ((st = TAILQ_FIRST(&sc->sc_qs[q])) != NULL) {
> > +#ifdef PFSYNC_DEBUG
> > +			KASSERT(st->snapped == 0);
> > +#endif
>
> I see that there are other #ifdef PFSYNC_DEBUG.  But why would you
> hide a cheap KASSERT behind another ifdef?  If something is wrong
> I want to see the crash and not only while debugging.
>
> I will send a diff that removes existing PFSYNC_DEBUG.

    I've removed all PFSYNC_DEBUG ifdefs from my diff.

</snip>

> > +#ifdef PFSYNC_DEBUG
> > +		KASSERT(!ISSET(tdb->tdb_flags, TDBF_PFSYNC_SNAPPED));
> > +#endif
> > +		/* SET(tdb->tdb_flags, TDBF_PFSYNC_SNAPPED); */
>
> This comment looks like a debugging leftover.

    Yes, indeed. The comment got removed.

</snip>

> > @@ -2525,11 +2569,23 @@ pfsync_delete_tdb(struct tdb *t)
> >
> >  	mtx_enter(&sc->sc_tdb_mtx);
> >
> > +	/*
> > +	 * if tdb entry is just being processed (found in snapshot),
> > +	 * then it can not be deleted. we just came too late
> > +	 */
> > +	if (ISSET(t->tdb_flags, TDBF_PFSYNC_SNAPPED)) {
> > +		mtx_leave(&sc->sc_tdb_mtx);
> > +		return;
> > +	}
> > +
> >  	TAILQ_REMOVE(&sc->sc_tdb_q, t, tdb_sync_entry);
> > +
> >  	mtx_enter(&t->tdb_mtx);
> >  	CLR(t->tdb_flags, TDBF_PFSYNC);
> >  	mtx_leave(&t->tdb_mtx);
> >
> > +	tdb_unref(t);
>
> This should be moved after mtx_leave(&sc->sc_tdb_mtx).  Unref may
> free and who knows what happens.  It is not wise to hold a mutex
> for more than needed.
>

    I see; tdb_unref() is outside of the sc_tdb_mtx scope in the new
    diff.

</snip>

> >  	TAILQ_ENTRY(pf_state)	 sync_list;
> > +	TAILQ_ENTRY(pf_state)	 sync_snap;
>
> Again, do we need sync_snap or could we reuse sync_list?
>

</snip>

>
> Again, do we need tdb_sync_entry or could we reuse tdb_sync_snap?
> I am not demanding it.  If you think the code is clearer with two
> entries that would be fine.
>

    Let's go with dedicated list entries for snapshots, at least for a
    while; I plan to revisit if_pfsync.c anyway.
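    To make that reasoning concrete, here is a minimal userland sketch
    (illustration only, not part of the diff; struct item, ent_pending
    and ent_snap are made-up names). It shows how a second TAILQ_ENTRY
    lets the same object sit on a pending queue and on a snapshot queue
    independently; with a single shared entry, an update path
    re-inserting a state that still sits on a snapshot would reuse the
    same linkage and corrupt one of the lists.

	#include <sys/queue.h>
	#include <stdio.h>

	struct item {
		int			 id;
		TAILQ_ENTRY(item)	 ent_pending;	/* like sync_list */
		TAILQ_ENTRY(item)	 ent_snap;	/* like sync_snap */
	};

	TAILQ_HEAD(itemq, item);

	int
	main(void)
	{
		struct itemq pending = TAILQ_HEAD_INITIALIZER(pending);
		struct itemq snap = TAILQ_HEAD_INITIALIZER(snap);
		struct item a = { .id = 1 };
		struct item *it;

		/* queue the item for transmission */
		TAILQ_INSERT_TAIL(&pending, &a, ent_pending);

		/* "grab snapshot": move it onto the snapshot queue */
		TAILQ_REMOVE(&pending, &a, ent_pending);
		TAILQ_INSERT_TAIL(&snap, &a, ent_snap);

		/* ent_pending is free again; the snapshot keeps its own links */
		TAILQ_FOREACH(it, &snap, ent_snap)
			printf("snapped item %d\n", it->id);

		return (0);
	}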
    The version in the diff below is good enough to survive Hrvoje's
    torture testing. If I remember correctly, the if_pfsync.c we
    currently have in the tree trips a panic every two or three days
    with parallel forwarding enabled. The diff below survived four days
    under similar pressure.

updated diff is below

thanks and
regards
sashan

--------8<---------------8<---------------8<------------------8<--------
diff --git a/sys/net/if_pfsync.c b/sys/net/if_pfsync.c
index fc6843b541f..3061318cec9 100644
--- a/sys/net/if_pfsync.c
+++ b/sys/net/if_pfsync.c
@@ -181,6 +181,7 @@ void	pfsync_q_del(struct pf_state *);
 
 struct pfsync_upd_req_item {
 	TAILQ_ENTRY(pfsync_upd_req_item)	ur_entry;
+	TAILQ_ENTRY(pfsync_upd_req_item)	ur_snap;
 	struct pfsync_upd_req			ur_msg;
 };
 TAILQ_HEAD(pfsync_upd_reqs, pfsync_upd_req_item);
@@ -295,7 +296,7 @@ void	pfsync_bulk_update(void *);
 void	pfsync_bulk_fail(void *);
 
 void	pfsync_grab_snapshot(struct pfsync_snapshot *, struct pfsync_softc *);
-void	pfsync_drop_snapshot(struct pfsync_snapshot *, struct pfsync_softc *);
+void	pfsync_drop_snapshot(struct pfsync_snapshot *);
 
 void	pfsync_send_dispatch(void *);
 void	pfsync_send_pkt(struct mbuf *);
@@ -422,8 +423,7 @@ pfsync_clone_destroy(struct ifnet *ifp)
 	sc->sc_deferred = 0;
 	mtx_leave(&sc->sc_deferrals_mtx);
 
-	while (!TAILQ_EMPTY(&deferrals)) {
-		pd = TAILQ_FIRST(&deferrals);
+	while ((pd = TAILQ_FIRST(&deferrals)) != NULL) {
 		TAILQ_REMOVE(&deferrals, pd, pd_entry);
 		pfsync_undefer(pd, 0);
 	}
@@ -1574,6 +1574,9 @@ void
 pfsync_grab_snapshot(struct pfsync_snapshot *sn, struct pfsync_softc *sc)
 {
 	int q;
+	struct pf_state *st;
+	struct pfsync_upd_req_item *ur;
+	struct tdb *tdb;
 
 	sn->sn_sc = sc;
 
@@ -1583,14 +1586,31 @@ pfsync_grab_snapshot(struct pfsync_snapshot *sn, struct pfsync_softc *sc)
 
 	for (q = 0; q < PFSYNC_S_COUNT; q++) {
 		TAILQ_INIT(&sn->sn_qs[q]);
-		TAILQ_CONCAT(&sn->sn_qs[q], &sc->sc_qs[q], sync_list);
+
+		while ((st = TAILQ_FIRST(&sc->sc_qs[q])) != NULL) {
+			KASSERT(st->snapped == 0);
+			TAILQ_REMOVE(&sc->sc_qs[q], st, sync_list);
+			TAILQ_INSERT_TAIL(&sn->sn_qs[q], st, sync_snap);
+			st->snapped = 1;
+		}
 	}
 
 	TAILQ_INIT(&sn->sn_upd_req_list);
-	TAILQ_CONCAT(&sn->sn_upd_req_list, &sc->sc_upd_req_list, ur_entry);
+	while ((ur = TAILQ_FIRST(&sc->sc_upd_req_list)) != NULL) {
+		TAILQ_REMOVE(&sc->sc_upd_req_list, ur, ur_entry);
+		TAILQ_INSERT_TAIL(&sn->sn_upd_req_list, ur, ur_snap);
+	}
 
 	TAILQ_INIT(&sn->sn_tdb_q);
-	TAILQ_CONCAT(&sn->sn_tdb_q, &sc->sc_tdb_q, tdb_sync_entry);
+	while ((tdb = TAILQ_FIRST(&sc->sc_tdb_q)) != NULL) {
+		TAILQ_REMOVE(&sc->sc_tdb_q, tdb, tdb_sync_entry);
+		TAILQ_INSERT_TAIL(&sn->sn_tdb_q, tdb, tdb_sync_snap);
+
+		mtx_enter(&tdb->tdb_mtx);
+		KASSERT(!ISSET(tdb->tdb_flags, TDBF_PFSYNC_SNAPPED));
+		SET(tdb->tdb_flags, TDBF_PFSYNC_SNAPPED);
+		mtx_leave(&tdb->tdb_mtx);
+	}
 
 	sn->sn_len = sc->sc_len;
 	sc->sc_len = PFSYNC_MINPKT;
@@ -1606,41 +1626,40 @@ pfsync_grab_snapshot(struct pfsync_snapshot *sn, struct pfsync_softc *sc)
 }
 
 void
-pfsync_drop_snapshot(struct pfsync_snapshot *sn, struct pfsync_softc * sc)
+pfsync_drop_snapshot(struct pfsync_snapshot *sn)
 {
 	struct pf_state *st;
 	struct pfsync_upd_req_item *ur;
 	struct tdb *t;
 	int q;
 
-
 	for (q = 0; q < PFSYNC_S_COUNT; q++) {
 		if (TAILQ_EMPTY(&sn->sn_qs[q]))
 			continue;
 
 		while ((st = TAILQ_FIRST(&sn->sn_qs[q])) != NULL) {
-			TAILQ_REMOVE(&sn->sn_qs[q], st, sync_list);
-#ifdef PFSYNC_DEBUG
 			KASSERT(st->sync_state == q);
-#endif
+			KASSERT(st->snapped == 1);
+			TAILQ_REMOVE(&sn->sn_qs[q], st, sync_snap);
 			st->sync_state = PFSYNC_S_NONE;
+			st->snapped = 0;
 			pf_state_unref(st);
 		}
 	}
 
 	while ((ur = TAILQ_FIRST(&sn->sn_upd_req_list)) != NULL) {
-		TAILQ_REMOVE(&sn->sn_upd_req_list, ur, ur_entry);
+		TAILQ_REMOVE(&sn->sn_upd_req_list, ur, ur_snap);
 		pool_put(&sn->sn_sc->sc_pool, ur);
 	}
 
-	mtx_enter(&sc->sc_tdb_mtx);
 	while ((t = TAILQ_FIRST(&sn->sn_tdb_q)) != NULL) {
-		TAILQ_REMOVE(&sn->sn_tdb_q, t, tdb_sync_entry);
+		TAILQ_REMOVE(&sn->sn_tdb_q, t, tdb_sync_snap);
 		mtx_enter(&t->tdb_mtx);
+		KASSERT(ISSET(t->tdb_flags, TDBF_PFSYNC_SNAPPED));
+		CLR(t->tdb_flags, TDBF_PFSYNC_SNAPPED);
 		CLR(t->tdb_flags, TDBF_PFSYNC);
 		mtx_leave(&t->tdb_mtx);
 	}
-	mtx_leave(&sc->sc_tdb_mtx);
 }
 
 int
@@ -1667,7 +1686,7 @@ pfsync_drop(struct pfsync_softc *sc)
 	struct pfsync_snapshot	sn;
 
 	pfsync_grab_snapshot(&sn, sc);
-	pfsync_drop_snapshot(&sn, sc);
+	pfsync_drop_snapshot(&sn);
 }
 
 void
@@ -1759,7 +1778,7 @@ pfsync_sendout(void)
 	if (m == NULL) {
 		sc->sc_if.if_oerrors++;
 		pfsyncstat_inc(pfsyncs_onomem);
-		pfsync_drop_snapshot(&sn, sc);
+		pfsync_drop_snapshot(&sn);
 		return;
 	}
 
@@ -1769,7 +1788,7 @@ pfsync_sendout(void)
 			m_free(m);
 			sc->sc_if.if_oerrors++;
 			pfsyncstat_inc(pfsyncs_onomem);
-			pfsync_drop_snapshot(&sn, sc);
+			pfsync_drop_snapshot(&sn);
 			return;
 		}
 	}
@@ -1799,7 +1818,7 @@ pfsync_sendout(void)
 
 		count = 0;
 		while ((ur = TAILQ_FIRST(&sn.sn_upd_req_list)) != NULL) {
-			TAILQ_REMOVE(&sn.sn_upd_req_list, ur, ur_entry);
+			TAILQ_REMOVE(&sn.sn_upd_req_list, ur, ur_snap);
 
 			bcopy(&ur->ur_msg, m->m_data + offset,
 			    sizeof(ur->ur_msg));
@@ -1827,18 +1846,19 @@ pfsync_sendout(void)
 		subh = (struct pfsync_subheader *)(m->m_data + offset);
 		offset += sizeof(*subh);
 
-		mtx_enter(&sc->sc_tdb_mtx);
 		count = 0;
 		while ((t = TAILQ_FIRST(&sn.sn_tdb_q)) != NULL) {
-			TAILQ_REMOVE(&sn.sn_tdb_q, t, tdb_sync_entry);
+			TAILQ_REMOVE(&sn.sn_tdb_q, t, tdb_sync_snap);
 			pfsync_out_tdb(t, m->m_data + offset);
 			offset += sizeof(struct pfsync_tdb);
 			mtx_enter(&t->tdb_mtx);
+			KASSERT(ISSET(t->tdb_flags, TDBF_PFSYNC_SNAPPED));
+			CLR(t->tdb_flags, TDBF_PFSYNC_SNAPPED);
 			CLR(t->tdb_flags, TDBF_PFSYNC);
 			mtx_leave(&t->tdb_mtx);
+			tdb_unref(t);
 			count++;
 		}
-		mtx_leave(&sc->sc_tdb_mtx);
 
 		bzero(subh, sizeof(*subh));
 		subh->action = PFSYNC_ACT_TDB;
@@ -1856,11 +1876,11 @@ pfsync_sendout(void)
 
 		count = 0;
 		while ((st = TAILQ_FIRST(&sn.sn_qs[q])) != NULL) {
-			TAILQ_REMOVE(&sn.sn_qs[q], st, sync_list);
-#ifdef PFSYNC_DEBUG
+			TAILQ_REMOVE(&sn.sn_qs[q], st, sync_snap);
 			KASSERT(st->sync_state == q);
-#endif
+			KASSERT(st->snapped == 1);
 			st->sync_state = PFSYNC_S_NONE;
+			st->snapped = 0;
 			pfsync_qs[q].write(st, m->m_data + offset);
 			offset += pfsync_qs[q].len;
 
@@ -1916,9 +1936,7 @@ pfsync_insert_state(struct pf_state *st)
 	    ISSET(st->state_flags, PFSTATE_NOSYNC))
 		return;
 
-#ifdef PFSYNC_DEBUG
 	KASSERT(st->sync_state == PFSYNC_S_NONE);
-#endif
 
 	if (sc->sc_len == PFSYNC_MINPKT)
 		timeout_add_sec(&sc->sc_tmo, 1);
@@ -2403,16 +2421,15 @@ pfsync_q_ins(struct pf_state *st, int q)
 	struct pfsync_softc *sc = pfsyncif;
 	size_t nlen, sclen;
 
-#if defined(PFSYNC_DEBUG)
 	if (sc->sc_len < PFSYNC_MINPKT)
 		panic("pfsync pkt len is too low %zd", sc->sc_len);
-#endif
 	do {
 		mtx_enter(&sc->sc_st_mtx);
 
 		/*
-		 * If two threads are competing to insert the same state, then
-		 * there must be just single winner.
+		 * There are either two threads trying to update the
+		 * same state, or the state is just being processed
+		 * (is on snapshot queue).
 		 */
 		if (st->sync_state != PFSYNC_S_NONE) {
 			mtx_leave(&sc->sc_st_mtx);
@@ -2450,6 +2467,15 @@ pfsync_q_del(struct pf_state *st)
 
 	mtx_enter(&sc->sc_st_mtx);
 	q = st->sync_state;
+	/*
+	 * re-check under mutex
+	 * if state is snapped already, then just bail out, because we came
+	 * too late, the state is being just processed/dispatched to peer.
+	 */
+	if ((q == PFSYNC_S_NONE) || (st->snapped)) {
+		mtx_leave(&sc->sc_st_mtx);
+		return;
+	}
 	atomic_sub_long(&sc->sc_len, pfsync_qs[q].len);
 	TAILQ_REMOVE(&sc->sc_qs[q], st, sync_list);
 	if (TAILQ_EMPTY(&sc->sc_qs[q]))
@@ -2495,6 +2521,7 @@ pfsync_update_tdb(struct tdb *t, int output)
 		}
 
 		TAILQ_INSERT_TAIL(&sc->sc_tdb_q, t, tdb_sync_entry);
+		tdb_ref(t);
 		SET(t->tdb_flags, TDBF_PFSYNC);
 		mtx_leave(&t->tdb_mtx);
 
@@ -2525,7 +2552,17 @@ pfsync_delete_tdb(struct tdb *t)
 
 	mtx_enter(&sc->sc_tdb_mtx);
 
+	/*
+	 * if tdb entry is just being processed (found in snapshot),
+	 * then it can not be deleted. we just came too late
+	 */
+	if (ISSET(t->tdb_flags, TDBF_PFSYNC_SNAPPED)) {
+		mtx_leave(&sc->sc_tdb_mtx);
+		return;
+	}
+
 	TAILQ_REMOVE(&sc->sc_tdb_q, t, tdb_sync_entry);
+
 	mtx_enter(&t->tdb_mtx);
 	CLR(t->tdb_flags, TDBF_PFSYNC);
 	mtx_leave(&t->tdb_mtx);
@@ -2536,6 +2573,8 @@ pfsync_delete_tdb(struct tdb *t)
 	atomic_sub_long(&sc->sc_len, nlen);
 
 	mtx_leave(&sc->sc_tdb_mtx);
+
+	tdb_unref(t);
 }
 
 void
diff --git a/sys/net/pfvar.h b/sys/net/pfvar.h
index bd7ec1d88e7..558618a0f14 100644
--- a/sys/net/pfvar.h
+++ b/sys/net/pfvar.h
@@ -749,6 +749,7 @@ struct pf_state {
 	u_int8_t		 pad[3];
 
 	TAILQ_ENTRY(pf_state)	 sync_list;
+	TAILQ_ENTRY(pf_state)	 sync_snap;
 	TAILQ_ENTRY(pf_state)	 entry_list;
 	SLIST_ENTRY(pf_state)	 gc_list;
 	RB_ENTRY(pf_state)	 entry_id;
@@ -797,6 +798,7 @@ struct pf_state {
 	pf_refcnt_t		 refcnt;
 	u_int16_t		 delay;
 	u_int8_t		 rt;
+	u_int8_t		 snapped;
 };
 
 /*
diff --git a/sys/netinet/ip_ipsp.h b/sys/netinet/ip_ipsp.h
index c697994047b..fa46a1e1282 100644
--- a/sys/netinet/ip_ipsp.h
+++ b/sys/netinet/ip_ipsp.h
@@ -355,6 +355,7 @@ struct tdb {				/* tunnel descriptor block */
 #define	TDBF_PFSYNC		0x40000	/* TDB will be synced */
 #define	TDBF_PFSYNC_RPL		0x80000	/* Replay counter should be bumped */
 #define	TDBF_ESN		0x100000 /* 64-bit sequence numbers (ESN) */
+#define	TDBF_PFSYNC_SNAPPED	0x200000 /* entry is being dispatched to peer */
 
 #define TDBF_BITS ("\20" \
 	"\1UNIQUE\2TIMER\3BYTES\4ALLOCATIONS" \
@@ -439,6 +440,7 @@ struct tdb {				/* tunnel descriptor block */
 	TAILQ_HEAD(tdb_policy_head, ipsec_policy)	tdb_policy_head; /* [p] */
 
 	TAILQ_ENTRY(tdb)	tdb_sync_entry;
+	TAILQ_ENTRY(tdb)	tdb_sync_snap;
 };
 
 enum tdb_counters {