On 11/14/17 5:22 AM, Sowmini Varadhan wrote:
A few questions.
- First off, why am I not seeing the original mail in this thread
even when I search the mail archives, e.g.,
https://lkml.org/lkml/2017/11/13/954
- Girish Moodalbail writes:
The issue here is that we are trying to reference a network namespace
(struct net *) that is long gone (i.e., L532 below -- c_net is the culprit).
The netns is not "long gone", we are still processing
the NETDEV_UNREGISTER_FINAL for loopback.
Obviously, I was not talking about the current namespace.
Say there are two namespaces - ns1 and ns2 and that both have RDS connections.
Deletion of ns1 will be fine. However when ns2 is being deleted, in the
rds_tcp_dev_event() callback we walk through the global list and some nodes in
that list will be referring to ns1 (that is "long gone"). If you read my earlier
email, I was talking about ns1 which is already gone, and we are trying to
access it from ns2.
~Girish
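
To make the two-namespace scenario above concrete, here is a minimal user-space sketch. It assumes, as Girish describes, that a connection owned by ns1 is still linked on the global list when ns2 is torn down. The names fake_net, fake_conn, alive, count_stale_visits and demo_two_namespaces are hypothetical stand-ins and not the RDS code; the alive flag only models a namespace that has been freed, so the sketch itself never touches freed memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net. */
struct fake_net {
    int alive;                /* 0 once the namespace has been torn down */
};

/* Hypothetical stand-in for a connection on the single global list. */
struct fake_conn {
    struct fake_net *c_net;   /* owning namespace */
    struct fake_conn *next;   /* link in the global list */
};

/*
 * Walk the global list on behalf of 'net' and count the nodes that
 * belong to some other namespace that is already gone. In the kernel,
 * looking at c_net on such a node would be a use-after-free; here the
 * memory stays valid and the flag merely models the dead namespace.
 */
int count_stale_visits(const struct fake_conn *head, const struct fake_net *net)
{
    int stale = 0;

    assert(net != NULL);
    for (const struct fake_conn *c = head; c != NULL; c = c->next) {
        if (c->c_net != net && !c->c_net->alive)
            stale++;
    }
    return stale;
}

/* ns1 is deleted first, then the walker runs for ns2. */
int demo_two_namespaces(void)
{
    struct fake_net ns1 = { .alive = 1 };
    struct fake_net ns2 = { .alive = 1 };
    struct fake_conn c2 = { &ns2, NULL };
    struct fake_conn c1 = { &ns1, &c2 };   /* global list: c1 -> c2 */

    ns1.alive = 0;   /* assumption: c1 was left behind on the global list */
    return count_stale_visits(&c1, &ns2);
}
```

If the connections of ns1 had been removed from the list when ns1 went away, the walker for ns2 would report no stale nodes, which is why the thread turns on whether that cleanup actually happens.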
As I said in my
earlier mail, the idea is to extract the list of unique conns
that belong to the netns and then destroy both the conn, and
all associated paths. Thus there can only be a single thread
going through rds_tcp_kill_sock at any time (since we should
only get the unregister_final/loopback one time for the netns).
(See also the comment block in rds_tcp_dev_event about network activity
quiescing). Thus there should be no concurrency issue.
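
The intended flow described above (pull the conns that belong to the netns off the shared list, then destroy them) can be sketched in user-space C roughly as follows. This is a simplified illustration under stated assumptions, not the rds_tcp_kill_sock code: fake_net, fake_conn, push_conn, extract_conns_for_net, destroy_conns and list_len are hypothetical names, each node stands for one conn (the per-path bookkeeping and the uniqueness check are omitted), and no locking is shown because only a single thread is assumed to run this for a given netns.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct net and a connection. */
struct fake_net { int id; };

struct fake_conn {
    struct fake_net *c_net;   /* owning namespace */
    struct fake_conn *next;   /* link in the shared list */
};

/* Allocate a conn for 'net' and push it on the front of 'head'. */
struct fake_conn *push_conn(struct fake_conn *head, struct fake_net *net)
{
    struct fake_conn *c = malloc(sizeof(*c));

    assert(c != NULL);
    c->c_net = net;
    c->next = head;
    return c;
}

/*
 * Step 1: unlink every conn owned by 'net' from the shared list and
 * return them on a private list. Conns of other namespaces are only
 * compared by pointer and are left in place.
 */
struct fake_conn *extract_conns_for_net(struct fake_conn **global,
                                        const struct fake_net *net)
{
    struct fake_conn *mine = NULL;
    struct fake_conn **pp = global;

    assert(global != NULL && net != NULL);
    while (*pp != NULL) {
        struct fake_conn *c = *pp;

        if (c->c_net == net) {
            *pp = c->next;    /* unlink from the shared list */
            c->next = mine;   /* push onto the private list */
            mine = c;
        } else {
            pp = &c->next;
        }
    }
    return mine;
}

/* Step 2: destroy everything on a list; returns how many were freed. */
int destroy_conns(struct fake_conn *list)
{
    int n = 0;

    while (list != NULL) {
        struct fake_conn *next = list->next;

        free(list);
        list = next;
        n++;
    }
    return n;
}

/* Number of nodes on a list. */
int list_len(const struct fake_conn *head)
{
    int n = 0;

    for (; head != NULL; head = head->next)
        n++;
    return n;
}
```

With two conns for ns1 and one for ns2 on the shared list, extracting for ns1 leaves only the ns2 conn behind, so a later teardown of ns2 would find nothing that points at ns1.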
However, when I just checked this, there may be some rds connection
refcounting bug. When I quickly tested this, I'm not seeing the
expected calls to conn_path_destroy. I'll need some time to take
a look; this has been known to work, so something got broken along
the way.
I think we should move away from global list to a per-namespace list. The
global list is used only in two places (both of which are per-namespace
operations):
let's first understand the real root-cause before we start
redesigning data-structures.
--Sowmini