On 06/18/2017 04:39 AM, Lawrence Brakmo wrote:
On 6/16/17, 6:58 AM, "Daniel Borkmann" <dan...@iogearbox.net> wrote:
[...]
> /* Change congestion control for socket */
> -int tcp_set_congestion_control(struct sock *sk, const char *name)
> +int tcp_set_congestion_control(struct sock *sk, const char *name, bool
load)
> {
> struct inet_connection_sock *icsk = inet_csk(sk);
> const struct tcp_congestion_ops *ca;
> @@ -344,7 +344,10 @@ int tcp_set_congestion_control(struct sock *sk,
const char *name)
> return -EPERM;
>
> rcu_read_lock();
> - ca = __tcp_ca_find_autoload(name);
> + if (!load)
> + ca = tcp_ca_find(name);
> + else
> + ca = __tcp_ca_find_autoload(name);
From BPF program side, we call with !load since we're not allowed
to sleep under RCU, that's correct ...
> /* No change asking for existing value */
> if (ca == icsk->icsk_ca_ops) {
> icsk->icsk_ca_setsockopt = 1;
> @@ -352,8 +355,10 @@ int tcp_set_congestion_control(struct sock *sk,
const char *name)
> }
> if (!ca)
> err = -ENOENT;
> + else if (!load)
> + icsk->icsk_ca_ops = ca;
... but don't we also need to hold a module ref in this case as done
below?
Meaning, tcp_ca_find() could return a ca that was previously loaded
to the tcp_cong_list as module, then resulting in ref count imbalance
when set from BPF?
As I mentioned above, this can be called before congestion has been
initialized (op <= BPF_SOCKET_OPS_NEEDS_ECN) in which case
tcp_init_congestion_control will be called later. If op > ..OPS_NEEDS_ECN
then bpf_setsockopt() will call the reinit_congestion_control().
But this points to an issue where someone else could call
tcp_set_congestion_control() with load == false not knowing they
need to call either init or reinit. I will add a comment to the function
to make it clear.
Hm, I'm not sure it answers my question. What I meant was that from BPF
prog, you're setting tcp_set_congestion_control(..., false) so if
tcp_ca_find() returns a ca that was loaded earlier as a from a module
(so it becomes available in tcp_cong_list), the above...
[...]
else if (!load)
icsk->icsk_ca_ops = ca;
[...]
... will basically prevent the later try_module_get() on the ca. So any
later tcp_reinit_congestion_control() or tcp_init_congestion_control()
will still run not having the refcount held on the owner module. Meaning
a module unload would let the machine crash due to the refcnt imbalance?
What am I missing?