On Tue, May 5, 2020 at 1:25 PM Yonghong Song <[email protected]> wrote:
>
>
>
> On 5/5/20 12:56 PM, Andrii Nakryiko wrote:
> > On Sun, May 3, 2020 at 11:26 PM Yonghong Song <[email protected]> wrote:
> >>
> >> bpf iterator uses seq_file to provide a lossless
> >> way to transfer data to user space. But we want to call
> >> bpf program after all objects have been traversed, and
> >> bpf program may write additional data to the
> >> seq_file buffer. The current seq_read() does not work
> >> for this use case.
> >>
> >> Besides allowing stop() function to write to the buffer,
> >> the bpf_seq_read() also fixed the buffer size to one page.
> >> If any single call of show() or stop() will emit data
> >> more than one page to cause overflow, -E2BIG error code
> >> will be returned to user space.
> >>
> >> Signed-off-by: Yonghong Song <[email protected]>
> >> ---
> >> kernel/bpf/bpf_iter.c | 128 ++++++++++++++++++++++++++++++++++++++++++
> >> 1 file changed, 128 insertions(+)
> >>
> >> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
> >> index 05ae04ac1eca..2674c9cbc3dc 100644
> >> --- a/kernel/bpf/bpf_iter.c
> >> +++ b/kernel/bpf/bpf_iter.c
> >> @@ -26,6 +26,134 @@ static DEFINE_MUTEX(targets_mutex);
> >> /* protect bpf_iter_link changes */
> >> static DEFINE_MUTEX(link_mutex);
> >>
> >> +/* bpf_seq_read, a customized and simpler version for bpf iterator.
> >> + * no_llseek is assumed for this file.
> >> + * The following are differences from seq_read():
> >> + * . fixed buffer size (PAGE_SIZE)
> >> + * . assuming no_llseek
> >> + * . stop() may call bpf program, handling potential overflow there
> >> + */
> >> +static ssize_t bpf_seq_read(struct file *file, char __user *buf, size_t
> >> size,
> >> + loff_t *ppos)
> >> +{
> >> + struct seq_file *seq = file->private_data;
> >> + size_t n, offs, copied = 0;
> >> + int err = 0;
> >> + void *p;
> >> +
> >> + mutex_lock(&seq->lock);
> >> +
> >> + if (!seq->buf) {
> >> + seq->size = PAGE_SIZE;
> >> + seq->buf = kmalloc(seq->size, GFP_KERNEL);
> >> + if (!seq->buf)
> >> + goto Enomem;
> >
> > Why not just mutex_unlock and exit with -ENOMEM? Less goto'ing, more
> > straightforward.
> >
> >> + }
> >> +
> >> + if (seq->count) {
> >> + n = min(seq->count, size);
> >> + err = copy_to_user(buf, seq->buf + seq->from, n);
> >> + if (err)
> >> + goto Efault;
> >> + seq->count -= n;
> >> + seq->from += n;
> >> + copied = n;
> >> + goto Done;
> >> + }
> >> +
> >> + seq->from = 0;
> >> + p = seq->op->start(seq, &seq->index);
> >> + if (!p || IS_ERR(p))
> >
> > IS_ERR_OR_NULL?
>
> Ack.
>
> >
> >> + goto Stop;
> >> +
> >> + err = seq->op->show(seq, p);
> >> + if (seq_has_overflowed(seq)) {
> >> + err = -E2BIG;
> >> + goto Error_show;
> >> + } else if (err) {
> >> + /* < 0: go out, > 0: skip */
> >> + if (likely(err < 0))
> >> + goto Error_show;
> >> + seq->count = 0;
> >> + }
> >
> > This seems a bit more straightforward:
> >
> > if (seq_has_overflowed(seq))
> > err = -E2BIG;
> > if (err < 0)
> > goto Error_show;
> > else if (err > 0)
> > seq->count = 0;
> >
> > Also, I wonder if err > 0 (so skip was requested), should we ignore
> > overflow? So something like:
>
> Think about overflow vs. err > 0 case, I double checked seq_file()
> implementation again, yes, it is skipped. So your suggestion below
> looks reasonable.
>
> >
> > if (err > 0) {
> > seq->count = 0;
> > } else {
> > if (seq_has_overflowed(seq))
> > err = -E2BIG;
> > if (err)
> > goto Error_show;
> > }
> >
> >> +
> >> + while (1) {
> >> + loff_t pos = seq->index;
> >> +
> >> + offs = seq->count;
> >> + p = seq->op->next(seq, p, &seq->index);
> >> + if (pos == seq->index) {
> >> + pr_info_ratelimited("buggy seq_file .next function
> >> %ps "
> >> + "did not updated position index\n",
> >> + seq->op->next);
> >> + seq->index++;
> >> + }
> >> +
> >> + if (!p || IS_ERR(p)) {
> >
> > Same, IS_ERR_OR_NULL.
>
> Ack.
>
> >
> >> + err = PTR_ERR(p);
> >> + break;
> >> + }
> >> + if (seq->count >= size)
> >> + break;
> >> +
> >> + err = seq->op->show(seq, p);
> >> + if (seq_has_overflowed(seq)) {
> >> + if (offs == 0) {
> >> + err = -E2BIG;
> >> + goto Error_show;
> >> + }
> >> + seq->count = offs;
> >> + break;
> >> + } else if (err) {
> >> + /* < 0: go out, > 0: skip */
> >> + seq->count = offs;
> >> + if (likely(err < 0)) {
> >> + if (offs == 0)
> >> + goto Error_show;
> >> + break;
> >> + }
> >> + }
> >
> > Same question here about ignoring overflow if skip was requested.
>
> Yes, we should prioritize err > 0 over overflow.
>
> >
> >> + }
> >> +Stop:
> >> + offs = seq->count;
> >> + /* may call bpf program */
> >> + seq->op->stop(seq, p);
> >> + if (seq_has_overflowed(seq)) {
> >> + if (offs == 0)
> >> + goto Error_stop;
> >> + seq->count = offs;
> >
> > just want to double-check, because it's not clear from the code. If
> > all the start()/show()/next() succeeded, but stop() overflown. Would
> > stop() be called again on subsequent read? Would start/show/next
> > handle this correctly as well?
>
> I am supposed to handle this unless there is a bug...
> The idea is:
> - if start()/show()/next() is fine and stop() overflow,
> we will skip stop() output and move on.
> (if we found out, we skip to the beginning of the
> buffer, we will return -E2BIG. Otherwise, we will return
> 0 here, the user read() may just exit.)
> - next time, when read() called again, the start() will return
> NULL (since previous next() returns NULL) and the control
> will jump to stop(), which will try to do another dump().
>
Right, sounds reasonable :)
> >
> >> + }
> >> +
> >> + n = min(seq->count, size);
> >> + err = copy_to_user(buf, seq->buf, n);
> >> + if (err)
> >> + goto Efault;
> >> + copied = n;
> >> + seq->count -= n;
> >> + seq->from = n;
> >> +Done:
> >> + if (!copied)
> >> + copied = err;
> >> + else
> >> + *ppos += copied;
> >> + mutex_unlock(&seq->lock);
> >> + return copied;
> >> +
> >> +Error_show:
> >> + seq->op->stop(seq, p);
> >> +Error_stop:
> >> + seq->count = 0;
> >> + goto Done;
> >> +
> >> +Enomem:
> >> + err = -ENOMEM;
> >> + goto Done;
> >> +
> >> +Efault:
> >> + err = -EFAULT;
> >> + goto Done;
> >
> > Enomem and Efault seem completely redundant and just add goto
> > complexity to this algorithm. Let's just inline `err =
> > -E(NOMEM|FAULT); goto Done;` instead?
>
> We can do this. This is kind of original seq_read() coding
> style. Agree that we do not need to follow them.
>
> >
> >> +}
> >> +
> >> int bpf_iter_reg_target(struct bpf_iter_reg *reg_info)
> >> {
> >> struct bpf_iter_target_info *tinfo;
> >> --
> >> 2.24.1
> >>