On Fri, 22 Feb 2019 15:59:30 -0800 Andy Lutomirski <[email protected]> wrote:
> On Fri, Feb 22, 2019 at 3:02 PM Jann Horn <[email protected]> wrote: > > > > On Fri, Feb 22, 2019 at 11:39 PM Nadav Amit <[email protected]> wrote: > > > > On Feb 22, 2019, at 2:21 PM, Nadav Amit <[email protected]> wrote: > > > > > > > >> On Feb 22, 2019, at 2:17 PM, Jann Horn <[email protected]> wrote: > > > >> > > > >> On Fri, Feb 22, 2019 at 11:08 PM Nadav Amit <[email protected]> wrote: > > > >>>> On Feb 22, 2019, at 1:43 PM, Jann Horn <[email protected]> wrote: > > > >>>> > > > >>>> (adding some people from the text_poke series to the thread, > > > >>>> removing stable@) > > > >>>> > > > >>>> On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski > > > >>>> <[email protected]> wrote: > > > >>>>>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov > > > >>>>>> <[email protected]> wrote: > > > >>>>>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote: > > > >>>>>>> On Fri, 22 Feb 2019 11:27:05 -0800 > > > >>>>>>> Alexei Starovoitov <[email protected]> wrote: > > > >>>>>>> > > > >>>>>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote: > > > >>>>>>>>> > > > >>>>>>>>> Then we should still probably fix up "__probe_kernel_read()" to > > > >>>>>>>>> not > > > >>>>>>>>> allow user accesses. The easiest way to do that is actually > > > >>>>>>>>> likely to > > > >>>>>>>>> use the "unsafe_get_user()" functions *without* doing a > > > >>>>>>>>> uaccess_begin(), which will mean that modern CPU's will simply > > > >>>>>>>>> fault > > > >>>>>>>>> on a kernel access to user space. > > > >>>>>>>> > > > >>>>>>>> On bpf side the bpf_probe_read() helper just calls > > > >>>>>>>> probe_kernel_read() > > > >>>>>>>> and users pass both user and kernel addresses into it and expect > > > >>>>>>>> that the helper will actually try to read from that address. > > > >>>>>>>> > > > >>>>>>>> If __probe_kernel_read will suddenly start failing on all user > > > >>>>>>>> addresses > > > >>>>>>>> it will break the expectations. > > > >>>>>>>> How do we solve it in bpf_probe_read? > > > >>>>>>>> Call probe_kernel_read and if that fails call unsafe_get_user > > > >>>>>>>> byte-by-byte > > > >>>>>>>> in the loop? > > > >>>>>>>> That's doable, but people already complain that bpf_probe_read() > > > >>>>>>>> is slow > > > >>>>>>>> and shows up in their perf report. > > > >>>>>>> > > > >>>>>>> We're changing kprobes to add a specific flag to say that we want > > > >>>>>>> to > > > >>>>>>> differentiate between kernel or user reads. Can this be done with > > > >>>>>>> bpf_probe_read()? If it's showing up in perf report, I doubt a > > > >>>>>>> single > > > >>>>>> > > > >>>>>> so you're saying you will break existing kprobe scripts? > > > >>>>>> I don't think it's a good idea. > > > >>>>>> It's not acceptable to break bpf_probe_read uapi. > > > >>>>> > > > >>>>> If so, the uapi is wrong: a long-sized number does not reliably > > > >>>>> identify an address if you don’t separately know whether it’s a > > > >>>>> user or kernel address. s390x and 4G:4G x86_32 are the notable > > > >>>>> exceptions. I have lobbied for RISC-V and future x86_64 to join the > > > >>>>> crowd. I don’t know whether I’ll win this fight, but the uapi will > > > >>>>> probably have to change for at least s390x. > > > >>>>> > > > >>>>> What to do about existing scripts is a different question. > > > >>>> > > > >>>> This lack of logical separation between user and kernel addresses > > > >>>> might interact interestingly with the text_poke series, specifically > > > >>>> "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for > > > >>>> patching" > > > >>>> (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-6-rick.p.edgecombe%40intel.com%2F&data=02%7C01%7Cnamit%40vmware.com%7Cf2513009ef734ecd6b0d08d69913a5ae%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864707020821793&sdata=HAbnDcrBne64JyPuVUMKmM7nQk67F%2BFvjuXEn8TmHeo%3D&reserved=0) > > > >>>> and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text > > > >>>> poking" > > > >>>> (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-7-rick.p.edgecombe%40intel.com%2F&data=02%7C01%7Cnamit%40vmware.com%7Cf2513009ef734ecd6b0d08d69913a5ae%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864707020821793&sdata=vNRIMKtFDy%2F3z5FlTwDiJY6VGEV%2FMHgQPTdFSFtCo4s%3D&reserved=0), > > > >>>> right? If someone manages to get a tracing BPF program to trigger in > > > >>>> a > > > >>>> task that has switched to the patching mm, could they use > > > >>>> bpf_probe_write_user() - which uses probe_kernel_write() after > > > >>>> checking that KERNEL_DS isn't active and that access_ok() passes - to > > > >>>> overwrite kernel text that is mapped writable in the patching mm? > > > >>> > > > >>> Yes, this is a good point. I guess text_poke() should be defined with > > > >>> “__kprobes” and open-code memcpy. > > > >>> > > > >>> Does it sound reasonable? > > > >> > > > >> Doesn't __text_poke() as implemented in the proposed patch use a > > > >> couple other kernel functions, too? Like switch_mm_irqs_off() and > > > >> pte_clear() (which can be a call into a separate function on paravirt > > > >> kernels)? > > > > > > > > I will move the pte_clear() to be done after the poking mm was unloaded. > > > > Give me a few minutes to send a sketch of what I think should be done. > > > > > > Err.. You are right, I don’t see an easy way of preventing a kprobe from > > > being set on switch_mm_irqs_off(), and open-coding this monster is too > > > ugly. > > > > > > The reasonable solution seems to me as taking all the relevant pieces of > > > code (and data) that might be used during text-poking and encapsulating > > > them, so they > > > will be set in a memory area which cannot be kprobe'd. This can also be > > > useful to write-protect data structures of code that calls text_poke(), > > > e.g., static-keys. It can also protect data on that stack that is used > > > during text_poke() from being overwritten from another core. > > > > > > This solution is somewhat similar to Igor Stoppa’s idea of using > > > “enclaves” > > > when doing write-rarely operations. > > > > > > Right now, I think that text_poke() will keep being susceptible to such > > > an attack, unless you have a better suggestion. > > > > A relatively simple approach might be to teach BPF not to run kprobe > > programs and such in contexts where current->mm isn't the active mm? > > Maybe using nmi_uaccess_okay(), or something like that? It looks like > > perf_callchain_user() also already uses that. Except that a lot of > > this code is x86-specific... > > This sounds like exactly the right solution. If you're running from > some unknown context (like NMI or tracing), then you should check > nmi_uaccess_okay(). I think we should just promote that to be a > non-arch-specific function (that returns true by default) and check it > the relevant bpf_probe_xyz() functions. This treat may also need for my work, like probe_user_read() we should fail if nmi_uaccess_okay(). Thank you, > > Alexei, does that seem reasonable? -- Masami Hiramatsu <[email protected]>
