Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-10-21 Thread Steven Price
On 09/09/2024 10:46, Kirill A. Shutemov wrote:
> On Thu, Sep 05, 2024 at 10:26:52AM -0700, Charlie Jenkins wrote:
>> On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote:
>>> On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote:
>>>> Some applications rely on placing data in the free bits of addresses
>>>> allocated by mmap. Various architectures (e.g. x86, arm64, powerpc)
>>>> restrict the address returned by mmap to be less than the 48-bit
>>>> address space, unless the hint address uses more than 47 bits (the
>>>> 48th bit is reserved for the kernel address space).
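
As an aside for readers: the pattern being described looks roughly like
the sketch below, which stashes metadata in the otherwise-unused top
bits of a pointer and masks it off before dereferencing. The helper
names and tag layout are invented for illustration, and the whole
scheme assumes user addresses fit in 48 bits - exactly the assumption
sv57 breaks:

  #include <stdint.h>

  #define TAG_SHIFT 48                      /* assumes a 48-bit user VA */
  #define TAG_MASK  ((uintptr_t)0xffff << TAG_SHIFT)

  static inline void *ptr_set_tag(void *p, uint16_t tag)
  {
          return (void *)(((uintptr_t)p & ~TAG_MASK) |
                          ((uintptr_t)tag << TAG_SHIFT));
  }

  static inline void *ptr_clear_tag(void *p)
  {
          return (void *)((uintptr_t)p & ~TAG_MASK);
  }

If mmap() ever returns an address with bits 48-56 set, ptr_clear_tag()
silently corrupts the pointer - which is the breakage described above.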

>>>> The riscv architecture needs a way to similarly restrict the virtual
>>>> address space. On the riscv port of OpenJDK, an error is thrown if it
>>>> is run on the 57-bit address space, called sv57 [1]. golang has a
>>>> comment that sv57 support is not complete, but there are some
>>>> workarounds to get it to mostly work [2].
> 
> I also saw libmozjs crashing with 57-bit address space on x86.
> 
>>>> These applications work on x86 because x86 does an implicit 47-bit
>>>> restriction of mmap() addresses when the caller supplies a hint
>>>> address that is less than 48 bits.
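
For readers unfamiliar with the x86 behaviour referenced here, a small
test demonstrates it (a sketch - the exact addresses returned will
vary, and the high hint only takes effect on 5-level-paging systems):

  #include <stdio.h>
  #include <sys/mman.h>

  int main(void)
  {
          /* No hint: the kernel keeps the mapping below 2^47. */
          void *lo = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          /* A hint above the 47-bit boundary opts in to the full VA. */
          void *hi = mmap((void *)(1UL << 48), 4096,
                          PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          printf("default:   %p\nhigh hint: %p\n", lo, hi);
          return 0;
  }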

>>>> Instead of implicitly restricting the address space on riscv (or any
>>>> current/future architecture), a flag would allow users to opt in to
>>>> this behavior rather than opt out as is done on other architectures.
>>>> This is desirable because only a small class of applications does
>>>> pointer masking.
> 
> You reiterate the argument about "small class of applications". But it
> makes no sense to me.

Sorry to chime in late on this - I had been considering implementing
something like MAP_BELOW_HINT and found this thread.

While the examples of applications that want to use high VA bits and
get bitten by future upgrades are not very persuasive on their own,
it's worth pointing out that there are a variety of somewhat horrid
hacks out there to work around this feature not existing.

E.g. from my brief research into other code:

  * Box64 seems to have a custom allocator based on reading 
/proc/self/maps to allocate a block of VA space with a low enough 
address [1]

  * PHP has code reading /proc/self/maps - I think this is to find a 
segment which is close enough to the text segment [2]

  * FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48-bit
addresses [3][4]

  * pmdk has some funky code to find the lowest address that meets
certain requirements - this does look like an ASLR alternative and
probably couldn't directly use MAP_BELOW_HINT, although maybe this
suggests we need a mechanism to map within a VA range? [5]

  * MIT-Scheme parses /proc/self/maps to find the lowest mapping within 
a range [6]

  * LuaJIT 'probes' candidate addresses to find a suitable low address
for allocation [7] - a sketch of this probing approach follows below
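
To give a flavour of what these workarounds involve, here is a minimal
sketch of the probing approach (the function name, probe stride and
floor are invented; MAP_FIXED_NOREPLACE needs Linux 4.17+ and fails
with EEXIST instead of clobbering an existing mapping). The
/proc/self/maps parsers above amount to the same search, done by
reading the mapping list instead of probing:

  #define _GNU_SOURCE
  #include <stddef.h>
  #include <stdint.h>
  #include <sys/mman.h>

  static void *mmap_probe_below(size_t size, uintptr_t limit)
  {
          uintptr_t hint;

          /* Walk candidate addresses downwards from the limit until
           * the kernel accepts one. */
          for (hint = limit - size; hint > (1UL << 24);
               hint -= (1UL << 26)) {
                  void *p = mmap((void *)hint, size,
                                 PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS |
                                 MAP_FIXED_NOREPLACE, -1, 0);
                  if (p != MAP_FAILED)
                          return p;
          }
          return NULL;
  }

e.g. mmap_probe_below(1UL << 20, 1UL << 47) for 1MiB below 2^47.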

The biggest benefit I see of MAP_BELOW_HINT is that it would allow a
library to get low addresses without causing any problems for the rest
of the application. The use case I'm looking at is in a library and 
therefore a personality mode wouldn't be appropriate (because I don't 
want to affect the rest of the application). Reading /proc/self/maps
is also problematic because other threads could be allocating/freeing
at the same time.
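
For concreteness, the library-side usage I have in mind would look
something like the sketch below. MAP_BELOW_HINT is only proposed in
this series - the flag value here is invented, and this shows the
intended semantic, not existing UAPI:

  #include <stddef.h>
  #include <stdint.h>
  #include <sys/mman.h>

  #ifndef MAP_BELOW_HINT
  #define MAP_BELOW_HINT 0x8000000    /* invented value, not real UAPI */
  #endif

  static void *alloc_below_47bit(size_t size)
  {
          /* Every byte of the mapping would lie below the 2^47 hint,
           * with no process-wide personality switch and no race
           * against other threads mapping concurrently. */
          return mmap((void *)(1UL << 47), size,
                      PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_BELOW_HINT,
                      -1, 0);
  }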

Thanks,
Steve


[1] https://sources.debian.org/src/box64/0.3.0+dfsg-1/src/custommem.c/
[2] 
https://sources.debian.org/src/php8.2/8.2.24-1/ext/opcache/shared_alloc_mmap.c/#L62
[3] https://github.com/FEX-Emu/FEX/blob/main/FEXCore/Source/Utils/Allocator.cpp
[4] 
https://github.com/FEX-Emu/FEX/commit/df2f1ad074e5cdfb19a0bd4639b7604f777fb05c
[5] 
https://sources.debian.org/src/pmdk/1.13.1-1.1/src/common/mmap_posix.c/?hl=29#L29
[6] https://sources.debian.org/src/mit-scheme/12.1-3/src/microcode/ux.c/#L826
[7] 
https://sources.debian.org/src/luajit/2.1.0+openresty20240815-1/src/lj_alloc.c/

> With the full address space by default, this small class of
> applications is going to *break* unless they handle the RISC-V case
> specifically.
> 
> On the other hand, if you limit VA to 128TiB by default (like many
> architectures do [1]), everything would work without intervention.
> And if an app needs a wider address space, it would get it with a hint
> opt-in, because that is required on x86-64 anyway. Again, no
> RISC-V-specific code.
> 
> I see no upside to your approach. Just a worse user experience.
> 
> [1] See va_high_addr_switch test case in 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/mm/Makefile#n115
> 




Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-10-21 Thread Liam R. Howlett
* Steven Price  [241021 09:23]:
> On 09/09/2024 10:46, Kirill A. Shutemov wrote:
> > On Thu, Sep 05, 2024 at 10:26:52AM -0700, Charlie Jenkins wrote:
> >> On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote:
> >>> On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote:
> >>>> Some applications rely on placing data in the free bits of
> >>>> addresses allocated by mmap. Various architectures (e.g. x86,
> >>>> arm64, powerpc) restrict the address returned by mmap to be less
> >>>> than the 48-bit address space, unless the hint address uses more
> >>>> than 47 bits (the 48th bit is reserved for the kernel address
> >>>> space).
>
> >>>> The riscv architecture needs a way to similarly restrict the
> >>>> virtual address space. On the riscv port of OpenJDK, an error is
> >>>> thrown if it is run on the 57-bit address space, called sv57 [1].
> >>>> golang has a comment that sv57 support is not complete, but there
> >>>> are some workarounds to get it to mostly work [2].
> > 
> > I also saw libmozjs crashing with 57-bit address space on x86.
> > 
> >>>> These applications work on x86 because x86 does an implicit
> >>>> 47-bit restriction of mmap() addresses when the caller supplies
> >>>> a hint address that is less than 48 bits.
> 
> >>>> Instead of implicitly restricting the address space on riscv (or
> >>>> any current/future architecture), a flag would allow users to opt
> >>>> in to this behavior rather than opt out as is done on other
> >>>> architectures. This is desirable because only a small class of
> >>>> applications does pointer masking.
> > 
> > You reiterate the argument about "small class of applications". But it
> > makes no sense to me.
> 
> Sorry to chime in late on this - I had been considering implementing
> something like MAP_BELOW_HINT and found this thread.
> 
> While the examples of applications that want to use high VA bits and
> get bitten by future upgrades are not very persuasive on their own,
> it's worth pointing out that there are a variety of somewhat horrid
> hacks out there to work around this feature not existing.
> 
> E.g. from my brief research into other code:
> 
>   * Box64 seems to have a custom allocator based on reading 
> /proc/self/maps to allocate a block of VA space with a low enough 
> address [1]
> 
>   * PHP has code reading /proc/self/maps - I think this is to find a 
> segment which is close enough to the text segment [2]
> 
>   * FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48-bit
> addresses [3][4]

Can't the limited number of applications that need to restrict the upper
bound use an LD_PRELOAD-compatible library to do this?
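
A sketch of that idea, for concreteness (the steering address is an
arbitrary choice and the interposer is invented for illustration; note
that a plain hint cannot hard-guarantee a low address the way
MAP_BELOW_HINT would once the region around the hint fills up):

  #define _GNU_SOURCE
  #include <dlfcn.h>
  #include <sys/mman.h>

  /* Interpose mmap() and supply a low hint when the caller passed
   * none, steering allocations below 2^47. */
  void *mmap(void *addr, size_t len, int prot, int flags, int fd,
             off_t off)
  {
          static void *(*real_mmap)(void *, size_t, int, int, int,
                                    off_t);

          if (!real_mmap)
                  real_mmap = (void *(*)(void *, size_t, int, int,
                                         int, off_t))
                              dlsym(RTLD_NEXT, "mmap");
          if (!addr && !(flags & MAP_FIXED))
                  addr = (void *)(1UL << 46);
          return real_mmap(addr, len, prot, flags, fd, off);
  }

Compiled into a shared object and injected with LD_PRELOAD (older glibc
also needs -ldl), this affects the whole process, though - which is the
per-library problem described below.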

> 
>   * pmdk has some funky code to find the lowest address that meets
> certain requirements - this does look like an ASLR alternative and
> probably couldn't directly use MAP_BELOW_HINT, although maybe this
> suggests we need a mechanism to map within a VA range? [5]
> 
>   * MIT-Scheme parses /proc/self/maps to find the lowest mapping within 
> a range [6]
> 
>   * LuaJIT 'probes' candidate addresses to find a suitable low address
> for allocation [7]
> 

Although I did not take a deep dive into each example above, there are
some very odd things being done, and we will never cover all the use
cases with an exact API match.  What we have today can be made to work
for these users, as they have figured out ways to do it.

Are they pretty? No.  Are they common? No.  I'm not sure it's worth
plumbing new MM code in for these users.

> The biggest benefit I see of MAP_BELOW_HINT is that it would allow a
> library to get low addresses without causing any problems for the rest
> of the application. The use case I'm looking at is in a library and 
> therefore a personality mode wouldn't be appropriate (because I don't 
> want to affect the rest of the application). Reading /proc/self/maps
> is also problematic because other threads could be allocating/freeing
> at the same time.

That works as long as you don't exhaust the lower range you are trying
to allocate within - which is exactly the issue riscv is hitting.

I understand that you are providing examples to prove that this is
needed, but I feel like you are instead demonstrating that the
flexibility exists to implement solutions in different ways using
today's API.

I think it would be best to use the existing methods and work around the
issue that was created in riscv, while future changes could mirror amd64
and arm64.

...
> 
> 
> [1] https://sources.debian.org/src/box64/0.3.0+dfsg-1/src/custommem.c/
> [2] 
> https://sources.debian.org/src/php8.2/8.2.24-1/ext/opcache/shared_alloc_mmap.c/#L62
> [3] 
> https://github.com/FEX-Emu/FEX/blob/main/FEXCore/Source/Utils/Allocator.cpp
> [4] 
> https://github.com/FEX-Emu/FEX/commit/df2f1ad074e5cdfb19a0bd4639b7604f777fb05c
> [5] 
> https://sources.debian.org/src/pmdk/1.13.1-1.1/src/common/mmap_posix.c/?hl=29#L29
> [6] https://sources.debian.org/src/mit-scheme/12.1-3/src/microcode/ux.c/#L826
> [7] 
> https://sources.debi

Re: [PATCH v6 6/8] x86/module: prepare module loading for ROX allocations of text

2024-10-21 Thread Nathan Chancellor
Hi Mike,

On Wed, Oct 16, 2024 at 03:24:22PM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" 
> 
> When module text memory is allocated with ROX permissions, the memory
> at the actual address where the module will live contains invalid
> instructions, and there is a writable copy that contains the actual
> module code.
>
> Update relocations and alternatives patching to deal with it.
> 
> Signed-off-by: Mike Rapoport (Microsoft) 

Sorry that you have to hear from me again :) It seems that module
loading is still broken with this version of the patch - something I
missed in my earlier testing, since my regular virtual machine testing
only covers a monolithic kernel. If I build and install the kernel and
modules in the VM via a distribution package, I get the following splat
at boot:

  Starting systemd-udevd version 256.7-1-arch
  [0.882312] SMP alternatives: Something went horribly wrong trying to 
rewrite the CFI implementation.
  [0.883526] CFI failure at do_one_initcall+0x128/0x380 (target: 
init_module+0x0/0xff0 [crc32c_intel]; expected type: 0x0c7a3a22)
  [0.884802] Oops: invalid opcode:  [#1] PREEMPT SMP NOPTI
  [0.885434] CPU: 3 UID: 0 PID: 157 Comm: modprobe Tainted: G    W  
6.12.0-rc3-debug-next-20241021-06324-g63b3ff03d91a #1 
291f0fd70f293827edec681d3c5304f5807a3c7b
  [0.887084] Tainted: [W]=WARN
  [0.887409] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
unknown 2/2/2022
  [0.888241] RIP: 0010:do_one_initcall+0x128/0x380
  [0.888720] Code: f3 0f 1e fa 41 be ff ff ff ff e9 0f 01 00 00 0f 1f 44 00 
00 41 81 e7 ff ff ff 7f 49 89 db 41 ba de c5 85 f3 45 03 53 f1 74 02 <0f> 0b 41 
ff d3 0f 1f 00 41 89 c6 0f 1f 44 00 00 c6 04 24 00 65 8b
  [0.890598] RSP: 0018:ff3f93e5c052f970 EFLAGS: 00010217
  [0.891129] RAX: b4c105b8 RBX: c0602010 RCX: 

  [0.891850] RDX:  RSI:  RDI: 
c0602010
  [0.892588] RBP: ff3f93e5c052fc88 R08: 0020 R09: 

  [0.893305] R10: 2a378b84 R11: c0602010 R12: 
69c6
  [0.894003] R13: ff1f0090c5596900 R14: ff1f0090c15a55c0 R15: 

  [0.894693] FS:  7ffb712c0740() GS:ff1f00942fb8() 
knlGS:
  [0.895453] CS:  0010 DS:  ES:  CR0: 80050033
  [0.896020] CR2: 7c4424c8 CR3: 000100af4002 CR4: 
00771ef0
  [0.896698] DR0:  DR1:  DR2: 

  [0.897391] DR3:  DR6: fffe0ff0 DR7: 
0400
  [0.898077] PKRU: 5554
  [0.898337] Call Trace:
  [0.898577]  
  [0.898784]  ? __die_body+0x6a/0xb0
  [0.899132]  ? die+0xa4/0xd0
  [0.899413]  ? do_trap+0xa6/0x180
  [0.899740]  ? do_one_initcall+0x128/0x380
  [0.900130]  ? do_one_initcall+0x128/0x380
  [0.900523]  ? handle_invalid_op+0x6a/0x90
  [0.900917]  ? do_one_initcall+0x128/0x380
  [0.901311]  ? exc_invalid_op+0x38/0x60
  [0.901679]  ? asm_exc_invalid_op+0x1a/0x20
  [0.902081]  ? __cfi_init_module+0x10/0x10 [crc32c_intel 
5331566c5540f82df397056699bc4ddac8be1306]
  [0.902933]  ? __cfi_init_module+0x10/0x10 [crc32c_intel 
5331566c5540f82df397056699bc4ddac8be1306]
  [0.903781]  ? __cfi_init_module+0x10/0x10 [crc32c_intel 
5331566c5540f82df397056699bc4ddac8be1306]
  [0.904634]  ? do_one_initcall+0x128/0x380
  [0.905028]  ? idr_alloc_cyclic+0x139/0x1d0
  [0.905437]  ? security_kernfs_init_security+0x54/0x190
  [0.905958]  ? __kernfs_new_node+0x1ba/0x240
  [0.906377]  ? sysfs_create_dir_ns+0x8f/0x140
  [0.906795]  ? kernfs_link_sibling+0xf2/0x110
  [0.907211]  ? kernfs_activate+0x2c/0x110
  [0.907599]  ? kernfs_add_one+0x108/0x150
  [0.907981]  ? __kernfs_create_file+0x75/0xa0
  [0.908407]  ? sysfs_create_bin_file+0xc6/0x120
  [0.908853]  ? __vunmap_range_noflush+0x347/0x420
  [0.909313]  ? _raw_spin_unlock+0xe/0x30
  [0.909692]  ? free_unref_page+0x22c/0x4c0
  [0.910097]  ? __kmalloc_cache_noprof+0x1a8/0x360
  [0.910546]  do_init_module+0x60/0x250
  [0.910910]  __se_sys_finit_module+0x316/0x420
  [0.911351]  do_syscall_64+0x88/0x170
  [0.911699]  ? __x64_sys_lseek+0x68/0xb0
  [0.912077]  ? syscall_exit_to_user_mode+0x97/0xc0
  [0.912538]  ? do_syscall_64+0x94/0x170
  [0.912902]  ? syscall_exit_to_user_mode+0x97/0xc0
  [0.913353]  ? do_syscall_64+0x94/0x170
  [0.913709]  ? clear_bhb_loop+0x45/0xa0
  [0.914071]  ? clear_bhb_loop+0x45/0xa0
  [0.914428]  ? clear_bhb_loop+0x45/0xa0
  [0.914767]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [0.915089] RIP: 0033:0x7ffb713dc1fd
  [0.915316] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 
f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0