Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-10-23 Thread Steven Price
Hi Liam,

On 21/10/2024 20:48, Liam R. Howlett wrote:
> * Steven Price  [241021 09:23]:
>> On 09/09/2024 10:46, Kirill A. Shutemov wrote:
>>> On Thu, Sep 05, 2024 at 10:26:52AM -0700, Charlie Jenkins wrote:
 On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote:
> On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote:
>> Some applications rely on placing data in free bits of addresses allocated
>> by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the
>> address returned by mmap to be less than the 48-bit address space,
>> unless the hint address uses more than 47 bits (the 48th bit is reserved
>> for the kernel address space).
>>
>> The riscv architecture needs a way to similarly restrict the virtual
>> address space. On the riscv port of OpenJDK, an error is thrown when it
>> is run on the 57-bit address space, called sv57 [1].  golang
>> has a comment that sv57 support is not complete, but there are some
>> workarounds to get it to mostly work [2].
>>>
>>> I also saw libmozjs crashing with 57-bit address space on x86.
>>>
>> These applications work on x86 because x86 implicitly restricts mmap()
>> to a 47-bit address space when the hint address is less than 48 bits.
>>
>> Instead of implicitly restricting the address space on riscv (or any
>> current/future architecture), a flag would allow users to opt-in to this
>> behavior rather than opt-out as is done on other architectures. This is
>> desirable because only a small class of applications does pointer
>> masking.
>>>
>>> You reiterate the argument about "small class of applications". But it
>>> makes no sense to me.
>>
>> Sorry to chime in late on this - I had been considering implementing
>> something like MAP_BELOW_HINT and found this thread.
>>
>> While the examples of applications that want to use high VA bits and get
>> bitten by future upgrades are not very persuasive, it's worth pointing
>> out that there are a variety of somewhat horrid hacks out there to work
>> around this feature not existing.
>>
>> E.g. from my brief research into other code:
>>
>>   * Box64 seems to have a custom allocator based on reading 
>> /proc/self/maps to allocate a block of VA space with a low enough 
>> address [1]
>>
>>   * PHP has code reading /proc/self/maps - I think this is to find a 
>> segment which is close enough to the text segment [2]
>>
>>   * FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48 bit
>> addresses [3][4]
> 
> Can't the limited number of applications that need to restrict the upper
> bound use an LD_PRELOAD compatible library to do this?

I'm not entirely sure what point you are making here. Yes an LD_PRELOAD
approach could be used instead of a personality type as a 'hack' to
preallocate the upper address space. The obvious disadvantage is that
you can't (easily) layer LD_PRELOAD so it won't work in the general case.
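
For reference, the preallocation approach being discussed boils down to
something like the sketch below (my illustration, not code from any of the
projects above; the 47-bit cut-off and the assumed top of the user address
space are placeholder values):

#define _GNU_SOURCE
#include <stdint.h>
#include <sys/mman.h>

/*
 * Sketch of the "reserve the high VA range" workaround: map everything
 * above the desired limit PROT_NONE early on, so later mmap(NULL, ...)
 * calls can only return addresses below the limit.
 */
static int reserve_high_va(void)
{
        uintptr_t low  = 1ULL << 47;    /* want real allocations below this */
        uintptr_t high = 1ULL << 56;    /* assumed top of usable user VA */

        /* PROT_NONE + MAP_NORESERVE: burns address space, not memory */
        void *p = mmap((void *)low, high - low, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE |
                       MAP_FIXED_NOREPLACE, -1, 0);

        return p == MAP_FAILED ? -1 : 0;
}

MAP_FIXED_NOREPLACE makes the reservation fail if anything already lives in
that range, which is exactly why this only works when done early enough.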

>>
>>   * pmdk has some funky code to find the lowest address that meets 
>> certain requirements - this does look like an ASLR alternative and 
>> probably couldn't directly use MAP_BELOW_HINT, although maybe this 
>> suggests we need a mechanism to map without a VA-range? [5]
>>
>>   * MIT-Scheme parses /proc/self/maps to find the lowest mapping within 
>> a range [6]
>>
>>   * LuaJIT uses an approach to 'probe' to find a suitable low address 
>> for allocation [7]
>>
> 
> Although I did not take a deep dive into each example above, there are
> some very odd things being done; we will never cover all the use cases
> with an exact API match.  What we have today can be made to work for
> these users as they have figured ways to do it.
> 
> Are they pretty? no.  Are they common? no.  I'm not sure it's worth
> plumbing in new MM code for these users.

My issue with the existing 'solutions' is that they all seem to be fragile:

 * Using /proc/self/maps is inherently racy if there could be any other
code running in the process at the same time.

 * Attempting to map the upper part of the address space only works if
done early enough - once an allocation arrives there, there's very
little you can robustly do (because the stray allocation might be freed).

 * LuaJIT's probing mechanism is probably robust, but it's inefficient -
LuaJIT has a fallback of linear probing, followed by no hint (ASLR),
followed by pseudo-random probing. I don't know the history of the code,
but it looks like it's probably been tweaked to try to avoid performance
issues.
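
For comparison, the probing idea amounts to roughly the following (my sketch
of the general technique, not LuaJIT's actual code; the starting hint and
the step size are arbitrary):

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Ask for memory at increasing hint addresses below the limit and give
 * back anything the kernel places too high.  The hint is only advisory,
 * hence the check on the returned address.
 */
static void *mmap_below(uintptr_t limit, size_t size)
{
        for (uintptr_t hint = 1UL << 28; hint + size <= limit; hint += limit / 32) {
                void *p = mmap((void *)hint, size, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED)
                        continue;
                if ((uintptr_t)p + size <= limit)
                        return p;       /* low enough, keep it */
                munmap(p, size);        /* too high, try the next hint */
        }
        return NULL;
}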

>> The biggest benefit I see of MAP_BELOW_HINT is that it would allow a
>> library to get low addresses without causing any problems for the rest
>> of the application. The use case I'm looking at is in a library and 
>> therefore a personality mode wouldn't be appropriate (because I don't 
>> want to affect the rest of the application). Reading /proc/self/maps
>> is also problematic because other threads could be allocating/freeing
>> at the same time.
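
For context, usage of the flag proposed in this RFC would look roughly like
the sketch below (MAP_BELOW_HINT has no allocated flag value yet, so the
definition here is purely illustrative):

#include <stdint.h>
#include <sys/mman.h>

#ifndef MAP_BELOW_HINT
#define MAP_BELOW_HINT  0x8000000       /* illustrative value only */
#endif

/*
 * With the proposed flag the hint becomes an upper bound: the kernel
 * returns a mapping that fits entirely below it instead of falling back
 * to an arbitrary address when the hinted spot is unavailable.
 */
static void *alloc_below_47bit(size_t size)
{
        return mmap((void *)(1UL << 47), size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_BELOW_HINT, -1, 0);
}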

[PATCH v7 0/8] x86/module: use large ROX pages for text allocations

2024-10-23 Thread Mike Rapoport
From: "Mike Rapoport (Microsoft)" 

Hi,

This is an updated version of execmem ROX caches.

v6: https://lore.kernel.org/all/20241016122424.1655560-1-r...@kernel.org
* Fixed handling of alternatives for fineibt (kbuild bot)
* Restored usage of text_poke_early for ftrace boot time initialization (Steve)
* Made !module path in module_writable_address inline

v5: https://lore.kernel.org/all/20241009180816.83591-1-r...@kernel.org 
* Dropped check for !area in mas_for_each() loop (Kees Bakker)
* Dropped externs in include/linux/vmalloc.h (Christoph)
* Fixed handling of alternatives for CFI-enabled configs (Nathan)
* Fixed interaction with kmemleak (Sergey).
  It looks like execmem and kmemleak interaction should be improved
  further, but it's out of scope of this series.
* Added ARCH_HAS_EXECMEM_ROX configuration option to arch/Kconfig. The
  option serves two purposes:
  - make sure architecture that uses ROX caches implements
execmem_fill_trapping_insns() callback (Christoph)
  - make sure entire physical memory is mapped in the direct map (Dave) 

v4: https://lore.kernel.org/all/20241007062858.44248-1-r...@kernel.org
* Fix copy/paste error in loongarch (Huacai)

v3: https://lore.kernel.org/all/20240909064730.3290724-1-r...@kernel.org
* Drop ftrace_swap_func(). It is not needed because mcount array lives
  in a data section (Peter)
* Update maple_tree usage (Liam)
* Set ->fill_trapping_insns pointer on init (Ard)
* Instead of using VM_FLUSH_RESET_PERMS for execmem cache, completely
  remove it from the direct map

v2: https://lore.kernel.org/all/20240826065532.2618273-1-r...@kernel.org
* add comment why ftrace_swap_func() is needed (Steve)

Since RFC: https://lore.kernel.org/all/20240411160526.2093408-1-r...@kernel.org
* update changelog about HUGE_VMAP allocations (Christophe) 
* move module_writable_address() from x86 to modules core (Ingo)
* rename execmem_invalidate() to execmem_fill_trapping_insns() (Peter)
* call alternatives_smp_unlock() after module text in-place is up to
  date (Nadav)

= Original cover letter =

These patches add support for using large ROX pages for allocations of
executable memory on x86.

They address Andy's comments [1] about having executable mappings for code
that was not completely formed.

The approach taken is to allocate ROX memory along with writable but not
executable memory and use the writable copy to perform relocations and
alternatives patching. After the module text gets into its final shape, the
contents of the writable memory are copied into the actual ROX location
using text poking.

The allocations of the ROX memory use vmalloc(VM_ALLOW_HUGE_VMAP) to
allocate PMD-aligned memory, fill that memory with invalid instructions and
in the end remap it as ROX. Portions of these large pages are handed out to
execmem_alloc() callers without any changes to the permissions. When the
memory is freed with execmem_free() it is invalidated again so that it
won't contain stale instructions.

The module memory allocation and the x86 code dealing with relocations and
alternatives patching take into account the existence of the two copies:
the writable memory and the ROX memory at the actual allocated virtual
address.
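
Schematically, the flow described above boils down to the sketch below (a
condensed illustration using the interfaces introduced later in the series,
assuming the ROX cache is in use for EXECMEM_MODULE_TEXT; apply_relocations()
is a stand-in for the arch relocation and alternatives code, and error
handling is omitted):

#include <linux/execmem.h>
#include <linux/vmalloc.h>
#include <linux/string.h>

/* stand-in for the arch-specific relocation/alternatives code */
void apply_relocations(void *rw_copy, void *final_addr, size_t size);

static void *load_text_sketch(const void *section, size_t size)
{
        /* ROX memory at its final address, pre-filled with trapping insns */
        void *rox = execmem_alloc(EXECMEM_MODULE_TEXT, size);
        /* plain writable scratch copy */
        void *rw = vmalloc(size);

        memcpy(rw, section, size);
        /* patch the RW copy, but compute addresses against the ROX location */
        apply_relocations(rw, rox, size);

        /* copy the finished text into place using text poking */
        execmem_update_copy(rox, rw, size);
        vfree(rw);

        return rox;
}

In the actual series the writable copy lives in struct module_memory
(rw_copy) and is managed by the module core.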

The patches are available at git:
https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/x86-rox/v6

[1] 
https://lore.kernel.org/all/a17c65c6-863f-4026-9c6f-a04b659e9...@app.fastmail.com

Mike Rapoport (Microsoft) (8):
  mm: vmalloc: group declarations depending on CONFIG_MMU together
  mm: vmalloc: don't account for number of nodes for HUGE_VMAP allocations
  asm-generic: introduce text-patching.h
  module: prepare to handle ROX allocations for text
  arch: introduce set_direct_map_valid_noflush()
  x86/module: prepare module loading for ROX allocations of text
  execmem: add support for cache of large ROX pages
  x86/module: enable ROX caches for module text on 64 bit

 arch/Kconfig  |   8 +
 arch/alpha/include/asm/Kbuild |   1 +
 arch/arc/include/asm/Kbuild   |   1 +
 .../include/asm/{patch.h => text-patching.h}  |   0
 arch/arm/kernel/ftrace.c  |   2 +-
 arch/arm/kernel/jump_label.c  |   2 +-
 arch/arm/kernel/kgdb.c|   2 +-
 arch/arm/kernel/patch.c   |   2 +-
 arch/arm/probes/kprobes/core.c|   2 +-
 arch/arm/probes/kprobes/opt-arm.c |   2 +-
 arch/arm64/include/asm/set_memory.h   |   1 +
 .../asm/{patching.h => text-patching.h}   |   0
 arch/arm64/kernel/ftrace.c|   2 +-
 arch/arm64/kernel/jump_label.c|   2 +-
 arch/arm64/kernel/kgdb.c  |   2 +-
 arch/arm64/kernel/patching.c  |   2 +-
 arch/arm64/kernel/probes/kprobes.c|   2 +-
 arch/arm64/kernel/traps.c |   2 +-
 arch/arm64/mm/pageattr.c  |  10 +
 arch/arm64/net/bpf_jit_comp.c  

[PATCH v7 4/8] module: prepare to handle ROX allocations for text

2024-10-23 Thread Mike Rapoport
From: "Mike Rapoport (Microsoft)" 

In order to support ROX allocations for module text, it is necessary to
handle modifications to the code, such as relocations and alternatives
patching, without write access to that memory.

One option is to use text patching, but this would make module loading
extremely slow and would expose executable code that is not yet fully formed.

A better way is to have memory allocated with ROX permissions contain
invalid instructions and keep a writable, but not executable copy of the
module text. The relocations and alternative patches would be done on the
writable copy using the addresses of the ROX memory.
Once the module is completely ready, the updated text will be copied to ROX
memory using text patching in one go and the writable copy will be freed.

Add support for that to module initialization code and provide necessary
interfaces in execmem.
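
The calling pattern these interfaces enable on the arch side is roughly the
following (an illustration of the intended usage; patch_insn() is a made-up
helper, but the shape mirrors what the later x86 patches do):

#include <linux/module.h>
#include <linux/string.h>

/*
 * Writes go through the writable alias returned by
 * module_writable_address(); address calculations (e.g. relative
 * displacements) elsewhere must still use @loc, the address that will
 * actually execute.
 */
static void patch_insn(struct module *mod, u8 *loc, const u8 *insn, size_t len)
{
        u8 *wr = module_writable_address(mod, loc); /* == loc when text is plain RW */

        memcpy(wr, insn, len);
}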

Signed-off-by: Mike Rapoport (Microsoft) 
Reviewed-by: Luis Chamberlain 
Tested-by: kdevops 
---
 include/linux/execmem.h| 23 +++
 include/linux/module.h | 16 
 include/linux/moduleloader.h   |  4 ++
 kernel/module/debug_kmemleak.c |  3 +-
 kernel/module/main.c   | 74 ++
 kernel/module/strict_rwx.c |  3 ++
 mm/execmem.c   | 11 +
 7 files changed, 126 insertions(+), 8 deletions(-)

diff --git a/include/linux/execmem.h b/include/linux/execmem.h
index 32cef1144117..dfdf19f8a5e8 100644
--- a/include/linux/execmem.h
+++ b/include/linux/execmem.h
@@ -46,9 +46,11 @@ enum execmem_type {
 /**
  * enum execmem_range_flags - options for executable memory allocations
  * @EXECMEM_KASAN_SHADOW:  allocate kasan shadow
+ * @EXECMEM_ROX_CACHE: allocations should use ROX cache of huge pages
  */
 enum execmem_range_flags {
EXECMEM_KASAN_SHADOW= (1 << 0),
+   EXECMEM_ROX_CACHE   = (1 << 1),
 };
 
 /**
@@ -123,6 +125,27 @@ void *execmem_alloc(enum execmem_type type, size_t size);
  */
 void execmem_free(void *ptr);
 
+/**
+ * execmem_update_copy - copy an update to executable memory
+ * @dst:  destination address to update
+ * @src:  source address containing the data
+ * @size: how many bytes of memory should be copied
+ *
+ * Copy @size bytes from @src to @dst using text poking if the memory at
+ * @dst is read-only.
+ *
+ * Return: a pointer to @dst or NULL on error
+ */
+void *execmem_update_copy(void *dst, const void *src, size_t size);
+
+/**
+ * execmem_is_rox - check if execmem is read-only
+ * @type - the execmem type to check
+ *
+ * Return: %true if the @type is read-only, %false if it's writable
+ */
+bool execmem_is_rox(enum execmem_type type);
+
 #if defined(CONFIG_EXECMEM) && !defined(CONFIG_ARCH_WANTS_EXECMEM_LATE)
 void execmem_init(void);
 #else
diff --git a/include/linux/module.h b/include/linux/module.h
index 88ecc5e9f523..2a9386cbdf85 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -367,6 +367,8 @@ enum mod_mem_type {
 
 struct module_memory {
void *base;
+   void *rw_copy;
+   bool is_rox;
unsigned int size;
 
 #ifdef CONFIG_MODULES_TREE_LOOKUP
@@ -767,6 +769,15 @@ static inline bool is_livepatch_module(struct module *mod)
 
 void set_module_sig_enforced(void);
 
+void *__module_writable_address(struct module *mod, void *loc);
+
+static inline void *module_writable_address(struct module *mod, void *loc)
+{
+   if (!IS_ENABLED(CONFIG_ARCH_HAS_EXECMEM_ROX) || !mod)
+   return loc;
+   return __module_writable_address(mod, loc);
+}
+
 #else /* !CONFIG_MODULES... */
 
 static inline struct module *__module_address(unsigned long addr)
@@ -874,6 +885,11 @@ static inline bool module_is_coming(struct module *mod)
 {
return false;
 }
+
+static inline void *module_writable_address(struct module *mod, void *loc)
+{
+   return loc;
+}
 #endif /* CONFIG_MODULES */
 
 #ifdef CONFIG_SYSFS
diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
index e395461d59e5..1f5507ba5a12 100644
--- a/include/linux/moduleloader.h
+++ b/include/linux/moduleloader.h
@@ -108,6 +108,10 @@ int module_finalize(const Elf_Ehdr *hdr,
const Elf_Shdr *sechdrs,
struct module *mod);
 
+int module_post_finalize(const Elf_Ehdr *hdr,
+const Elf_Shdr *sechdrs,
+struct module *mod);
+
 #ifdef CONFIG_MODULES
 void flush_module_init_free_work(void);
 #else
diff --git a/kernel/module/debug_kmemleak.c b/kernel/module/debug_kmemleak.c
index b4cc03842d70..df873dad049d 100644
--- a/kernel/module/debug_kmemleak.c
+++ b/kernel/module/debug_kmemleak.c
@@ -14,7 +14,8 @@ void kmemleak_load_module(const struct module *mod,
 {
/* only scan writable, non-executable sections */
for_each_mod_mem_type(type) {
-   if (type != MOD_DATA && type != MOD_INIT_DATA)
+   if (type != MOD_DATA && type != MOD_INIT_DATA &&
+   !mod->mem[type].is_r

[PATCH v7 1/8] mm: vmalloc: group declarations depending on CONFIG_MMU together

2024-10-23 Thread Mike Rapoport
From: "Mike Rapoport (Microsoft)" 

There are a couple of declarations that depend on CONFIG_MMU in
include/linux/vmalloc.h spread all over the file.

Group them all together to improve code readability.

No functional changes.

Signed-off-by: Mike Rapoport (Microsoft) 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Uladzislau Rezki (Sony) 
Reviewed-by: Luis Chamberlain 
Tested-by: kdevops 
---
 include/linux/vmalloc.h | 60 +
 1 file changed, 24 insertions(+), 36 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index ad2ce7a6ab7a..27408f21e501 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -134,12 +134,6 @@ extern void vm_unmap_ram(const void *mem, unsigned int 
count);
 extern void *vm_map_ram(struct page **pages, unsigned int count, int node);
 extern void vm_unmap_aliases(void);
 
-#ifdef CONFIG_MMU
-extern unsigned long vmalloc_nr_pages(void);
-#else
-static inline unsigned long vmalloc_nr_pages(void) { return 0; }
-#endif
-
 extern void *vmalloc_noprof(unsigned long size) __alloc_size(1);
 #define vmalloc(...)   alloc_hooks(vmalloc_noprof(__VA_ARGS__))
 
@@ -266,12 +260,29 @@ static inline bool is_vm_area_hugepages(const void *addr)
 #endif
 }
 
+/* for /proc/kcore */
+long vread_iter(struct iov_iter *iter, const char *addr, size_t count);
+
+/*
+ * Internals.  Don't use..
+ */
+__init void vm_area_add_early(struct vm_struct *vm);
+__init void vm_area_register_early(struct vm_struct *vm, size_t align);
+
+int register_vmap_purge_notifier(struct notifier_block *nb);
+int unregister_vmap_purge_notifier(struct notifier_block *nb);
+
 #ifdef CONFIG_MMU
+#define VMALLOC_TOTAL (VMALLOC_END - VMALLOC_START)
+
+unsigned long vmalloc_nr_pages(void);
+
 int vm_area_map_pages(struct vm_struct *area, unsigned long start,
  unsigned long end, struct page **pages);
 void vm_area_unmap_pages(struct vm_struct *area, unsigned long start,
 unsigned long end);
 void vunmap_range(unsigned long addr, unsigned long end);
+
 static inline void set_vm_flush_reset_perms(void *addr)
 {
struct vm_struct *vm = find_vm_area(addr);
@@ -279,24 +290,14 @@ static inline void set_vm_flush_reset_perms(void *addr)
if (vm)
vm->flags |= VM_FLUSH_RESET_PERMS;
 }
+#else  /* !CONFIG_MMU */
+#define VMALLOC_TOTAL 0UL
 
-#else
-static inline void set_vm_flush_reset_perms(void *addr)
-{
-}
-#endif
-
-/* for /proc/kcore */
-extern long vread_iter(struct iov_iter *iter, const char *addr, size_t count);
-
-/*
- * Internals.  Don't use..
- */
-extern __init void vm_area_add_early(struct vm_struct *vm);
-extern __init void vm_area_register_early(struct vm_struct *vm, size_t align);
+static inline unsigned long vmalloc_nr_pages(void) { return 0; }
+static inline void set_vm_flush_reset_perms(void *addr) {}
+#endif /* CONFIG_MMU */
 
-#ifdef CONFIG_SMP
-# ifdef CONFIG_MMU
+#if defined(CONFIG_MMU) && defined(CONFIG_SMP)
 struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 const size_t *sizes, int nr_vms,
 size_t align);
@@ -311,22 +312,9 @@ pcpu_get_vm_areas(const unsigned long *offsets,
return NULL;
 }
 
-static inline void
-pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms)
-{
-}
-# endif
-#endif
-
-#ifdef CONFIG_MMU
-#define VMALLOC_TOTAL (VMALLOC_END - VMALLOC_START)
-#else
-#define VMALLOC_TOTAL 0UL
+static inline void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms) {}
 #endif
 
-int register_vmap_purge_notifier(struct notifier_block *nb);
-int unregister_vmap_purge_notifier(struct notifier_block *nb);
-
 #if defined(CONFIG_MMU) && defined(CONFIG_PRINTK)
 bool vmalloc_dump_obj(void *object);
 #else
-- 
2.43.0




[PATCH v7 2/8] mm: vmalloc: don't account for number of nodes for HUGE_VMAP allocations

2024-10-23 Thread Mike Rapoport
From: "Mike Rapoport (Microsoft)" 

vmalloc allocations with VM_ALLOW_HUGE_VMAP that do not explicitly
specify node ID will use huge pages only if size_per_node is larger than
a huge page.
Still, the actual allocated memory is not distributed between nodes, and
there is no advantage in such an approach.
On the contrary, BPF allocates SZ_2M * num_possible_nodes() for each
new bpf_prog_pack, while it could do with a single huge page per pack.

Don't account for number of nodes for VM_ALLOW_HUGE_VMAP with
NUMA_NO_NODE and use huge pages whenever the requested allocation size
is larger than a huge page.
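
From a caller's perspective the effect is that a single 2M request is now
enough to get a PMD mapping regardless of the number of nodes, e.g. (a
sketch of a bpf_prog_pack-like user, not code from this series):

#include <linux/gfp.h>
#include <linux/sizes.h>
#include <linux/vmalloc.h>

/* one huge page per pack instead of sizing by num_possible_nodes() */
static void *alloc_pack(void)
{
        return vmalloc_huge(SZ_2M, GFP_KERNEL);
}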

Signed-off-by: Mike Rapoport (Microsoft) 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Uladzislau Rezki (Sony) 
Reviewed-by: Luis Chamberlain 
Tested-by: kdevops 
---
 mm/vmalloc.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 634162271c00..86b2344d7461 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3763,8 +3763,6 @@ void *__vmalloc_node_range_noprof(unsigned long size, 
unsigned long align,
}
 
if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP)) {
-   unsigned long size_per_node;
-
/*
 * Try huge pages. Only try for PAGE_KERNEL allocations,
 * others like modules don't yet expect huge pages in
@@ -3772,13 +3770,10 @@ void *__vmalloc_node_range_noprof(unsigned long size, 
unsigned long align,
 * supporting them.
 */
 
-   size_per_node = size;
-   if (node == NUMA_NO_NODE)
-   size_per_node /= num_online_nodes();
-   if (arch_vmap_pmd_supported(prot) && size_per_node >= PMD_SIZE)
+   if (arch_vmap_pmd_supported(prot) && size >= PMD_SIZE)
shift = PMD_SHIFT;
else
-   shift = arch_vmap_pte_supported_shift(size_per_node);
+   shift = arch_vmap_pte_supported_shift(size);
 
align = max(real_align, 1UL << shift);
size = ALIGN(real_size, 1UL << shift);
-- 
2.43.0




[PATCH v7 3/8] asm-generic: introduce text-patching.h

2024-10-23 Thread Mike Rapoport
From: "Mike Rapoport (Microsoft)" 

Several architectures support text patching, but they name the header
files that declare patching functions differently.

Make all such headers consistently named text-patching.h and add an empty
header in asm-generic for architectures that do not support text patching.

Signed-off-by: Mike Rapoport (Microsoft) 
Reviewed-by: Christoph Hellwig 
Acked-by: Geert Uytterhoeven  # m68k
Acked-by: Arnd Bergmann 
Reviewed-by: Luis Chamberlain 
Tested-by: kdevops 
---
 arch/alpha/include/asm/Kbuild |  1 +
 arch/arc/include/asm/Kbuild   |  1 +
 arch/arm/include/asm/{patch.h => text-patching.h} |  0
 arch/arm/kernel/ftrace.c  |  2 +-
 arch/arm/kernel/jump_label.c  |  2 +-
 arch/arm/kernel/kgdb.c|  2 +-
 arch/arm/kernel/patch.c   |  2 +-
 arch/arm/probes/kprobes/core.c|  2 +-
 arch/arm/probes/kprobes/opt-arm.c |  2 +-
 .../include/asm/{patching.h => text-patching.h}   |  0
 arch/arm64/kernel/ftrace.c|  2 +-
 arch/arm64/kernel/jump_label.c|  2 +-
 arch/arm64/kernel/kgdb.c  |  2 +-
 arch/arm64/kernel/patching.c  |  2 +-
 arch/arm64/kernel/probes/kprobes.c|  2 +-
 arch/arm64/kernel/traps.c |  2 +-
 arch/arm64/net/bpf_jit_comp.c |  2 +-
 arch/csky/include/asm/Kbuild  |  1 +
 arch/hexagon/include/asm/Kbuild   |  1 +
 arch/loongarch/include/asm/Kbuild |  1 +
 arch/m68k/include/asm/Kbuild  |  1 +
 arch/microblaze/include/asm/Kbuild|  1 +
 arch/mips/include/asm/Kbuild  |  1 +
 arch/nios2/include/asm/Kbuild |  1 +
 arch/openrisc/include/asm/Kbuild  |  1 +
 .../include/asm/{patch.h => text-patching.h}  |  0
 arch/parisc/kernel/ftrace.c   |  2 +-
 arch/parisc/kernel/jump_label.c   |  2 +-
 arch/parisc/kernel/kgdb.c |  2 +-
 arch/parisc/kernel/kprobes.c  |  2 +-
 arch/parisc/kernel/patch.c|  2 +-
 arch/powerpc/include/asm/kprobes.h|  2 +-
 .../asm/{code-patching.h => text-patching.h}  |  0
 arch/powerpc/kernel/crash_dump.c  |  2 +-
 arch/powerpc/kernel/epapr_paravirt.c  |  2 +-
 arch/powerpc/kernel/jump_label.c  |  2 +-
 arch/powerpc/kernel/kgdb.c|  2 +-
 arch/powerpc/kernel/kprobes.c |  2 +-
 arch/powerpc/kernel/module_32.c   |  2 +-
 arch/powerpc/kernel/module_64.c   |  2 +-
 arch/powerpc/kernel/optprobes.c   |  2 +-
 arch/powerpc/kernel/process.c |  2 +-
 arch/powerpc/kernel/security.c|  2 +-
 arch/powerpc/kernel/setup_32.c|  2 +-
 arch/powerpc/kernel/setup_64.c|  2 +-
 arch/powerpc/kernel/static_call.c |  2 +-
 arch/powerpc/kernel/trace/ftrace.c|  2 +-
 arch/powerpc/kernel/trace/ftrace_64_pg.c  |  2 +-
 arch/powerpc/lib/code-patching.c  |  2 +-
 arch/powerpc/lib/feature-fixups.c |  2 +-
 arch/powerpc/lib/test-code-patching.c |  2 +-
 arch/powerpc/lib/test_emulate_step.c  |  2 +-
 arch/powerpc/mm/book3s32/mmu.c|  2 +-
 arch/powerpc/mm/book3s64/hash_utils.c |  2 +-
 arch/powerpc/mm/book3s64/slb.c|  2 +-
 arch/powerpc/mm/kasan/init_32.c   |  2 +-
 arch/powerpc/mm/mem.c |  2 +-
 arch/powerpc/mm/nohash/44x.c  |  2 +-
 arch/powerpc/mm/nohash/book3e_pgtable.c   |  2 +-
 arch/powerpc/mm/nohash/tlb.c  |  2 +-
 arch/powerpc/mm/nohash/tlb_64e.c  |  2 +-
 arch/powerpc/net/bpf_jit_comp.c   |  2 +-
 arch/powerpc/perf/8xx-pmu.c   |  2 +-
 arch/powerpc/perf/core-book3s.c   |  2 +-
 arch/powerpc/platforms/85xx/smp.c |  2 +-
 arch/powerpc/platforms/86xx/mpc86xx_smp.c |  2 +-
 arch/powerpc/platforms/cell/smp.c |  2 +-
 arch/powerpc/platforms/powermac/smp.c |  2 +-
 arch/powerpc/platforms/powernv/idle.c |  2 +-
 arch/powerpc/platforms/powernv/smp.c  |  2 +-
 arch/powerpc/platforms/pseries/smp.c  |  2 +-
 arch/powerpc/xmon/xmon.c  |  2 +-
 arch/riscv/errata/andes/errata.c  |  2 +-
 arch/riscv/errata/sifive/errata.c |  2 +-
 arch/riscv/errata/thead/errata.c  |  2 +-
 .../include/asm/{patch.h => text-patching.h}  |  0
 arch/riscv/include/asm/uprobes.h

[PATCH v7 6/8] x86/module: prepare module loading for ROX allocations of text

2024-10-23 Thread Mike Rapoport
From: "Mike Rapoport (Microsoft)" 

When module text memory is allocated with ROX permissions, the
memory at the actual address where the module will live will contain
invalid instructions, and there will be a writable copy that contains the
actual module code.

Update relocations and alternatives patching to deal with it.

Signed-off-by: Mike Rapoport (Microsoft) 
Tested-by: kdevops 
---
 arch/um/kernel/um_arch.c   |  11 +-
 arch/x86/entry/vdso/vma.c  |   3 +-
 arch/x86/include/asm/alternative.h |  14 +--
 arch/x86/kernel/alternative.c  | 181 +
 arch/x86/kernel/ftrace.c   |  30 ++---
 arch/x86/kernel/module.c   |  45 ---
 6 files changed, 167 insertions(+), 117 deletions(-)

diff --git a/arch/um/kernel/um_arch.c b/arch/um/kernel/um_arch.c
index f8de31a0c5d1..e8e8b54b3037 100644
--- a/arch/um/kernel/um_arch.c
+++ b/arch/um/kernel/um_arch.c
@@ -435,24 +435,25 @@ void __init arch_cpu_finalize_init(void)
os_check_bugs();
 }
 
-void apply_seal_endbr(s32 *start, s32 *end)
+void apply_seal_endbr(s32 *start, s32 *end, struct module *mod)
 {
 }
 
-void apply_retpolines(s32 *start, s32 *end)
+void apply_retpolines(s32 *start, s32 *end, struct module *mod)
 {
 }
 
-void apply_returns(s32 *start, s32 *end)
+void apply_returns(s32 *start, s32 *end, struct module *mod)
 {
 }
 
 void apply_fineibt(s32 *start_retpoline, s32 *end_retpoline,
-  s32 *start_cfi, s32 *end_cfi)
+  s32 *start_cfi, s32 *end_cfi, struct module *mod)
 {
 }
 
-void apply_alternatives(struct alt_instr *start, struct alt_instr *end)
+void apply_alternatives(struct alt_instr *start, struct alt_instr *end,
+   struct module *mod)
 {
 }
 
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index b8fed8b8b9cc..ed21151923c3 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -54,7 +54,8 @@ int __init init_vdso_image(const struct vdso_image *image)
 
apply_alternatives((struct alt_instr *)(image->data + image->alt),
   (struct alt_instr *)(image->data + image->alt +
-   image->alt_len));
+   image->alt_len),
+  NULL);
 
return 0;
 }
diff --git a/arch/x86/include/asm/alternative.h 
b/arch/x86/include/asm/alternative.h
index ca9ae606aab9..dc03a647776d 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -96,16 +96,16 @@ extern struct alt_instr __alt_instructions[], 
__alt_instructions_end[];
  * instructions were patched in already:
  */
 extern int alternatives_patched;
+struct module;
 
 extern void alternative_instructions(void);
-extern void apply_alternatives(struct alt_instr *start, struct alt_instr *end);
-extern void apply_retpolines(s32 *start, s32 *end);
-extern void apply_returns(s32 *start, s32 *end);
-extern void apply_seal_endbr(s32 *start, s32 *end);
+extern void apply_alternatives(struct alt_instr *start, struct alt_instr *end,
+  struct module *mod);
+extern void apply_retpolines(s32 *start, s32 *end, struct module *mod);
+extern void apply_returns(s32 *start, s32 *end, struct module *mod);
+extern void apply_seal_endbr(s32 *start, s32 *end, struct module *mod);
 extern void apply_fineibt(s32 *start_retpoline, s32 *end_retpoine,
- s32 *start_cfi, s32 *end_cfi);
-
-struct module;
+ s32 *start_cfi, s32 *end_cfi, struct module *mod);
 
 struct callthunk_sites {
s32 *call_start, *call_end;
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index d17518ca19b8..3407efc26528 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -392,8 +392,10 @@ EXPORT_SYMBOL(BUG_func);
  * Rewrite the "call BUG_func" replacement to point to the target of the
  * indirect pv_ops call "call *disp(%ip)".
  */
-static int alt_replace_call(u8 *instr, u8 *insn_buff, struct alt_instr *a)
+static int alt_replace_call(u8 *instr, u8 *insn_buff, struct alt_instr *a,
+   struct module *mod)
 {
+   u8 *wr_instr = module_writable_address(mod, instr);
void *target, *bug = &BUG_func;
s32 disp;
 
@@ -403,14 +405,14 @@ static int alt_replace_call(u8 *instr, u8 *insn_buff, 
struct alt_instr *a)
}
 
if (a->instrlen != 6 ||
-   instr[0] != CALL_RIP_REL_OPCODE ||
-   instr[1] != CALL_RIP_REL_MODRM) {
+   wr_instr[0] != CALL_RIP_REL_OPCODE ||
+   wr_instr[1] != CALL_RIP_REL_MODRM) {
pr_err("ALT_FLAG_DIRECT_CALL set for unrecognized indirect 
call\n");
BUG();
}
 
/* Skip CALL_RIP_REL_OPCODE and CALL_RIP_REL_MODRM */
-   disp = *(s32 *)(instr + 2);
+   disp = *(s32 *)(wr_instr + 2);
 #ifdef CONFIG_X86_64
/* ff 15 00 00 00

[PATCH v7 7/8] execmem: add support for cache of large ROX pages

2024-10-23 Thread Mike Rapoport
From: "Mike Rapoport (Microsoft)" 

Using large pages to map text areas reduces iTLB pressure and improves
performance.

Extend execmem_alloc() with an ability to use huge pages with ROX
permissions as a cache for smaller allocations.

To populate the cache, a writable large page is allocated from vmalloc with
VM_ALLOW_HUGE_VMAP, filled with invalid instructions and then remapped as
ROX.

The direct map alias of that large page is excluded from the direct map.

Portions of that large page are handed out to execmem_alloc() callers
without any changes to the permissions.

When the memory is freed with execmem_free() it is invalidated again so
that it won't contain stale instructions.

An architecture has to implement the execmem_fill_trapping_insns() callback
and select the ARCH_HAS_EXECMEM_ROX configuration option to be able to use
the ROX cache.

The cache is enabled on a per-range basis when an architecture sets
EXECMEM_ROX_CACHE flag in definition of an execmem_range.
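
In outline, an architecture opts in along the following lines (a sketch
modeled on the x86 enablement at the end of this series; ARCH_TRAP_INSN and
arch_text_poke_set() are placeholders for the arch's breakpoint pattern and
text-poking helper, and the range bounds are arch specific):

/* arch/<arch>/Kconfig:  select ARCH_HAS_EXECMEM_ROX */

#include <linux/cache.h>
#include <linux/execmem.h>
#include <linux/init.h>
#include <linux/moduleloader.h>
#include <linux/pgtable.h>
#include <linux/string.h>

#define ARCH_TRAP_INSN  0xff    /* placeholder: the arch's trapping byte */

/* placeholder: an arch helper that writes through a temporary mapping */
void arch_text_poke_set(void *addr, int c, size_t len);

static struct execmem_info execmem_info __ro_after_init;

void execmem_fill_trapping_insns(void *ptr, size_t size, bool writable)
{
        if (writable)
                memset(ptr, ARCH_TRAP_INSN, size);
        else
                arch_text_poke_set(ptr, ARCH_TRAP_INSN, size);
}

struct execmem_info __init *execmem_arch_setup(void)
{
        execmem_info = (struct execmem_info){
                .ranges = {
                        [EXECMEM_MODULE_TEXT] = {
                                .flags     = EXECMEM_ROX_CACHE,
                                .start     = MODULES_VADDR,
                                .end       = MODULES_END,
                                .pgprot    = PAGE_KERNEL_ROX,
                                .alignment = MODULE_ALIGN,
                        },
                },
        };
        return &execmem_info;
}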

Signed-off-by: Mike Rapoport (Microsoft) 
Reviewed-by: Luis Chamberlain 
Tested-by: kdevops 
---
 arch/Kconfig|   8 +
 include/linux/execmem.h |  14 ++
 mm/execmem.c| 325 +++-
 mm/internal.h   |   1 +
 mm/vmalloc.c|   5 +
 5 files changed, 345 insertions(+), 8 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 98157b38f5cf..f4f6e170eb7e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1010,6 +1010,14 @@ config ARCH_WANTS_EXECMEM_LATE
  enough entropy for module space randomization, for instance
  arm64.
 
+config ARCH_HAS_EXECMEM_ROX
+   bool
+   depends on MMU && !HIGHMEM
+   help
+ For architectures that support allocations of executable memory
+ with read-only execute permissions. Architecture must implement
+ execmem_fill_trapping_insns() callback to enable this.
+
 config HAVE_IRQ_EXIT_ON_IRQ_STACK
bool
help
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
index dfdf19f8a5e8..1517fa196bf7 100644
--- a/include/linux/execmem.h
+++ b/include/linux/execmem.h
@@ -53,6 +53,20 @@ enum execmem_range_flags {
EXECMEM_ROX_CACHE   = (1 << 1),
 };
 
+#ifdef CONFIG_ARCH_HAS_EXECMEM_ROX
+/**
+ * execmem_fill_trapping_insns - set memory to contain instructions that
+ *  will trap
+ * @ptr:   pointer to memory to fill
+ * @size:  size of the range to fill
+ * @writable:  is the memory pointed to by @ptr writable or ROX
+ *
+ * A hook for architectures to fill execmem ranges with invalid instructions.
+ * Architectures that use EXECMEM_ROX_CACHE must implement this.
+ */
+void execmem_fill_trapping_insns(void *ptr, size_t size, bool writable);
+#endif
+
 /**
  * struct execmem_range - definition of an address space suitable for code and
  *   related data allocations
diff --git a/mm/execmem.c b/mm/execmem.c
index 0f6691e9ffe6..576a57e2161f 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -6,29 +6,41 @@
  * Copyright (C) 2024 Mike Rapoport IBM.
  */
 
+#define pr_fmt(fmt) "execmem: " fmt
+
 #include 
+#include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 
+#include 
+
+#include "internal.h"
+
 static struct execmem_info *execmem_info __ro_after_init;
 static struct execmem_info default_execmem_info __ro_after_init;
 
-static void *__execmem_alloc(struct execmem_range *range, size_t size)
+#ifdef CONFIG_MMU
+static void *execmem_vmalloc(struct execmem_range *range, size_t size,
+pgprot_t pgprot, unsigned long vm_flags)
 {
bool kasan = range->flags & EXECMEM_KASAN_SHADOW;
-   unsigned long vm_flags  = VM_FLUSH_RESET_PERMS;
gfp_t gfp_flags = GFP_KERNEL | __GFP_NOWARN;
+   unsigned int align = range->alignment;
unsigned long start = range->start;
unsigned long end = range->end;
-   unsigned int align = range->alignment;
-   pgprot_t pgprot = range->pgprot;
void *p;
 
if (kasan)
vm_flags |= VM_DEFER_KMEMLEAK;
 
+   if (vm_flags & VM_ALLOW_HUGE_VMAP)
+   align = PMD_SIZE;
+
p = __vmalloc_node_range(size, align, start, end, gfp_flags,
 pgprot, vm_flags, NUMA_NO_NODE,
 __builtin_return_address(0));
@@ -41,7 +53,7 @@ static void *__execmem_alloc(struct execmem_range *range, 
size_t size)
}
 
if (!p) {
-   pr_warn_ratelimited("execmem: unable to allocate memory\n");
+   pr_warn_ratelimited("unable to allocate memory\n");
return NULL;
}
 
@@ -50,14 +62,298 @@ static void *__execmem_alloc(struct execmem_range *range, 
size_t size)
return NULL;
}
 
-   return kasan_reset_tag(p);
+   return p;
 }
+#else
+static void *execmem_vmalloc(struct execmem_range *range, size_t size,
+pgprot_t pgprot, unsigned long vm

[PATCH v7 5/8] arch: introduce set_direct_map_valid_noflush()

2024-10-23 Thread Mike Rapoport
From: "Mike Rapoport (Microsoft)" 

Add an API that will allow updates of the direct/linear map for a set of
physically contiguous pages.

It will be used in the following patches.
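
For illustration, a caller would use it roughly as below (a sketch; the
execmem ROX cache added later in this series is the intended user, the
backing pages are assumed to be physically contiguous, and TLB flushing
remains the caller's job, as the _noflush suffix implies):

#include <linux/mm.h>
#include <linux/pgtable.h>
#include <linux/set_memory.h>
#include <linux/vmalloc.h>
#include <asm/tlbflush.h>

/*
 * Drop a PMD-sized, physically contiguous region from the direct map
 * (@valid == false) or put it back (@valid == true).  @vaddr is a
 * vmalloc address whose backing pages are contiguous, e.g. a huge-page
 * execmem allocation.
 */
static int set_region_direct_map(void *vaddr, bool valid)
{
        struct page *page = vmalloc_to_page(vaddr);
        unsigned long lm = (unsigned long)page_address(page);
        int err;

        err = set_direct_map_valid_noflush(page, PMD_SIZE / PAGE_SIZE, valid);
        if (!err)
                flush_tlb_kernel_range(lm, lm + PMD_SIZE);

        return err;
}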

Signed-off-by: Mike Rapoport (Microsoft) 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Luis Chamberlain 
Tested-by: kdevops 
---
 arch/arm64/include/asm/set_memory.h |  1 +
 arch/arm64/mm/pageattr.c| 10 ++
 arch/loongarch/include/asm/set_memory.h |  1 +
 arch/loongarch/mm/pageattr.c| 19 +++
 arch/riscv/include/asm/set_memory.h |  1 +
 arch/riscv/mm/pageattr.c| 15 +++
 arch/s390/include/asm/set_memory.h  |  1 +
 arch/s390/mm/pageattr.c | 11 +++
 arch/x86/include/asm/set_memory.h   |  1 +
 arch/x86/mm/pat/set_memory.c|  8 
 include/linux/set_memory.h  |  6 ++
 11 files changed, 74 insertions(+)

diff --git a/arch/arm64/include/asm/set_memory.h 
b/arch/arm64/include/asm/set_memory.h
index 917761feeffd..98088c043606 100644
--- a/arch/arm64/include/asm/set_memory.h
+++ b/arch/arm64/include/asm/set_memory.h
@@ -13,6 +13,7 @@ int set_memory_valid(unsigned long addr, int numpages, int 
enable);
 
 int set_direct_map_invalid_noflush(struct page *page);
 int set_direct_map_default_noflush(struct page *page);
+int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid);
 bool kernel_page_present(struct page *page);
 
 #endif /* _ASM_ARM64_SET_MEMORY_H */
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 0e270a1c51e6..01225900293a 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -192,6 +192,16 @@ int set_direct_map_default_noflush(struct page *page)
   PAGE_SIZE, change_page_range, &data);
 }
 
+int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
+{
+   unsigned long addr = (unsigned long)page_address(page);
+
+   if (!can_set_direct_map())
+   return 0;
+
+   return set_memory_valid(addr, nr, valid);
+}
+
 #ifdef CONFIG_DEBUG_PAGEALLOC
 void __kernel_map_pages(struct page *page, int numpages, int enable)
 {
diff --git a/arch/loongarch/include/asm/set_memory.h 
b/arch/loongarch/include/asm/set_memory.h
index d70505b6676c..55dfaefd02c8 100644
--- a/arch/loongarch/include/asm/set_memory.h
+++ b/arch/loongarch/include/asm/set_memory.h
@@ -17,5 +17,6 @@ int set_memory_rw(unsigned long addr, int numpages);
 bool kernel_page_present(struct page *page);
 int set_direct_map_default_noflush(struct page *page);
 int set_direct_map_invalid_noflush(struct page *page);
+int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid);
 
 #endif /* _ASM_LOONGARCH_SET_MEMORY_H */
diff --git a/arch/loongarch/mm/pageattr.c b/arch/loongarch/mm/pageattr.c
index ffd8d76021d4..bf8678248444 100644
--- a/arch/loongarch/mm/pageattr.c
+++ b/arch/loongarch/mm/pageattr.c
@@ -216,3 +216,22 @@ int set_direct_map_invalid_noflush(struct page *page)
 
return __set_memory(addr, 1, __pgprot(0), __pgprot(_PAGE_PRESENT | 
_PAGE_VALID));
 }
+
+int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
+{
+   unsigned long addr = (unsigned long)page_address(page);
+   pgprot_t set, clear;
+
+   if (addr < vm_map_base)
+   return 0;
+
+   if (valid) {
+   set = PAGE_KERNEL;
+   clear = __pgprot(0);
+   } else {
+   set = __pgprot(0);
+   clear = __pgprot(_PAGE_PRESENT | _PAGE_VALID);
+   }
+
+   return __set_memory(addr, 1, set, clear);
+}
diff --git a/arch/riscv/include/asm/set_memory.h 
b/arch/riscv/include/asm/set_memory.h
index ab92fc84e1fc..ea263d3683ef 100644
--- a/arch/riscv/include/asm/set_memory.h
+++ b/arch/riscv/include/asm/set_memory.h
@@ -42,6 +42,7 @@ static inline int set_kernel_memory(char *startp, char *endp,
 
 int set_direct_map_invalid_noflush(struct page *page);
 int set_direct_map_default_noflush(struct page *page);
+int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid);
 bool kernel_page_present(struct page *page);
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
index 271d01a5ba4d..d815448758a1 100644
--- a/arch/riscv/mm/pageattr.c
+++ b/arch/riscv/mm/pageattr.c
@@ -386,6 +386,21 @@ int set_direct_map_default_noflush(struct page *page)
PAGE_KERNEL, __pgprot(_PAGE_EXEC));
 }
 
+int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
+{
+   pgprot_t set, clear;
+
+   if (valid) {
+   set = PAGE_KERNEL;
+   clear = __pgprot(_PAGE_EXEC);
+   } else {
+   set = __pgprot(0);
+   clear = __pgprot(_PAGE_PRESENT);
+   }
+
+   return __set_memory((unsigned long)page_address(page), nr, set, clear);
+}
+
 #ifdef CONFIG_DEBUG_PAGEALLOC
 static int debug_pagealloc_set_pa

Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-10-23 Thread Liam R. Howlett
* Steven Price  [241023 05:31]:
> >>   * Box64 seems to have a custom allocator based on reading 
> >> /proc/self/maps to allocate a block of VA space with a low enough 
> >> address [1]
> >>
> >>   * PHP has code reading /proc/self/maps - I think this is to find a 
> >> segment which is close enough to the text segment [2]
> >>
> >>   * FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48 bit
> >> addresses [3][4]
> > 
> > Can't the limited number of applications that need to restrict the upper
> > bound use an LD_PRELOAD compatible library to do this?
> 
> I'm not entirely sure what point you are making here. Yes an LD_PRELOAD
> approach could be used instead of a personality type as a 'hack' to
> preallocate the upper address space. The obvious disadvantage is that
> you can't (easily) layer LD_PRELOAD so it won't work in the general case.

My point is that riscv could work around the limited number of
applications that require this.  It's not really viable for you.

> 
> >>
> >>   * pmdk has some funky code to find the lowest address that meets 
> >> certain requirements - this does look like an ASLR alternative and 
> >> probably couldn't directly use MAP_BELOW_HINT, although maybe this 
> >> suggests we need a mechanism to map without a VA-range? [5]
> >>
> >>   * MIT-Scheme parses /proc/self/maps to find the lowest mapping within 
> >> a range [6]
> >>
> >>   * LuaJIT uses an approach to 'probe' to find a suitable low address 
> >> for allocation [7]
> >>
> > 
> > Although I did not take a deep dive into each example above, there are
> > some very odd things being done; we will never cover all the use cases
> > with an exact API match.  What we have today can be made to work for
> > these users as they have figured ways to do it.
> > 
> > Are they pretty? no.  Are they common? no.  I'm not sure it's worth
> > plumbing in new MM code for these users.
> 
> My issue with the existing 'solutions' is that they all seem to be fragile:
> 
>  * Using /proc/self/maps is inherently racy if there could be any other
> code running in the process at the same time.

Yes, it is not thread safe.  Parsing text is also undesirable.

> 
>  * Attempting to map the upper part of the address space only works if
> done early enough - once an allocation arrives there, there's very
> little you can robustly do (because the stray allocation might be freed).
> 
>  * LuaJIT's probing mechanism is probably robust, but it's inefficient -
> LuaJIT has a fallback of linear probing, followed by no hint (ASLR),
> followed by pseudo-random probing. I don't know the history of the code,
> but it looks like it's probably been tweaked to try to avoid performance
> issues.
> 
> >> The biggest benefit I see of MAP_BELOW_HINT is that it would allow a
> >> library to get low addresses without causing any problems for the rest
> >> of the application. The use case I'm looking at is in a library and 
> >> therefore a personality mode wouldn't be appropriate (because I don't 
> >> want to affect the rest of the application). Reading /proc/self/maps
> >> is also problematic because other threads could be allocating/freeing
> >> at the same time.
> > 
> > As long as you don't exhaust the lower limit you are trying to allocate
> > within - which is exactly the issue riscv is hitting.
> 
> Obviously if you actually exhaust the lower limit then any
> MAP_BELOW_HINT API would also fail - there's really not much that can be
> done in that case.

Today we reverse the search, so you end up at the higher addresses
(bottom-up vs top-down) - although the direction is arch dependent.

If the allocation is too high/low then you could detect, free, and
handle the failure.

> 
> > I understand that you are providing examples to prove that this is
> > needed, but I feel like you are better demonstrating the flexibility
> > exists to implement solutions in different ways using todays API.
> 
> My intention is to show that today's API doesn't provide a robust way of
> doing this. Although I'm quite happy if you can point me at a robust way
> with the current API. As I mentioned my goal is to be able to map memory
> in a (multithreaded) library with a (ideally configurable) number of VA
> bits. I don't particularly want to restrict the whole process, just
> specific allocations.

If you don't need to restrict everything, won't the hint work for your
use case?  I must be missing something from your requirements.

> 
> I had typed up a series similar to this one as a MAP_BELOW flag would
> fit my use-case well.
> 
> > I think it would be best to use the existing methods and work around the
> > issue that was created in riscv while future changes could mirror amd64
> > and arm64.
> 
> The riscv issue is a different issue to the one I'm trying to solve. I
> agree MAP_BELOW_HINT isn't a great fix for that because we already have
> differences between amd64 and arm64 and obviously no software currently
> out there uses this

[PATCH v7 8/8] x86/module: enable ROX caches for module text on 64 bit

2024-10-23 Thread Mike Rapoport
From: "Mike Rapoport (Microsoft)" 

Enable execmem's cache of PMD_SIZE'ed pages mapped as ROX for module
text allocations on 64 bit.

Signed-off-by: Mike Rapoport (Microsoft) 
Reviewed-by: Luis Chamberlain 
Tested-by: kdevops 
---
 arch/x86/Kconfig   |  1 +
 arch/x86/mm/init.c | 37 -
 2 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2852fcd82cbd..ff71d18253ba 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -83,6 +83,7 @@ config X86
select ARCH_HAS_DMA_OPS if GART_IOMMU || XEN
select ARCH_HAS_EARLY_DEBUG if KGDB
select ARCH_HAS_ELF_RANDOMIZE
+   select ARCH_HAS_EXECMEM_ROX if X86_64
select ARCH_HAS_FAST_MULTIPLIER
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_GCOV_PROFILE_ALL
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index eb503f53c319..c2e4f389f47f 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -1053,18 +1053,53 @@ unsigned long arch_max_swapfile_size(void)
 #ifdef CONFIG_EXECMEM
 static struct execmem_info execmem_info __ro_after_init;
 
+#ifdef CONFIG_ARCH_HAS_EXECMEM_ROX
+void execmem_fill_trapping_insns(void *ptr, size_t size, bool writeable)
+{
+   /* fill memory with INT3 instructions */
+   if (writeable)
+   memset(ptr, INT3_INSN_OPCODE, size);
+   else
+   text_poke_set(ptr, INT3_INSN_OPCODE, size);
+}
+#endif
+
 struct execmem_info __init *execmem_arch_setup(void)
 {
unsigned long start, offset = 0;
+   enum execmem_range_flags flags;
+   pgprot_t pgprot;
 
if (kaslr_enabled())
offset = get_random_u32_inclusive(1, 1024) * PAGE_SIZE;
 
start = MODULES_VADDR + offset;
 
+   if (IS_ENABLED(CONFIG_ARCH_HAS_EXECMEM_ROX)) {
+   pgprot = PAGE_KERNEL_ROX;
+   flags = EXECMEM_KASAN_SHADOW | EXECMEM_ROX_CACHE;
+   } else {
+   pgprot = PAGE_KERNEL;
+   flags = EXECMEM_KASAN_SHADOW;
+   }
+
execmem_info = (struct execmem_info){
.ranges = {
-   [EXECMEM_DEFAULT] = {
+   [EXECMEM_MODULE_TEXT] = {
+   .flags  = flags,
+   .start  = start,
+   .end= MODULES_END,
+   .pgprot = pgprot,
+   .alignment = MODULE_ALIGN,
+   },
+   [EXECMEM_KPROBES ... EXECMEM_BPF] = {
+   .flags  = EXECMEM_KASAN_SHADOW,
+   .start  = start,
+   .end= MODULES_END,
+   .pgprot = PAGE_KERNEL,
+   .alignment = MODULE_ALIGN,
+   },
+   [EXECMEM_MODULE_DATA] = {
.flags  = EXECMEM_KASAN_SHADOW,
.start  = start,
.end= MODULES_END,
-- 
2.43.0

