> How can we/I compile the nvidia-peermem driver with Mellanox
     ib_peer_mem symbols?

   Probably not a problem of the nvidia-peermem module but of the kernel 
   (or a third-party module) that needs to provide these symbols.

First, I upgraded to Debian 12.6, so I am now running nvidia-driver 535.183.01.

I then installed doca-ofed.

   
https://developer.nvidia.com/doca-downloads?deployment_platform=Host-Server&deployment_package=DOCA-Host&target_os=Linux&Architecture=x86_64&Profile=doca-ofed&Distribution=Debian&version=12.1&installer_type=deb_online

(I'll spare you the details. It did not install out of the box. If you
wish, I can tell you exactly what I did to install it.)

doca-ofed (aka MLNX_OFED) provides the necessary symbols.

Then I did this:

   dkms --force build -m nvidia-current -v 535.183.01
   dkms --force install -m nvidia-current -v 535.183.01

I did this to attempt to get nvidia-peermem to work. Indeed, now I get
a different error message:

   root@vuku:~# modprobe nvidia-peermem
   modprobe: ERROR: could not insert 'nvidia_current_peermem': Unknown symbol in module, or unknown parameter (see dmesg)
   modprobe: ERROR: ../libkmod/libkmod-module.c:1047 command_do() Error running install command 'modprobe nvidia ; modprobe -i nvidia-current-peermem ' for module nvidia_peermem: retcode 1
   modprobe: ERROR: could not insert 'nvidia_peermem': Invalid argument
   root@vuku:~# dmesg|tail
   [161810.093263] Compat-mlnx-ofed backport release: 91fb8cd
   [161810.093272] Backport based on https://:@git-nbu.nvidia.com/r/a/mlnx_ofed/mlnx-ofa_kernel-4.0.git 91fb8cd
   [161810.093274] compat.git: https://:@git-nbu.nvidia.com/r/a/mlnx_ofed/mlnx-ofa_kernel-4.0.git
   [161810.342539] nvidia_peermem: Unknown symbol ib_register_peer_memory_client (err -2)
   [161810.342554] nvidia_peermem: Unknown symbol ib_unregister_peer_memory_client (err -2)
   root@vuku:~# 

I think I am getting close to getting nvidia-peermem to work.

   /usr/src/nvidia-current-535.183.01/nvidia-peermem/nvidia-peermem.Kbuild

has

   OFA_DIR := /usr/src/ofa_kernel
   OFA_CANDIDATES = $(OFA_DIR)/$(OFA_ARCH)/$(KERNELRELEASE) $(OFA_DIR)/$(KERNELRELEASE) $(OFA_DIR)/default /var/lib/dkms/mlnx-ofed-kernel

And three of the four candidate directories exist:

   qobi@vuku>ls /usr/src/ofa_kernel/x86_64/6.1.0-22-amd64/
   compat/           compat_base_tree_version  configure@           Module.symvers
   compat_base       compat.config             configure.mk.kernel  ofed_scripts/
   compat_base_tree  compat_version            include/
   qobi@vuku>ls /usr/src/ofa_kernel/6.1.0-22-amd64/
   ls: cannot access '/usr/src/ofa_kernel/6.1.0-22-amd64/': No such file or directory
   qobi@vuku>ls /usr/src/ofa_kernel/default
   /usr/src/ofa_kernel/default@
   qobi@vuku>ls /usr/src/ofa_kernel/default/
   compat/           compat_base_tree_version  configure@           Module.symvers
   compat_base       compat.config             configure.mk.kernel  ofed_scripts/
   compat_base_tree  compat_version            include/
   qobi@vuku>ls /var/lib/dkms/mlnx-ofed-kernel
   24.04.OFED.24.04.0.7.0.1/  kernel-6.1.0-22-amd64-x86_64@
   qobi@vuku>
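
To see which of those candidates a build would most likely use, the search can
be imitated in the shell. This is only a sketch: it assumes the Kbuild takes
the first candidate that exists as a directory, which I have not confirmed
from the Kbuild itself, and first_existing_dir is a made-up helper name.

```shell
#!/bin/sh
# first_existing_dir DIR...: print the first argument that is an existing
# directory and return 0; return 1 if none of them exists.
# Assumption: this mirrors how OFA_CANDIDATES is searched (first hit wins).
first_existing_dir() {
    for d in "$@"; do
        if [ -d "$d" ]; then
            printf '%s\n' "$d"
            return 0
        fi
    done
    return 1
}
```

Called with the four candidates in the same order, and with OFA_ARCH assumed
to be x86_64 as in the listing above:

   first_existing_dir /usr/src/ofa_kernel/x86_64/$(uname -r) /usr/src/ofa_kernel/$(uname -r) /usr/src/ofa_kernel/default /var/lib/dkms/mlnx-ofed-kernel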

There appear to be two different Module.symvers files, but they appear to be identical:

   qobi@vuku>ls -l /usr/src/ofa_kernel/x86_64/6.1.0-22-amd64/Module.symvers
   -rw-r--r-- 1 root root 92655 Jul  3 15:49 /usr/src/ofa_kernel/x86_64/6.1.0-22-amd64/Module.symvers
   qobi@vuku>ls -l /usr/src/ofa_kernel/default/Module.symvers
   -rw-r--r-- 1 root root 92655 Jul  3 15:49 /usr/src/ofa_kernel/default/Module.symvers
   qobi@vuku>diff /usr/src/ofa_kernel/x86_64/6.1.0-22-amd64/Module.symvers /usr/src/ofa_kernel/default/Module.symvers
   qobi@vuku>

And they appear to have the requisite symbols:

   qobi@vuku>fgrep ib_register_peer_memory_client /usr/src/ofa_kernel/x86_64/6.1.0-22-amd64/Module.symvers
   0xaba78e45        ib_register_peer_memory_client  /var/lib/dkms/mlnx-ofed-kernel/24.04.OFED.24.04.0.7.0.1/build/drivers/infiniband/core/ib_core   EXPORT_SYMBOL
   qobi@vuku>fgrep ib_unregister_peer_memory_client /usr/src/ofa_kernel/x86_64/6.1.0-22-amd64/Module.symvers
   0xbde5c050    ib_unregister_peer_memory_client        /var/lib/dkms/mlnx-ofed-kernel/24.04.OFED.24.04.0.7.0.1/build/drivers/infiniband/core/ib_core   EXPORT_SYMBOL
   qobi@vuku>
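
The first column of those lines is the symbol's CRC. To rule out a version
mismatch, that CRC could be compared with the one recorded in the built
module. A minimal sketch, assuming the usual Module.symvers layout (CRC in
field 1, symbol name in field 2) and a kernel built with module versioning;
symvers_crc is a made-up helper name.

```shell
#!/bin/sh
# symvers_crc FILE SYMBOL: print the CRC that FILE (a Module.symvers file)
# records for SYMBOL. Returns non-zero if SYMBOL is not listed.
# Assumption: fields are whitespace-separated, CRC first, symbol name second.
symvers_crc() {
    awk -v sym="$2" '$2 == sym { print $1; found = 1 } END { exit !found }' "$1"
}
```

For example, the two values printed by

   symvers_crc /usr/src/ofa_kernel/default/Module.symvers ib_register_peer_memory_client
   /sbin/modprobe --dump-modversions /usr/lib/modules/6.1.0-22-amd64/updates/dkms/nvidia-current-peermem.ko | fgrep ib_register_peer_memory_client

should agree. As far as I know a CRC mismatch is normally reported in dmesg as
"disagrees about version of symbol" rather than "Unknown symbol", so this is
mostly a way to exclude that case.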

And the module appears to contain those symbol names:

   qobi@vuku>fgrep ib_register_peer_memory_client /usr/lib/modules/6.1.0-22-amd64/updates/dkms/nvidia-current-peermem.ko
   grep: /usr/lib/modules/6.1.0-22-amd64/updates/dkms/nvidia-current-peermem.ko: binary file matches
   qobi@vuku>strings /usr/lib/modules/6.1.0-22-amd64/updates/dkms/nvidia-current-peermem.ko|fgrep ib_register_peer_memory_client
   ib_register_peer_memory_client
   ib_register_peer_memory_client
   qobi@vuku>

So I don't know why the module doesn't load.
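
One thing not checked above is whether the running kernel actually provides
the symbol. If I read the dmesg lines right, "Unknown symbol ... (err -2)"
means the loader could not find it among the currently loaded modules, which
is a separate question from what Module.symvers lists at build time. A sketch
of that check, assuming the usual /proc/kallsyms layout "address type name
[module]"; kernel_has_symbol is a made-up helper name.

```shell
#!/bin/sh
# kernel_has_symbol SYMBOL [FILE]: print the lines of FILE whose symbol-name
# field equals SYMBOL and return 0; return non-zero if there is no such line.
# FILE defaults to /proc/kallsyms (assumed layout: address type name [module]).
kernel_has_symbol() {
    awk -v sym="$1" '$3 == sym { print; found = 1 } END { exit !found }' \
        "${2:-/proc/kallsyms}"
}
```

For example:

   kernel_has_symbol ib_register_peer_memory_client
   /sbin/modinfo -n ib_core

If the first command prints nothing while ib_core is loaded, the loaded
ib_core does not provide the symbol. The second shows which ib_core.ko
modprobe would pick; a path under kernel/drivers/infiniband rather than under
updates/dkms would suggest the in-tree module is being used instead of the
doca-ofed one, but that is only a guess.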

Any ideas?

    Thanks,
    Jeff (http://engineering.purdue.edu/~qobi)
