Re: TU Application - tpkessler

Filipe Laíns Sat, 05 Nov 2022 18:53:27 -0700

On Wed, 2022-10-26 at 06:30 +0000, Torsten Keßler wrote:
> Hi! I'm Torsten Keßler (tpkessler on the AUR and on GitHub) from Saarland, a
> federal state in the southwest of Germany. With this email
> I'm applying to become a trusted user.
> After graduating with a PhD in applied mathematics this year I'm now
> a post-doc with a focus on numerical analysis, the art of solving physical
> problems with mathematically sound algorithms on a computer.
> I've been using Arch Linux on my private machines (and at work) since my
> first weeks at university ten years ago. After some initial distro hopping, a
> friend recommended Arch. I immediately liked the way it handles packages
> via pacman, its wiki and the flexibility of its installation process.
> 
> Owing to their massively parallel architecture, GPUs have emerged as the
> leading platform for computationally expensive problems: Machine
> Learning/AI, real-world engineering problems, simulation of complex
> physical systems. For a long time, NVIDIA's CUDA framework (closed
> source, exclusively for their GPUs) has dominated this field. In 2015,
> AMD announced ROCm, their open source compute framework for GPUs. Its
> CUDA-like interface, called HIP, makes it possible to write code
> that compiles and runs on both AMD and NVIDIA hardware. I've been
> closely following the development of ROCm on GitHub, trying to compile
> the stack from time to time. But only since 2020 has the kernel included
> all the code necessary to build the ROCm stack on Arch Linux. Around
> that time I started to contribute to rocm-arch on GitHub, a
> collection of PKGBUILDs for ROCm (around 50 packages). Soon after
> that, I became the main contributor to the repository and, since 2021,
> I've been the maintainer of the whole ROCm stack.
> 
> We have an active issue tracker and recently started a discussion page
> for rocm-arch. Most of the currently open issues keep track of the
> patches we apply to run ROCm on Arch Linux. Many of them link to
> an upstream issue and a corresponding pull request that fixes the
> problem. This way I've already contributed code to a couple of libraries
> of the ROCm stack.
> 
> Over the years, many libraries have added official support for ROCm,
> including tensorflow, pytorch, python-cupy, python-numba (not actively
> maintained anymore) and blender. ROCm support in the latter
> generated a lot of interest in the community and is one reason Sven
> contacted me, asking whether I would be interested in taking care of ROCm in
> [community]. In its current version, ROCm support for blender works out
> of the box. Just install hip-runtime-amd from the AUR and enable the HIP
> backend in blender's settings for rendering. The machine learning
> libraries require more dependencies from the AUR. Once installed,
> pytorch and tensorflow are known to work on Vega GPUs and the recent
> RDNA architecture.
> 
> My first action as a TU would be to add basic ROCm support to
> [community], i.e. the low-level libraries, including HIP and an open
> source OpenCL runtime based on ROCm. That would be enough to run
> blender with its ROCm backend. At the same time, I would expand the wiki
> article on ROCm. The interaction with the community would also move from
> the issue tracker of rocm-arch to the Arch Linux bug tracker and the
> forums. In a second phase, I would add the high-level libraries that
> would enable users to quickly compile and run complex libraries such as
> tensorflow, pytorch or cupy.


Huge +1 from me here. It would be awesome to bring ROCm to the official repos. I
have not done it myself, as I am currently split between tons of projects, which
makes it hard to find the time for the initial work and then commit to
maintaining the stack, so I am very excited to have someone take this item off
my endless TODO list!

> #BEGIN Technical details
> 
> The minimal package list for HIP, which includes the runtime libraries
> for basic GPU programming and the GPU compiler (hipcc), comprises eight
> packages:
> 
> * rocm-cmake (basic cmake files for ROCm)
> * rocm-llvm (upstream llvm with to-be-merged changes by AMD)
> * rocm-device-libs (implements math functions for all GPU architectures)
> * comgr (runtime library, "compiler support" for rocm-llvm)
> * hsakmt-roct (interface to the amdgpu kernel driver)
> * hsa-rocr (runtime for HSA compute kernels)
> * rocminfo (display information on HSA agents: GPU and possibly CPU)
> * hip-runtime-amd (runtime and compiler for HIP, a C++ dialect inspired
> by CUDA C++)
> 
> All but rocm-llvm are small libraries under the permissive MIT license.
> Since ROCm 5.2, all packages successfully build in a clean chroot and
> are distributed in the community repo arch4edu.
> 
> The application libraries for numerical linear algebra, sparse matrices
> or random numbers have names starting with roc or hip (rocblas, rocsparse,
> rocrand). The hip* packages are designed to also work with CUDA when HIP
> is configured with the CUDA backend instead of the ROCm/HSA one.
> With few exceptions (rocthrust, rccl), these packages are licensed under MIT.
> 
> Possible issues:
> There are three packages that either do not fully work under Arch Linux or
> lack an open source license. The first is rocm-gdb, a fork of gdb with
> GPU support. To work properly, it needs a kernel module that is currently
> not available in upstream Linux but only as part of AMD's DKMS modules,
> which in turn only work with specific kernel versions. I dropped support
> for it on Arch Linux a while ago. The second, hsa-amd-aqlprofile, is
> closed source. As the name suggests, it is used for profiling as
> part of rocprofiler. These two packages are only required for
> debugging and profiling; they are not runtime dependencies of the big
> machine learning libraries or of any other package with ROCm support I'm
> aware of. The third package is rocm-core, which is only part of the
> ROCm meta packages and has no influence on the ROCm runtime. It
> provides a single header and a library with a single function that
> returns the current ROCm version. AMD has not published its source code
> so far, and the official package lacks a license file.
> 
> A second issue is GPU support. AMD officially supports only the
> professional compute GPUs. This does not mean that ROCm does not work
> on consumer cards, merely that AMD cannot guarantee all
> functionality through extensive testing. Recently, ROCm added support
> for Navi 21 (RX 6800 onwards); see
> 
> https://docs.amd.com/bundle/Hardware_and_Software_Reference_Guide/page/Hardware_and_Software_Support.html
> 
> I own a Vega 56 (gfx900) that is officially supported, so I can test all
> packages before publishing them on the AUR / in [community].

I own an RX 5700 XT (gfx1010), if specific testing is required.

> #END Technical details
> 
> In the long term, I would like to establish Arch Linux as the leading
> platform for scientific computing. This includes Machine Learning
> libraries in the official repositories as well as packages for classical
> "number crunching" such as petsc, trilinos and packages that depend on
> them: deal-ii, dune or ngsolve.

+1 on this too. My day job is supporting the Python scientific computing / data
science ecosystem, with a focus on packaging, so I am looking forward to this,
and helping out where I can.

> The sponsors of my application are Sven (svenstaro) and Bruno (archange).
> 
> I'm looking forward to the upcoming discussion and your feedback on
> my application.
> 
> Best,
> Torsten

That said, I skimmed Torsten's PKGBUILDs, and the only thing I noticed was the
missing -DCMAKE_BUILD_TYPE=None argument in the CMake packages, contrary to the
recommendations in [1], which I wouldn't consider a big deal anyway. So no
roast from me, despite Sven's expectations :P

Overall, I am very happy we have someone interested in working on ROCm support
in the official repos, and am looking forward to working with Torsten.

+1 on the candidate for me!

[1] https://wiki.archlinux.org/title/CMake_package_guidelines

Cheers,
Filipe Laíns
