Hi Justin!

Some ROC repositories include documentation (cmake, device libs, hip), maybe it 
would make sense to include those in `/usr/share/doc/${pkgname}`?
That's a very good idea. For some packages AMD bundles the documentation with the package itself (rocm-dbgapi); for others it's shipped separately, see hip-doc [1].
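As a rough sketch of what this could look like in a PKGBUILD: the fragment below assumes the upstream repository keeps its documentation in a `docs/` subdirectory, which will differ from package to package.

```shell
# Hypothetical package() fragment: ship bundled upstream docs in
# /usr/share/doc/${pkgname}. The docs/ path is an assumption and
# would need to be adapted per repository.
package() {
  cd "${srcdir}/${pkgname}-${pkgver}"
  install -d "${pkgdir}/usr/share/doc/${pkgname}"
  cp -r docs/* "${pkgdir}/usr/share/doc/${pkgname}/"
}
```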

The limited support of ROCm has been one of the main things locking me into 
Nvidia for my workstations.
Yes, that's really the main drawback of ROCm. CUDA works on almost any Nvidia GPU (even on mobile variants). I hope AMD will change their policy with Navi 30+.

Have you tried contacting AMD about `rocm-core`?
Others already did. AMD support promised to release the source code in March [2].

Finding information about ROCm support in consumer cards really isn't easy – 
but I guess with CUDA I just expect it to work with recent Nvidia cards?
Do you mean the common HIP abstraction layer (hipfft, hipblas, ...)? Yes, that should work with any recent CUDA version, though I haven't tried it myself as I don't have access to an Nvidia GPU. Moreover, this feature (HIP on top of CUDA) has never been requested by the community at rocm-arch. I think Nvidia users simply stick with CUDA and don't need HIP.
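For reference, switching backends is done at compile time. This is an untested sketch; `vectoradd.cpp` is a placeholder for any HIP source file, and the nvidia target requires a CUDA toolchain to be installed.

```shell
# Sketch: HIP_PLATFORM selects the hipcc compilation backend.
# "amd" needs the ROCm stack, "nvidia" a CUDA installation (nvcc).
if command -v hipcc >/dev/null 2>&1; then
    HIP_PLATFORM=amd    hipcc vectoradd.cpp -o vectoradd_amd
    HIP_PLATFORM=nvidia hipcc vectoradd.cpp -o vectoradd_nvidia
else
    echo "hipcc not installed"
fi
```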

Maybe it would be a good idea to provide testing scripts / documents for them, 
so they can report back once you push things into testing?
Absolutely! There are HIP examples [3] from AMD that check basic HIP language features. Additionally, we have `rocm-validation-suite`, which offers several tests.
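A hypothetical sketch of such a test-report helper for TUs: it collects basic system info, probes the usual ROCm tools, and prints a report that can be pasted back to the list. Tool names are the real ROCm binaries; everything else (output format, line limits) is illustrative.

```shell
#!/bin/bash
# Collect a short ROCm test report. Degrades gracefully when a tool
# is not installed, so it can be run on any machine.
echo "== ROCm test report =="
echo "kernel: $(uname -r)"
for tool in rocminfo clinfo rocm-smi; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "-- $tool --"
        "$tool" 2>&1 | head -n 20
    else
        echo "-- $tool: not installed --"
    fi
done
```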

Having a list of tested cards in the wiki would be great as well.
I agree! Once we have an established test suite, this should be straightforward.

Best!
Torsten

[1] http://repo.radeon.com/rocm/apt/5.3/pool/main/h/hip-doc/
[2] https://github.com/RadeonOpenCompute/ROCm/issues/1705#issuecomment-1081599282
[3] https://github.com/ROCm-Developer-Tools/HIP-Examples

Am 06.11.22 um 23:10 schrieb aur-general-requ...@lists.archlinux.org:
Send Aur-general mailing list submissions to
        aur-general@lists.archlinux.org


Today's Topics:

    1. Re: TU Application - tpkessler (Justin Kromlinger)
    2. Re: TU Application - tpkessler (Torsten Keßler)
    3. Re: TU Application - tpkessler (Filipe Laíns)


----------------------------------------------------------------------

Message: 1
Date: Sun, 6 Nov 2022 20:01:14 +0100
From: Justin Kromlinger <hashwo...@archlinux.org>
Subject: Re: TU Application - tpkessler
To: aur-general@lists.archlinux.org
Message-ID: <20221106200114.43840...@maker.hashworks.net>

Hi Torsten!

On Wed, 26 Oct 2022 06:30:33 +0000
Torsten Keßler <t.kess...@posteo.de> wrote:

Hi! I'm Torsten Keßler (tpkessler in AUR and on GitHub) from Saarland, a
federal state in the south west of Germany. With this email
I'm applying to become a trusted user.
After graduating with a PhD in applied mathematics this year I'm now
a post-doc with a focus on numerical analysis, the art of solving physical
problems with mathematically sound algorithms on a computer.
I've been using Arch Linux on my private machines (and at work) since my
first weeks at university ten years ago. After initial distro hopping a
friend recommended Arch. I immediately liked the way it handles packages
via pacman, its wiki and the flexibility of its installation process.
Soon we can switch the Arch Linux IRC main language to German!

Owing to their massively parallel architecture, GPUs have emerged as the
leading platform for computationally expensive problems: Machine
Learning/AI, real-world engineering problems, simulation of complex
physical systems. For a long time, nVidia's CUDA framework (closed
source, exclusively for their GPUs) has dominated this field. In 2015,
AMD announced ROCm, their open source compute framework for GPUs. A
common interface to CUDA, called HIP, makes it possible to write code
that compiles and runs both on AMD and nVidia hardware. I've been
closely following the development of ROCm on GitHub, trying to compile
the stack from time to time. But only since 2020 has the kernel included
all the code necessary to compile the ROCm stack on Arch Linux. Around
this time I've started to contribute to rocm-arch on GitHub, a
collection of PKGBUILDs for ROCm (with around 50 packages). Soon after
that, I became the main contributor to the repository and, since 2021,
I've been the maintainer of the whole ROCm stack.
We have an active issue tracker and recently started a discussion page
for rocm-arch. Most of the open issues as of now are for bookkeeping of
patches we applied to run ROCm on Arch Linux. Many of them are linked to
an upstream issue and a corresponding pull request that fixes the
issue. This way I've already contributed code to a couple of libraries
of the ROCm stack.

Over the years, many libraries have added official support for ROCm,
including tensorflow, pytorch, python-cupy, python-numba (not actively
maintained anymore) and blender. ROCm support for the latter
generated large interest in the community and is one reason Sven
contacted me, asking whether I would be interested in taking care of ROCm in
[community]. In its current version, ROCm support for blender works out
of the box. Just install hip-runtime-amd from the AUR and enable the HIP
backend in blender's settings for rendering. The machine learning
libraries require more dependencies from the AUR. Once installed,
pytorch and tensorflow are known to work on Vega GPUs and the recent
RDNA architecture.

My first action as a TU would be to add basic support of ROCm to
[community], i.e. the low level libraries, including HIP and an open
source runtime for OpenCL based on ROCm. That would be enough to run
blender with its ROCm backend. At the same time, I would expand the wiki
article on ROCm. The interaction with the community would also move from
the issue tracker of rocm-arch to the Arch Linux bug tracker and the
forums. In a second phase I would add the high level libraries that
would enable users to quickly compile and run complex libraries such as
tensorflow, pytorch or cupy.
The limited support of ROCm has been one of the main things locking me into 
Nvidia for my
workstations. Having stuff in community would certainly help with that!

#BEGIN Technical details

The minimal package list for HIP, which includes the runtime libraries
for basic GPU programming and the GPU compiler (hipcc), comprises eight
packages:

* rocm-cmake (basic cmake files for ROCm)
* rocm-llvm (upstream llvm with to-be-merged changes by AMD)
* rocm-device-libs (implements math functions for all GPU architectures)
* comgr (runtime library, "compiler support" for rocm-llvm)
* hsakmt-roct (interface to the amdgpu kernel driver)
* hsa-rocr (runtime for HSA compute kernels)
* rocminfo (display information on HSA agents: GPU and possibly CPU)
* hip-runtime-amd (runtime and compiler for HIP, a C++ dialect inspired
by CUDA C++)
PKGBUILDs look good to me. Some ROC repositories include documentation (cmake, 
device libs, hip),
maybe it would make sense to include those in `/usr/share/doc/${pkgname}`?
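The minimal stack above can be smoke-tested once installed. The script below is a hypothetical sketch: it writes a trivial HIP program and compiles it with hipcc; the file path and the empty kernel are purely illustrative, and it bails out cleanly if hipcc is missing.

```shell
#!/bin/bash
# Smoke test for the minimal HIP stack: compile and run a trivial
# kernel to check that compiler and runtime are reachable.
if ! command -v hipcc >/dev/null 2>&1; then
    echo "hipcc not installed; install hip-runtime-amd first"
    exit 0
fi
cat > /tmp/hello_hip.cpp <<'EOF'
#include <hip/hip_runtime.h>
#include <cstdio>
__global__ void noop() {}
int main() {
    // Launch an empty kernel and wait for it to finish.
    hipLaunchKernelGGL(noop, dim3(1), dim3(1), 0, 0);
    printf("runtime reachable: %s\n",
           hipDeviceSynchronize() == hipSuccess ? "yes" : "no");
    return 0;
}
EOF
hipcc /tmp/hello_hip.cpp -o /tmp/hello_hip && /tmp/hello_hip
```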

All but rocm-llvm are small libraries under the permissive MIT license.
Since ROCm 5.2, all packages successfully build in a clean chroot and
are distributed in the unofficial arch4edu repository.

The application libraries for numerical linear algebra, sparse matrices
or random numbers are prefixed with roc or hip (rocblas, rocsparse, rocrand).
The hip* packages are designed in such a way that they also work
with CUDA if HIP is configured with a CUDA backend instead of ROCm/HSA.
With few exceptions (rocthrust, rccl), these packages are licensed under MIT.

Possible issues:
There are three packages that are not fully working under Arch Linux or
lack an open source license. The first is rocm-gdb, a fork of gdb with
GPU support. To work properly it needs a kernel module currently not
available in upstream linux but only as part of AMD's dkms modules. But
they only work with specific kernel versions. Support for this from my
side on Arch Linux was dropped a while ago. One closed source package is
hsa-amd-aqlprofile. As the name suggests it is used for profiling as
part of rocprofiler. The above-mentioned packages are only required for
debugging and profiling and are not runtime dependencies of the big
machine learning libraries or any other package with ROCm support I'm
aware of. The third package is rocm-core, which is only part of the
ROCm meta packages and has no influence on the ROCm runtime. It
provides a single header and a library with a single function that
returns the current ROCm version. No source code has been published by
AMD so far and the official package lacks a license file.
Have you tried contacting AMD about `rocm-core`? It seems odd to keep such a 
small thing closed
source / without a license.

A second issue is GPU support. AMD officially supports only the
professional compute GPUs. This does not mean that ROCm does not work
on consumer cards but merely that AMD cannot guarantee all
functionality through extensive testing. Recently, ROCm added support
for Navi 21 (RX 6800 onwards), see

https://docs.amd.com/bundle/Hardware_and_Software_Reference_Guide/page/Hardware_and_Software_Support.html

I own a Vega 56 (gfx900) that is officially supported, so I can test all
packages before publishing them on the AUR / in [community].
Finding information about ROCm support in consumer cards really isn't easy – 
but I guess with CUDA
I just expect it to work with recent Nvidia cards?

I would guess that we have a bunch of TUs with Radeon RX 5000/6000 (and soon 
7000) series cards,
but without the needed knowledge / use case for ROCm. Maybe it would be a good 
idea to provide
testing scripts / documents for them, so they can report back once you push 
things into testing?

Having a list of tested cards in the wiki would be great as well.

#END Technical details

In the long term, I would like to foster Arch Linux as the leading
platform for scientific computing. This includes Machine Learning
libraries in the official repositories as well as packages for classical
"number crunching" such as petsc, trilinos and packages that depend on
them: deal-ii, dune or ngsolve.

The sponsors of my application are Sven (svenstaro) and Bruno (archange).

I'm looking forward to the upcoming discussion and your feedback on
my application.

Best,
Torsten
Best Regards
Justin



