On Mon,  7 Nov 2022 19:05:16 +0000
Torsten Keßler <t.kess...@posteo.de> wrote:

> Hi Justin!
> 
> > Some ROC repositories include documentation (cmake, device libs, hip), 
> > maybe it would make
> > sense to include those in `/usr/share/doc/${pkgname}`?  
> That's a very good idea. For some packages, AMD bundles the documentation 
> with the package itself (rocm-dbgapi); for others it's shipped separately, 
> see hip-doc [1].
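Something like the following in package() would probably be enough, as a rough
sketch (the doc/ directory is just a placeholder; the actual layout differs per
repository):

    package() {
      cd "$srcdir/$pkgname-$pkgver"
      # ... install the build as usual ...
      # then ship the bundled documentation
      install -d "$pkgdir/usr/share/doc/$pkgname"
      cp -r doc/. "$pkgdir/usr/share/doc/$pkgname/"
    }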
> 
> > The limited support of ROCm has been one of the main things locking me into 
> > Nvidia for my
> > workstations.  
> Yes, that's really the main drawback of ROCm. CUDA works on almost any 
> Nvidia GPU (even on mobile variants). I hope AMD will change their 
> policy with Navi 30+.
> 
> > Have you tried contacting AMD about `rocm-core`?  
> Others already did. AMD support promised to release the source code in 
> March [2].
> 
> > Finding information about ROCm support in consumer cards really isn't easy 
> > – but I guess with
> > CUDA I just expect it to work with recent Nvidia cards?  
> Do you mean the common HIP abstraction layer (like hipfft, hipblas,...)? 
> Yes, that should work with any recent CUDA version. But I haven't tried 
> this as I don't have access to an Nvidia GPU. Furthermore, this feature 
> (HIP with CUDA) has never been requested by the community at rocm-arch. 
> I think Nvidia users just stick with CUDA and don't need HIP.
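Out of curiosity, my understanding is that the CUDA path would mostly be a
matter of switching the HIP platform before compiling, roughly like this
(untested on my side; vectoradd.cpp stands in for any small HIP sample, and a
CUDA toolkit is assumed to be installed next to hipcc):

    # select the CUDA backend instead of ROCm; hipcc then drives nvcc
    export HIP_PLATFORM=nvidia
    hipcc vectoradd.cpp -o vectoradd
    ./vectoradd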

I mean that with ROCm I'm not sure whether a GPU I'm going to buy will support it.
 
> > Maybe it would be a good idea to provide testing scripts / documents for 
> > them, so they can
> > report back once you push things into testing?  
> Absolutely! There are the HIP examples [3] from AMD, which check basic HIP 
> language features. Additionally, we have `rocm-validation-suite`, which 
> offers several tests.
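A small smoke-test script along those lines would already let other TUs report
back; a sketch of what I have in mind (exact paths and flags will depend on how
the packages end up in [community]):

    #!/bin/bash
    set -e
    # 1. the HSA runtime detects the GPU agent
    rocminfo | grep -i gfx
    # 2. build and run one of AMD's HIP examples [3]
    git clone https://github.com/ROCm-Developer-Tools/HIP-Examples.git
    (cd HIP-Examples/vectorAdd && make && ./vectoradd_hip.exe)
    # 3. rocm-validation-suite: list the GPUs it can test (see rvs --help)
    rvs -g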
> 
> > Having a list of tested cards in the wiki would be great as well.  
> I agree! Once we have an established test suite, this should be 
> straightforward.
> 
> Best!
> Torsten
> 
> [1] http://repo.radeon.com/rocm/apt/5.3/pool/main/h/hip-doc/
> [2] 
> https://github.com/RadeonOpenCompute/ROCm/issues/1705#issuecomment-1081599282
> [3] https://github.com/ROCm-Developer-Tools/HIP-Examples
> 
> On 06.11.22 at 23:10, aur-general-requ...@lists.archlinux.org wrote:
> > Date: Sun, 6 Nov 2022 20:01:14 +0100
> > From: Justin Kromlinger <hashwo...@archlinux.org>
> > Subject: Re: TU Application - tpkessler
> > To: aur-general@lists.archlinux.org
> >
> > Hi Torsten!
> >
> > On Wed, 26 Oct 2022 06:30:33 +0000
> > Torsten Keßler <t.kess...@posteo.de> wrote:
> >  
> >> Hi! I'm Torsten Keßler (tpkessler in AUR and on GitHub) from Saarland, a
> >> federal state in the south west of Germany. With this email
> >> I'm applying to become a trusted user.
> >> After graduating with a PhD in applied mathematics this year I'm now
> >> a post-doc with a focus on numerical analysis, the art of solving physical
> >> problems with mathematically sound algorithms on a computer.
> >> I've been using Arch Linux on my private machines (and at work) since my
> >> first weeks at university ten years ago. After some initial distro hopping,
> >> a friend recommended Arch. I immediately liked the way it handles packages
> >> via pacman, its wiki and the flexibility of its installation process.  
> > Soon we can switch the Arch Linux IRC main language to German!
> >  
> >> Owing to their massively parallel architecture, GPUs have emerged as the
> >> leading platform for computationally expensive problems: Machine
> >> Learning/AI, real-world engineering problems, simulation of complex
> >> physical systems. For a long time, nVidia's CUDA framework (closed
> >> source, exclusively for their GPUs) has dominated this field. In 2015,
> >> AMD announced ROCm, their open source compute framework for GPUs. A
> >> common interface to CUDA, called HIP, makes it possible to write code
> >> that compiles and runs both on AMD and nVidia hardware. I've been
> >> closely following the development of ROCm on GitHub, trying to compile
> >> the stack from time to time. But only since 2020 has the kernel included
> >> all the necessary code to compile the ROCm stack on Arch Linux. Around
> >> this time I started to contribute to rocm-arch on GitHub, a
> >> collection of PKGBUILDs for ROCm (with around 50 packages). Soon after
> >> that, I became the main contributor to the repository and, since 2021,
> >> I've been the maintainer of the whole ROCm stack.
> >> We have an active issue tracker and recently started a discussion page
> >> for rocm-arch. Most of the open issues as of now are for bookkeeping of
> >> patches we applied to run ROCm on Arch Linux. Many of them are linked to
> >> an upstream issue and a corresponding pull request that fixes the
> >> issue. This way I've already contributed code to a couple of libraries
> >> of the ROCm stack.
> >>
> >> Over the years, many libraries have added official support for ROCm,
> >> including tensorflow, pytorch, python-cupy, python-numba (not actively
> >> maintained anymore) and blender. ROCm support for the latter generated
> >> great interest in the community and is one reason Sven contacted me,
> >> asking whether I would be interested in taking care of ROCm in
> >> [community]. In its current version, ROCm support for blender works out
> >> of the box. Just install hip-runtime-amd from the AUR and enable the HIP
> >> backend in blender's settings for rendering. The machine learning
> >> libraries require more dependencies from the AUR. Once installed,
> >> pytorch and tensorflow are known to work on Vega GPUs and the recent
> >> RDNA architecture.
> >>
> >> My first action as a TU would be to add basic support of ROCm to
> >> [community], i.e. the low level libraries, including HIP and an open
> >> source runtime for OpenCL based on ROCm. That would be enough to run
> >> blender with its ROCm backend. At the same time, I would expand the wiki
> >> article on ROCm. The interaction with the community would also move from
> >> the issue tracker of rocm-arch to the Arch Linux bug tracker and the
> >> forums. In a second phase I would add the high level libraries that
> >> would enable users to quickly compile and run complex libraries such as
> >> tensorflow, pytorch or cupy.  
> > The limited support of ROCm has been one of the main things locking me into 
> > Nvidia for my
> > workstations. Having stuff in community would certainly help with that!
> >  
> >> #BEGIN Technical details
> >>
> >> The minimal package list for HIP, which includes the runtime libraries
> >> for basic GPU programming and the GPU compiler (hipcc), comprises eight
> >> packages:
> >>
> >> * rocm-cmake (basic cmake files for ROCm)
> >> * rocm-llvm (upstream llvm with to-be-merged changes by AMD)
> >> * rocm-device-libs (implements math functions for all GPU architectures)
> >> * comgr (runtime library, "compiler support" for rocm-llvm)
> >> * hsakmt-roct (interface to the amdgpu kernel driver)
> >> * hsa-rocr (runtime for HSA compute kernels)
> >> * rocminfo (display information on HSA agents: GPU and possibly CPU)
> >> * hip-runtime-amd (runtime and compiler for HIP, a C++ dialect inspired
> >> by CUDA C++)  
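For anyone who wants to try this before it lands in [community], my guess is
that the build boils down to a loop over the list above in that order
(rocm-llvm will presumably take by far the longest to build); a sketch only,
with makepkg pulling the remaining dependencies from the repos:

    # build the minimal HIP stack from the AUR in dependency order
    for pkg in rocm-cmake rocm-llvm rocm-device-libs comgr hsakmt-roct \
               hsa-rocr rocminfo hip-runtime-amd; do
        git clone "https://aur.archlinux.org/$pkg.git"
        # -s: install build deps from the repos, -i: install the built package
        (cd "$pkg" && makepkg -si --noconfirm)
    done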
> > PKGBUILDs look good to me. Some ROC repositories include documentation 
> > (cmake, device libs,
> > hip), maybe it would make sense to include those in 
> > `/usr/share/doc/${pkgname}`?
> >  
> >> All but rocm-llvm are small libraries under the permissive MIT license.
> >> Since ROCm 5.2, all packages successfully build in a clean chroot and
> >> are distributed in the community repo arch4edu.
> >>
> >> The application libraries for numerical linear algebra, sparse matrices
> >> or random numbers start with roc or hip (rocblas, rocsparse, rocrand).
> >> The hip* packages are designed in such a way that they would also work
> >> with CUDA if HIP is configured with a CUDA backend instead of ROCm/HSA.
> >> With few exceptions (rocthrust, rccl) these packages are licensed under 
> >> MIT.
> >>
> >> Possible issues:
> >> There are three packages that are not fully working under Arch Linux or
> >> lack an open source license. The first is rocm-gdb, a fork of gdb with
> >> GPU support. To work properly it needs a kernel module currently not
> >> available in upstream Linux but only as part of AMD's dkms modules, which
> >> in turn only work with specific kernel versions. I therefore dropped
> >> support for this on Arch Linux a while ago. One closed source package is
> >> hsa-amd-aqlprofile. As the name suggests it is used for profiling as
> >> part of rocprofiler. The above-mentioned packages are only required for
> >> debugging and profiling but are not runtime dependencies of the big
> >> machine learning libraries or any other package with ROCm support I'm
> >> aware of. The third package is rocm-core, which is only part of the ROCm
> >> meta packages and has no influence on the ROCm runtime. It
> >> provides a single header and a library with a single function that
> >> returns the current ROCm version. No source code has been published by
> >> AMD so far and the official package lacks a license file.  
> > Have you tried contacting AMD about `rocm-core`? It seems odd to keep such 
> > a small thing closed
> > source / without a license.
> >  
> >> A second issue is GPU support. AMD officially only supports the
> >> professional compute GPUs. This does not mean that ROCm does not work
> >> on consumer cards, but merely that AMD cannot guarantee all
> >> functionality through extensive testing. Recently, ROCm added support
> >> for Navi 21 (RX 6800 onwards), see
> >>
> >> https://docs.amd.com/bundle/Hardware_and_Software_Reference_Guide/page/Hardware_and_Software_Support.html
> >>
> >> I own a Vega 56 (gfx900) that is officially supported, so I can test all
> >> packages before publishing them on the AUR / in [community].  
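As a quick check, the gfx target a card reports can be compared against that
table; assuming rocminfo is installed, something like:

    # print the ISA/gfx target reported for the GPU agent
    rocminfo | grep -o -m1 'gfx[0-9a-f]*'
    # a Vega 56 should print gfx900, a Navi 21 card gfx1030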
> > Finding information about ROCm support in consumer cards really isn't easy 
> > – but I guess with
> > CUDA I just expect it to work with recent Nvidia cards?
> >
> > I would guess that we have a bunch of TUs with Radeon RX 5000/6000 (and 
> > soon 7000) series cards,
> > but without the needed knowledge / use case for ROCm. Maybe it would be a 
> > good idea to provide
> > testing scripts / documents for them, so they can report back once you push 
> > things into testing?
> >
> > Having a list of tested cards in the wiki would be great as well.
> >  
> >> #END Technical details
> >>
> >> In the long term, I would like to foster Arch Linux as the leading
> >> platform for scientific computing. This includes Machine Learning
> >> libraries in the official repositories as well as packages for classical
> >> "number crunching" such as petsc, trilinos and packages that depend on
> >> them: deal-ii, dune or ngsolve.
> >>
> >> The sponsors of my application are Sven (svenstaro) and Bruno (archange).
> >>
> >> I'm looking forward to the upcoming discussion and your feedback on
> >> my application.
> >>
> >> Best,
> >> Torsten  
> > Best Regards
> > Justin
> >
> >
> >  
> 
