[Beowulf] AMD is looking for expert HPC/AI sysadmins/SRE

Joe Landman Wed, 11 Jun 2025 18:09:08 -0700

Hi folks:

Quick post for the day job. AMD (my employer) is looking for expert systems administrators for a mix of our internal HPC systems, and helping customers stand up their AI and HPC clusters.

AMD systems include a small version of Frontier, some El Cap-adjacent nodes, and a variety of large GPU-accelerator-based nodes. Customer systems range from smaller 64-node systems through systems multiple orders of magnitude larger.


   Needed skills/attributes include:

 * 5+ years in an HPC systems admin/HPC SRE role
 * expert Linux knowledge, debugging, problem resolution
 * strong hardware debugging experience
 * Slurm management, setup, configuration
 * development experience in Python, Bash, C/C++
 * RDMA network setup/config/testing
 * Benchmarking and performance measurement
 * Monitoring systems
 * Storage systems, including Lustre, NFS, BeeGFS, etc.
 * Installing and configuring device drivers for advanced hardware:
   GPUs and networks
 * Modules and configuration (HPE/Cray and lmod)
 * capability to work in/around AMD and customer data centers, and
   occasional travel to those DCs

    Desired experience/attributes include:

 * Proximity to the Austin, TX or Santa Clara/San Jose offices, though
   remote is possible
 * CUDA and/or ROCm experience
 * HPE/Cray programming environment and modules
 * familiarity with AI frameworks
 * US Citizenship or green card

I don't have a job req to point to yet, but should have one soon. You can reach me here, or on https://linkedin.com/in/joelandman . I am the hiring manager.


  Regards

Joe

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
