Hi folks:

   Quick post for the day job.  AMD (my employer) is looking for expert systems administrators to work on a mix of our internal HPC systems and to help customers stand up their AI and HPC clusters.

   AMD systems include a small version of Frontier, some El Cap adjacent nodes, and a variety of large GPU-accelerator-based nodes.  Customer systems range from smaller 64-node systems to systems multiple orders of magnitude larger.

   Needed skills/attributes include:

 * 5+ years in an HPC systems admin/HPC SRE role
 * expert Linux knowledge, debugging, problem resolution
 * strong hardware debugging experience
 * Slurm management, setup, configuration
 * development experience in Python, Bash, C/C++
 * RDMA network setup/config/testing
 * Benchmarking and performance measurement
 * Monitoring systems
 * Storage systems, including Lustre, NFS, BeeGFS, etc.
 * Installing and configuring device drivers for advanced hardware:
   GPUs and networks
 * Modules and configuration (HPE/Cray and Lmod)
 * capability to work in/around AMD and customer data centers, and
   occasional travel to those DCs

   Desired experience/attributes include:

 * Proximity to the Austin, TX or Santa Clara/San Jose offices, though
   remote is possible
 * CUDA and/or ROCm experience
 * HPE/Cray programming environment and modules
 * familiarity with AI frameworks
 * US Citizenship or green card

  I don't have a job req to point to yet, but should have this soon.  You can reach me here, or on https://linkedin.com/in/joelandman .  I am the hiring manager.

  Regards

Joe
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
