SIG/HPC Meeting 2024-02-22

Attendees

  • Sherif Nagy
  • Neil Hanlon
  • Alan Marshall
  • Brian Peters
  • Chris Simmons
  • Brian Phan
  • Forrest Burt

Follow Ups

  • NVIDIA GPU driver testing - Chris
  • https://github.com/mghpcsim/gpu-testing/tree/master
  • Documented process for configuring an instance, installing drivers (open source or proprietary), and setting up container runtimes and the NVIDIA Container Toolkit
    • Benchmarks use a toolkit forked from Lambda Labs, with Rocky customizations
  • Initial control benchmark (PyTorch):
    • closed drivers were slightly faster (4s)
  • Plan: run benchmarks on progressively newer instances and collect results
    • Publish results on Wiki
  • Intel driver - Met with them, went well
  • Can build this driver into signed kernel modules and add it to the testing Chris is doing
  • This will live in SIG/Kernel because it's a kernel module
  • Driver toolkit pieces will probably end up in the HPC SIG
  • Kernel Cnode (for MoS)
  • Sherif synced with Jeremy
    • Lots of progress has been made, almost all patches backported
    • There are a couple of problematic patches: they're based on SLES kernels, which are just different enough to cause problems
    • Pablo will help once the problem is set

Discussions

  • Testing - Warewulf, others
  • Sherif and Brian Phan synced on Warewulf testing
  • Not just installability: also upgrade path, etc.
  • What can we use? Multiple things, probably
    • OpenQA? TMT? Zuul? Whatever OpenHPC uses?
    • Testing team would also love to get more people involved and participating in building tests
  • Example tests:
    • Provision cluster
    • nodes communicate
    • etc
  • Want: have full end to end testing of all components
  • What tests do we want?
    • Functional
    • Create cluster
    • Create user
    • Submit job as user
    • Future:
      • Slurm accounting/dbd, others
  • Package tracking - PoI tracker
  • Neil is looking into how we can integrate this
  • Wiki Updates - Neil and Sherif will work on this at FOSDEM/CentOS Connect
  • This didn't really happen specifically, but discussions about ensuring Wikis are up to date did happen
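
As a rough illustration of the functional tests listed under Testing (create cluster, create user, submit job as user), the sequence could be sketched as an ordered list of steps that stops at the first failure. This is only a sketch under stated assumptions: no test framework has been chosen yet, and the script names below are hypothetical placeholders, not existing tooling.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple

# Step names follow the minutes. The commands are HYPOTHETICAL placeholder
# scripts; replace them with whatever the chosen tooling actually provides.
FUNCTIONAL_STEPS: List[Tuple[str, Sequence[str]]] = [
    ("create cluster", ["./create-cluster.sh"]),
    ("create user", ["./create-user.sh"]),
    ("submit job as user", ["./submit-job.sh"]),
]


def run_command(cmd: Sequence[str]) -> bool:
    """Run a command and report success (exit code 0)."""
    try:
        return subprocess.run(list(cmd), check=False).returncode == 0
    except OSError:
        # Missing or non-executable command counts as a failed step.
        return False


def run_steps(
    steps: Sequence[Tuple[str, Sequence[str]]],
    runner: Callable[[Sequence[str]], bool] = run_command,
) -> List[Tuple[str, bool]]:
    """Run steps in order and stop at the first failure.

    Later steps depend on earlier ones (a job cannot be submitted
    without a cluster and a user), so they are skipped after a failure.
    """
    results: List[Tuple[str, bool]] = []
    for name, cmd in steps:
        ok = runner(cmd)
        results.append((name, ok))
        if not ok:
            break
    return results


if __name__ == "__main__":
    for name, ok in run_steps(FUNCTIONAL_STEPS):
        print(f"{name}: {'PASS' if ok else 'FAIL'}")
```

The `runner` parameter lets the same step list be exercised with a stub instead of real commands; future tests such as Slurm accounting would be appended as additional steps.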

Open Floor

  • N/A

Action Items

  • Sherif to build and release Intel driver
  • Sherif and Brian to work on defining tests that we want to run
  • Neil to work on package update notifications