SIG/HPC Meeting 2024-02-22

Attendees

  • Sherif Nagy
  • Neil Hanlon
  • Alan Marshall
  • Brian Peters
  • Chris Simmons
  • Brian Phan
  • Forrest Burt

Follow Ups

  • NVIDIA GPU driver testing - Chris
    • https://github.com/mghpcsim/gpu-testing/tree/master
    • Documented process for configuring an instance, installing drivers (open source or proprietary), and setting up container runtimes and the NVIDIA Container Toolkit
      • Benchmarks use a forked toolkit from Lambda Labs with Rocky customizations
    • Initial control benchmark (PyTorch):
      • Closed-source drivers slightly faster (4 s)
    • Plan: run benchmarks on progressively newer instances and collect results
      • Publish results on Wiki
  • Intel driver - Met with them, went well
    • Can build this driver as signed kernel modules and add it to the testing Chris is doing
    • This will live in SIG/Kernel because it's a kernel module
    • Driver toolkit pieces will probably end up in HPC SIG
  • Kernel Cnode (for MoS)
    • Sherif synced with Jeremy
      • Lots of progress has been made, almost all patches backported
      • There are a couple of problematic patches: they're based on SLES kernels, which are different enough to cause trouble
        • Pablo will help once the problem is set

Discussions

  • Testing - Warewulf, others
    • Sherif and Brian Phan synced on warewulf testing
    • Not just installability, but also upgrade path, etc.
    • What can we use? Multiple things, probably
      • OpenQA? TMT? Zuul? Whatever OpenHPC uses?
      • Testing team would also love to get more people involved and participating in building tests
    • Example tests:
      • Provision cluster
      • Nodes communicate
      • etc.
    • Want: have full end to end testing of all components
    • What tests do we want?
      • Functional
        • Create cluster
        • Create user
        • Submit job as user
        • Future:
          • Slurm accounting/dbd, others
  • Package tracking - PoI tracker
    • Neil is looking into how we can integrate this
  • Wiki Updates - Neil and Sherif will work on this at FOSDEM/CentOS Connect
    • This didn't really happen specifically, but discussions about ensuring Wikis are up to date did happen
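
As a rough illustration of the functional tests listed under Testing above (create cluster, create user, submit job as user), here is a minimal sketch of an ordered test plan in which each step only runs if the previous one passed. Everything in it is hypothetical: `Step`, `run_plan`, and the placeholder checks are invented for illustration, and no test framework has been chosen yet. Real checks would call whatever provisioning and scheduler tooling the SIG settles on.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One functional test step (hypothetical structure)."""
    name: str
    check: Callable[[], bool]  # returns True when the step succeeds


def run_plan(steps: List[Step]) -> List[Tuple[str, str]]:
    """Run steps in order; after the first failure, mark the rest as skipped."""
    results: List[Tuple[str, str]] = []
    failed = False
    for step in steps:
        if failed:
            results.append((step.name, "skipped"))
            continue
        try:
            ok = bool(step.check())
        except Exception:
            # A check that raises counts as a failure.
            ok = False
        results.append((step.name, "passed" if ok else "failed"))
        failed = not ok
    return results


if __name__ == "__main__":
    # Placeholder checks only; each lambda stands in for a real check
    # (assumption: the actual commands are still to be defined).
    plan = [
        Step("create cluster", lambda: True),
        Step("create user", lambda: True),
        Step("submit job as user", lambda: True),
    ]
    for name, status in run_plan(plan):
        print(f"{name}: {status}")
```

Stopping at the first failure keeps the report readable, since submitting a job is meaningless if cluster or user creation did not succeed.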

Open Floor

  • N/A

Action Items

  • Sherif to build and release Intel driver
  • Sherif and Brian to work on defining tests that we want to run
  • Neil to work on package update notifications