# SIG/HPC Meeting 2024-02-22
## Attendees
* Sherif Nagy
* Neil Hanlon
* Alan Marshall
* Brian Peters
* Chris Simmons
* Brian Phan
* Forrest Burt
## Follow Ups
* NVIDIA GPU driver testing - Chris
    * https://github.com/mghpcsim/gpu-testing/tree/master
    * Documented process for configuring an instance, installing drivers (open source or proprietary), and setting up container runtimes and the NVIDIA Container Toolkit
    * Benchmarks use a forked toolkit from Lambda Labs with Rocky customizations
    * Initial control benchmark (PyTorch):
        * Closed drivers slightly (4s) faster
    * Plan: run benchmarks on progressively newer instances and collect results
    * Publish results on the wiki
* Intel driver - met with them, went well
    * Can build this driver into signed kernel modules and add it to the testing Chris is doing
    * This will live in SIG/Kernel because it's a kernel module
    * Driver toolkit pieces will probably end up in the HPC SIG
* Kernel Cnode (for MoS)
    * Sherif synced with Jeremy
    * Lots of progress has been made; almost all patches are backported
    * There are a couple of problematic patches: they're based on SLES kernels, which differ just enough to be problematic
    * Pablo will help once the problem is set
## Discussions
* Testing - Warewulf, others
    * Sherif and Brian Phan synced on Warewulf testing
    * Not *just* installability; upgrade path, etc.
    * What can we use? Multiple things, probably
        * OpenQA? TMT? Zuul? Whatever OpenHPC uses?
    * Testing team would also love to get more people involved and participating in building tests
    * Example tests:
        * Provision cluster
        * Nodes communicate
        * etc.
    * Want: full end-to-end testing of all components
    * What tests do we want?
        * Functional
            * Create cluster
            * Create user
            * Submit job as user
        * Future:
            * Slurm accounting/dbd, others
* Package tracking - PoI tracker
    * Neil is looking into how we can integrate this
* Wiki updates - Neil and Sherif will work on this at FOSDEM/CentOS Connect
    * This didn't really happen specifically, but discussions about keeping the wikis up to date did happen
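
As a rough illustration of the functional tests listed under "What tests do we want?" (create cluster, create user, submit job as user), the ordering could be sketched as below. This is only a sketch, not something agreed in the meeting: the test framework is still undecided, `./provision-cluster.sh` is a hypothetical placeholder script, and the `Step`, `functional_steps`, and `run_steps` names are made up for this example.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# A runner takes an argv list and returns the command's exit code.
Runner = Callable[[Sequence[str]], int]


def shell_runner(argv: Sequence[str]) -> int:
    """Run a command on the local host and return its exit code."""
    return subprocess.run(list(argv), check=False).returncode


@dataclass
class Step:
    name: str
    argv: List[str]


def functional_steps(user: str = "testuser") -> List[Step]:
    """Return the functional test steps in dependency order."""
    return [
        # Hypothetical placeholder: no provisioning method was chosen yet.
        Step("create cluster", ["./provision-cluster.sh"]),
        Step("create user", ["useradd", "--create-home", user]),
        # Submit a trivial Slurm job as the new user and wait for it to end.
        Step("submit job as user",
             ["sudo", "-u", user, "sbatch", "--wait", "--wrap=hostname"]),
    ]


def run_steps(steps: List[Step],
              runner: Runner = shell_runner) -> List[Tuple[str, bool]]:
    """Run steps in order and stop at the first failure."""
    results = []
    for step in steps:
        passed = runner(step.argv) == 0
        results.append((step.name, passed))
        if not passed:
            break  # later steps depend on the earlier ones succeeding
    return results
```

Calling `run_steps(functional_steps())` on a head node would run the real commands; passing a different `runner` lets the same steps be dry-run or wired into whichever framework (OpenQA, TMT, Zuul) ends up being used.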
### Open Floor
* N/A
### Action Items
* Sherif to build and release the Intel driver
* Sherif and Brian to work on defining tests that we want to run
* Neil to work on package update notifications