generated from sig_core/wiki-template
2.3 KiB
2.3 KiB
SIG/HPC Meeting 2024-02-22
Attendees
- Sherif Nagy
- Neil Hanlon
- Alan Marshall
- Brian Peters
- Chris Simmons
- Brian Phan
- Forrest Burt
Follow Ups
- NVIDIA GPU driver Testing - Chris
- https://github.com/mghpcsim/gpu-testing/tree/master
- documented process for configuring instance, installing drivers (open source or proprietary), setting up container runtimes, nvidia container toolkit
- Benchmarks using forked toolkit from Lambda labs with Rocky customizations
- initial control benchmark (pytorch):
- closed drivers slightly (4s) faster
- Plan: run benchmarks on progressively newer instances and collect results
- Publish results on Wiki
- Intel driver - Met with them, went well
- Can build this driver into signed kernel modules, add to testing Chris is doing
- This will live in SIG/Kernel because it's a kernel module
- driver toolkit pieces probably will end up in HPC SIG
- Kernel Cnode (for MoS)
- Sherif synced with Jeremy
- Lots of progress has been made, almost all patches backported
- there are couple problematic patches--they're based on SLES kernels, but a bit different enough to be problematic
- Pablo will help once the problem is set
- Sherif synced with Jeremy
Discussions
- Testing - Warewulf, others
- Sherif and Brian Phan synced on warewulf testing
- Not just installibility, upgrade path, etc
- What can we use? Multiple things, probably
- OpenQA? TMT? Zuul? Whatever OpenHPC uses?
- Testing team would also love to get more people involved and participating in building tests
- Example tests:
- Provision cluster
- nodes communicate
- etc
- Want: have full end to end testing of all components
- What tests do we want?
- Functional
- Create cluster
- Create user
- Submit job as user
- Future:
- Slurm accounting/dbd, others
- Functional
- Package tracking - PoI tracker
- Neil is looking how we can integrate this
- Wiki Updates - Neil and Sherif will work on this at FOSDEM/CentOS Connect
- This didn't really happen specifically, but discussions about ensuring Wikis are up to date did happen
Open Floor
- N/A
Action Items
- Sherif to build and release intel driver
- Sherif and Brian to work on defining tests that we want to run
- Neil to work on package update notifications