generated from sig_core/wiki-template
67 lines
2.3 KiB
Markdown
67 lines
2.3 KiB
Markdown
|
# SIG/HPC Meeting 2024-02-22
|
||
|
|
||
|
## Attendees
|
||
|
|
||
|
* Sherif Nagy
|
||
|
* Neil Hanlon
|
||
|
* Alan Marshall
|
||
|
* Brian Peters
|
||
|
* Chris Simmons
|
||
|
* Brian Phan
|
||
|
* Forrest Burt
|
||
|
|
||
|
## Follow Ups
|
||
|
|
||
|
* NVIDIA GPU driver Testing - Chris
|
||
|
* https://github.com/mghpcsim/gpu-testing/tree/master
|
||
|
* documented process for configuring instance, installing drivers (open source or proprietary), setting up container runtimes, nvidia container toolkit
|
||
|
* Benchmarks using forked toolkit from Lambda labs with Rocky customizations
|
||
|
* initial control benchmark (pytorch):
|
||
|
* closed drivers slightly (4s) faster
|
||
|
* Plan: run benchmarks on progressively newer instances and collect results
|
||
|
* Publish results on Wiki
|
||
|
* Intel driver - Met with them, went well
|
||
|
* Can build this driver into signed kernel modules, add to testing Chris is doing
|
||
|
* This will live in SIG/Kernel because it's a kernel module
|
||
|
* driver toolkit pieces probably will end up in HPC SIG
|
||
|
* Kernel Cnode (for MoS)
|
||
|
* Sherif synced with Jeremy
|
||
|
* Lots of progress has been made, almost all patches backported
|
||
|
* there are couple problematic patches--they're based on SLES kernels, but a bit different enough to be problematic
|
||
|
* Pablo will help once the problem is set
|
||
|
|
||
|
## Discussions
|
||
|
|
||
|
* Testing - Warewulf, others
|
||
|
* Sherif and Brian Phan synced on warewulf testing
|
||
|
* Not *just* installibility, upgrade path, etc
|
||
|
* What can we use? Multiple things, probably
|
||
|
* OpenQA? TMT? Zuul? Whatever OpenHPC uses?
|
||
|
* Testing team would also love to get more people involved and participating in building tests
|
||
|
* Example tests:
|
||
|
* Provision cluster
|
||
|
* nodes communicate
|
||
|
* etc
|
||
|
* Want: have full end to end testing of all components
|
||
|
* What tests do we want?
|
||
|
* Functional
|
||
|
* Create cluster
|
||
|
* Create user
|
||
|
* Submit job as user
|
||
|
* Future:
|
||
|
* Slurm accounting/dbd, others
|
||
|
* Package tracking - PoI tracker
|
||
|
* Neil is looking how we can integrate this
|
||
|
* Wiki Updates - Neil and Sherif will work on this at FOSDEM/CentOS Connect
|
||
|
* This didn't really happen specifically, but discussions about ensuring Wikis are up to date did happen
|
||
|
|
||
|
### Open Floor
|
||
|
|
||
|
* N/A
|
||
|
|
||
|
### Action Items
|
||
|
|
||
|
* Sherif to build and release intel driver
|
||
|
* Sherif and Brian to work on defining tests that we want to run
|
||
|
* Neil to work on package update notifications
|