add meeting minutes for 2024-02-08, 2024-02-22 #25

Merged
neil merged 1 commits from meeting/2024-02-22 into main 2024-02-22 21:37:09 +00:00
2 changed files with 101 additions and 0 deletions


@ -0,0 +1,35 @@
# SIG/HPC Meeting 2024-02-08
## Attendees
* Sherif Nagy
* Neil Hanlon
* Chris Simmons
(Neil forgot to take attendance)
## Follow Ups
* Slurm 23.11.5 in production
* Adjust conflicts and provides for older packages
* Meeting with Intel on Monday re: GPU drivers; need insight on testing
    * Monday @ 4 PM Eastern (?) - Chris will invite Neil
    * Secure Boot support?
    * Driver is fully open source
* No update from Chris on PMIx
* no movement on Lustre filesystem yet
* Neil to actually file [tickets](https://git.resf.org/sig_hpc/meta/issues) for the items from the [[Meeting/2024-01-25/Rocky/SIG/HPC|Last Meeting]]
* Brian Phan and Forrest Burt gave talks on Warewulf/Apptainer
* Sherif and Brian met up at FOSDEM and discussed testing for Warewulf, and what we can/should test
## Discussions
* (Neil had to leave early)
### Open Floor
* N/A
### Action Items
* N/A


@ -0,0 +1,66 @@
# SIG/HPC Meeting 2024-02-22
## Attendees
* Sherif Nagy
* Neil Hanlon
* Alan Marshall
* Brian Peters
* Chris Simmons
* Brian Phan
* Forrest Burt
## Follow Ups
* NVIDIA GPU driver testing - Chris
    * https://github.com/mghpcsim/gpu-testing/tree/master
    * Documented process for configuring the instance, installing drivers (open source or proprietary), and setting up container runtimes and the NVIDIA Container Toolkit
    * Benchmarks use a forked toolkit from Lambda Labs with Rocky customizations
    * Initial control benchmark (PyTorch):
        * Closed drivers slightly (4s) faster
    * Plan: run benchmarks on progressively newer instances and collect results
    * Publish results on Wiki
* Intel driver - met with them, went well
    * Can build this driver into signed kernel modules and add it to the testing Chris is doing
    * This will live in SIG/Kernel because it's a kernel module
    * Driver toolkit pieces will probably end up in HPC SIG
* Kernel Cnode (for MoS)
    * Sherif synced with Jeremy
    * Lots of progress has been made, almost all patches backported
    * There are a couple of problematic patches: they're based on SLES kernels, which differ just enough to be problematic
    * Pablo will help once the problem is set
## Discussions
* Testing - Warewulf, others
    * Sherif and Brian Phan synced on Warewulf testing
    * Not *just* installability, upgrade path, etc.
    * What can we use? Multiple things, probably
        * OpenQA? TMT? Zuul? Whatever OpenHPC uses?
    * Testing team would also love to get more people involved and participating in building tests
    * Example tests:
        * Provision cluster
        * Nodes communicate
        * etc.
    * Want: full end-to-end testing of all components
    * What tests do we want?
        * Functional
            * Create cluster
            * Create user
            * Submit job as user
        * Future:
            * Slurm accounting/dbd, others
* Package tracking - PoI tracker
    * Neil is looking into how we can integrate this
* Wiki updates - Neil and Sherif will work on this at FOSDEM/CentOS Connect
    * This didn't really happen specifically, but discussions about keeping the wikis up to date did happen
### Open Floor
* N/A
### Action Items
* Sherif to build and release the Intel driver
* Sherif and Brian to work on defining tests that we want to run
* Neil to work on package update notifications