generated from sig_core/wiki-template
add meeting minutes for 2024-02-08, 2024-02-22 #25
35
docs/events/meeting-notes/2024-02-08.md
Normal file
@@ -0,0 +1,35 @@
# SIG/HPC Meeting 2024-02-08

## Attendees

* Sherif Nagy
* Neil Hanlon
* Chris Simmons

(Neil forgot to take attendance)

## Follow Ups

* Slurm 23.11.5 in production
* Adjust conflicts and provides for older packages
* Meeting with Intel on Monday re: GPU drivers; need insight on testing
    * Monday @ 4 PM Eastern (?) - Chris will invite Neil
    * Secure Boot support?
    * Driver is fully open source
* No update from Chris on PMIx
* No movement on Lustre filesystem yet
* Neil to actually put in [tickets](https://git.resf.org/sig_hpc/meta/issues) for [[Meeting/2024-01-25/Rocky/SIG/HPC|Last Meeting]]
* Brian Phan and Forrest Burt gave talks on Warewulf/Apptainer
* Sherif and Brian met up at FOSDEM and discussed testing for Warewulf, and what we can/should test

## Discussions

* (Neil had to leave early)

### Open Floor

* N/A

### Action Items

* N/A
66
docs/events/meeting-notes/2024-02-22.md
Normal file
@@ -0,0 +1,66 @@
# SIG/HPC Meeting 2024-02-22

## Attendees

* Sherif Nagy
* Neil Hanlon
* Alan Marshall
* Brian Peters
* Chris Simmons
* Brian Phan
* Forrest Burt

## Follow Ups

* NVIDIA GPU driver testing - Chris
    * https://github.com/mghpcsim/gpu-testing/tree/master
    * Documented process for configuring an instance, installing drivers (open source or proprietary), and setting up container runtimes and the NVIDIA Container Toolkit
    * Benchmarks use a forked toolkit from Lambda Labs with Rocky customizations
    * Initial control benchmark (PyTorch):
        * Closed drivers slightly (4s) faster
    * Plan: run benchmarks on progressively newer instances and collect results
    * Publish results on Wiki
* Intel driver - met with them, went well
    * Can build this driver into signed kernel modules and add it to the testing Chris is doing
    * This will live in SIG/Kernel because it's a kernel module
    * Driver toolkit pieces will probably end up in HPC SIG
* Kernel Cnode (for MoS)
    * Sherif synced with Jeremy
    * Lots of progress has been made; almost all patches are backported
    * There are a couple of problematic patches: they're based on SLES kernels, but just different enough to be problematic
    * Pablo will help once the problem is set

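
For the NVIDIA benchmark plan above (collect results per instance, then publish on the Wiki), one possible shape for the publishing step is a small helper that turns collected timings into a Markdown table. This is only a sketch: `BenchResult`, `to_markdown`, the `"open"`/`"proprietary"` labels, and the column layout are assumptions made for illustration and are not taken from the gpu-testing repository.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BenchResult:
    """One benchmark run (hypothetical record layout)."""
    instance: str   # label for the instance type the run was made on
    driver: str     # "open" or "proprietary" (assumed labels)
    seconds: float  # wall-clock time of the run


def _fmt(value: Optional[float]) -> str:
    # Missing measurements are rendered as "n/a" rather than guessed.
    return f"{value:.1f}" if value is not None else "n/a"


def to_markdown(results: list[BenchResult]) -> str:
    """Render results as a Markdown table, one row per instance."""
    by_instance: dict[str, dict[str, float]] = {}
    for r in results:
        by_instance.setdefault(r.instance, {})[r.driver] = r.seconds

    lines = [
        "| Instance | Open (s) | Proprietary (s) | Open - Proprietary (s) |",
        "|---|---|---|---|",
    ]
    for instance, times in by_instance.items():
        open_s = times.get("open")
        prop_s = times.get("proprietary")
        # A positive delta means the open driver run took longer.
        if open_s is not None and prop_s is not None:
            delta = f"{open_s - prop_s:+.1f}"
        else:
            delta = "n/a"
        lines.append(f"| {instance} | {_fmt(open_s)} | {_fmt(prop_s)} | {delta} |")
    return "\n".join(lines)
```

The output of `to_markdown` could be pasted into a Wiki page as-is; rows keep the order in which instances were first seen, so feeding results from oldest to newest instance keeps the table in that order.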
## Discussions

* Testing - Warewulf, others
    * Sherif and Brian Phan synced on Warewulf testing
    * Not *just* installability, upgrade path, etc.
    * What can we use? Multiple things, probably
        * OpenQA? TMT? Zuul? Whatever OpenHPC uses?
    * Testing team would also love to get more people involved and participating in building tests
    * Example tests:
        * Provision cluster
        * Nodes communicate
        * etc.
    * Want: full end-to-end testing of all components
    * What tests do we want?
        * Functional
            * Create cluster
            * Create user
            * Submit job as user
        * Future:
            * Slurm accounting/dbd, others
* Package tracking - PoI tracker
    * Neil is looking into how we can integrate this
* Wiki updates - Neil and Sherif will work on this at FOSDEM/CentOS Connect
    * This didn't really happen specifically, but discussions about ensuring Wikis are up to date did happen
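
The functional tests listed above (cluster is up, create user, submit job as user) could be sketched as an ordered plan of commands that stops at the first failure. This is a minimal illustration, not an agreed design: `Step`, `build_plan`, `run_plan`, and the `hpctest` user name are hypothetical, and the choice of `sinfo`, `useradd`, and `sudo -u ... sbatch --wait --wrap` is an assumption about how such checks might be driven on a Slurm head node. It also assumes the user is visible on the compute nodes, which a real test would have to arrange.

```python
from __future__ import annotations

import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One functional check: a label plus the command to run."""
    name: str
    argv: tuple[str, ...]


def build_plan(user: str = "hpctest") -> list[Step]:
    """Build the ordered checks; 'hpctest' is a placeholder user name."""
    return [
        # Assumed stand-in for "cluster exists": the controller answers sinfo.
        Step("cluster responds", ("sinfo", "--noheader")),
        Step("create user", ("useradd", "--create-home", user)),
        # --wait makes sbatch block until the job ends and return its exit code.
        Step(
            "submit job as user",
            ("sudo", "-u", user, "sbatch", "--wait", "--wrap", "hostname"),
        ),
    ]


def run_plan(plan: list[Step], runner=subprocess.run) -> list[tuple[str, bool]]:
    """Run steps in order and stop after the first failing step.

    `runner` is injectable so the plan can be exercised without a cluster.
    """
    results: list[tuple[str, bool]] = []
    for step in plan:
        proc = runner(list(step.argv), capture_output=True, text=True)
        ok = proc.returncode == 0
        results.append((step.name, ok))
        if not ok:
            break
    return results
```

On a real head node this would be called as `run_plan(build_plan())` with sufficient privileges; whichever framework is chosen (OpenQA, TMT, Zuul, or something else) would wrap steps like these rather than replace them.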

### Open Floor

* N/A

### Action Items

* Sherif to build and release Intel driver
* Sherif and Brian to work on defining tests that we want to run
* Neil to work on package update notifications