generated from sig_core/wiki-template
add meeting minutes for 2024-02-08, 2024-02-22
parent 6f3fe414da
commit 47af201e76
35
docs/events/meeting-notes/2024-02-08.md
Normal file
@@ -0,0 +1,35 @@
# SIG/HPC Meeting 2024-02-08

## Attendees

* Sherif Nagy
* Neil Hanlon
* Chris Simmons

(Neil forgot to take attendance)

## Follow Ups

* Slurm 23.11.5 in production
    * Adjust conflicts and provides for older packages
* Meeting with Intel on Monday re: GPU drivers; need insight on testing
    * Monday @ 4 PM Eastern (?) - Chris will invite Neil
    * Secure Boot support?
    * Driver is fully open source
* No update from Chris on PMIx
* No movement on Lustre filesystem yet
* Neil to actually put in [tickets](https://git.resf.org/sig_hpc/meta/issues) for [[Meeting/2024-01-25/Rocky/SIG/HPC|Last Meeting]]
* Brian Phan and Forrest Burt gave talks on Warewulf/Apptainer
* Sherif and Brian met up at FOSDEM and discussed testing for Warewulf and what we can/should test

## Discussions

* (Neil had to leave early)

### Open Floor

* N/A

### Action Items

* N/A
66
docs/events/meeting-notes/2024-02-22.md
Normal file
@@ -0,0 +1,66 @@
# SIG/HPC Meeting 2024-02-22

## Attendees

* Sherif Nagy
* Neil Hanlon
* Alan Marshall
* Brian Peters
* Chris Simmons
* Brian Phan
* Forrest Burt

## Follow Ups

* NVIDIA GPU driver testing - Chris
    * https://github.com/mghpcsim/gpu-testing/tree/master
    * Documented process for configuring the instance, installing drivers (open source or proprietary), and setting up container runtimes and the NVIDIA Container Toolkit
    * Benchmarks using a forked toolkit from Lambda Labs with Rocky customizations
    * Initial control benchmark (PyTorch):
        * Closed drivers slightly (4s) faster
    * Plan: run benchmarks on progressively newer instances and collect results
    * Publish results on Wiki
* Intel driver - Met with them, went well
    * Can build this driver into signed kernel modules and add it to the testing Chris is doing
    * This will live in SIG/Kernel because it's a kernel module
    * Driver toolkit pieces will probably end up in HPC SIG
* Kernel Cnode (for MoS)
    * Sherif synced with Jeremy
    * Lots of progress has been made, almost all patches backported
    * There are a couple of problematic patches: they're based on SLES kernels, but different enough to cause problems
    * Pablo will help once the problem is set

## Discussions

* Testing - Warewulf, others
    * Sherif and Brian Phan synced on Warewulf testing
        * Not *just* installability, upgrade path, etc.
    * What can we use? Multiple things, probably
        * OpenQA? TMT? Zuul? Whatever OpenHPC uses?
    * Testing team would also love to get more people involved and participating in building tests
    * Example tests:
        * Provision cluster
        * Nodes communicate
        * etc.
    * Want: full end-to-end testing of all components
    * What tests do we want?
        * Functional
            * Create cluster
            * Create user
            * Submit job as user
        * Future:
            * Slurm accounting/dbd, others
* Package tracking - PoI tracker
    * Neil is looking into how we can integrate this
* Wiki Updates - Neil and Sherif will work on this at FOSDEM/CentOS Connect
    * This didn't really happen specifically, but discussions about ensuring Wikis are up to date did happen
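
As an illustration only, the functional tests listed above (create cluster, create user, submit job as user) could be sketched as a dry-run plan like the one below. This is not something the SIG agreed on: the function name `functional_test_plan`, the test user `hpctest`, and the choice of `hostname` as the job are all assumptions made for the example, and the "create cluster" step is reduced to a `sinfo` check because provisioning is site-specific. The script only prints commands and runs nothing.

```python
import shlex


def functional_test_plan(user="hpctest", partition=None):
    """Return the shell commands for a minimal functional smoke test.

    Nothing is executed here; the plan is only built so it can be reviewed.
    The user name and partition are hypothetical placeholders.
    """
    # Trivial job: run `hostname` and block until the job finishes.
    sbatch = ["sbatch", "--wait", "--wrap", "hostname"]
    if partition:
        # Optionally pin the job to a partition.
        sbatch[1:1] = ["--partition", partition]
    return [
        # 1. "Create cluster" is site-specific (e.g. provisioning with
        #    Warewulf), so this sketch only checks that Slurm reports nodes.
        ["sinfo"],
        # 2. Create an unprivileged test user.
        ["sudo", "useradd", "--create-home", user],
        # 3. Submit the job as that user.
        ["sudo", "-u", user] + sbatch,
    ]


if __name__ == "__main__":
    for step in functional_test_plan():
        print(shlex.join(step))
```

A real test would also need the user to exist on the compute nodes and would check the job's exit status and output; whichever framework is chosen (OpenQA, TMT, Zuul, or something else) would wrap steps like these.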

### Open Floor

* N/A

### Action Items

* Sherif to build and release Intel driver
* Sherif and Brian to work on defining tests that we want to run
* Neil to work on package update notifications