From 47af201e7609a10f589959ef26fff135d016991a Mon Sep 17 00:00:00 2001
From: Neil Hanlon
Date: Thu, 22 Feb 2024 16:30:17 -0500
Subject: [PATCH] add meeting minutes for 2024-02-08, 2024-02-22

---
 docs/events/meeting-notes/2024-02-08.md | 35 +++++++++++++
 docs/events/meeting-notes/2024-02-22.md | 66 +++++++++++++++++++++++++
 2 files changed, 101 insertions(+)
 create mode 100644 docs/events/meeting-notes/2024-02-08.md
 create mode 100644 docs/events/meeting-notes/2024-02-22.md

diff --git a/docs/events/meeting-notes/2024-02-08.md b/docs/events/meeting-notes/2024-02-08.md
new file mode 100644
index 0000000..9060899
--- /dev/null
+++ b/docs/events/meeting-notes/2024-02-08.md
@@ -0,0 +1,35 @@
+# SIG/HPC Meeting 2024-02-08
+
+## Attendees
+
+* Sherif Nagy
+* Neil Hanlon
+* Chris Simmons
+
+(Neil forgot to take attendance)
+
+## Follow Ups
+
+* Slurm 23.11.5 in production
+    * Adjust conflicts and provides for older packages
+* Meeting with Intel on Monday re: GPU drivers; need insight on testing
+    * Monday @4PM Eastern (?) - Chris will invite Neil
+    * Secureboot support?
+    * Driver is fully open source
+* no update from Chris on PMIX
+* no movement on Lustre filesystem yet
+* Neil to put in [tickets](https://git.resf.org/sig_hpc/meta/issues) actually for [[Meeting/2024-01-25/Rocky/SIG/HPC|Last Meeting]]
+* Brian Phan and Forrest Burt gave talks on Warewulf/Apptainer
+* Sherif and Brian met up at FOSDEM and discussed testing for WW, and what we can/should test
+
+## Discussions
+
+* (Neil had to leave early)
+
+### Open Floor
+
+* N/A
+
+### Action Items
+
+* N/A
diff --git a/docs/events/meeting-notes/2024-02-22.md b/docs/events/meeting-notes/2024-02-22.md
new file mode 100644
index 0000000..9d87723
--- /dev/null
+++ b/docs/events/meeting-notes/2024-02-22.md
@@ -0,0 +1,66 @@
+# SIG/HPC Meeting 2024-02-22
+
+## Attendees
+
+* Sherif Nagy
+* Neil Hanlon
+* Alan Marshall
+* Brian Peters
+* Chris Simmons
+* Brian Phan
+* Forrest Burt
+
+## Follow Ups
+
+* NVIDIA GPU driver Testing - Chris
+    * https://github.com/mghpcsim/gpu-testing/tree/master
+    * documented process for configuring instance, installing drivers (open source or proprietary), setting up container runtimes, NVIDIA container toolkit
+    * Benchmarks using forked toolkit from Lambda Labs with Rocky customizations
+    * initial control benchmark (pytorch):
+        * closed drivers slightly (4s) faster
+    * Plan: run benchmarks on progressively newer instances and collect results
+    * Publish results on Wiki
+* Intel driver - Met with them, went well
+    * Can build this driver into signed kernel modules, add to testing Chris is doing
+    * This will live in SIG/Kernel because it's a kernel module
+    * driver toolkit pieces probably will end up in HPC SIG
+* Kernel Cnode (for MoS)
+    * Sherif synced with Jeremy
+    * Lots of progress has been made, almost all patches backported
+    * there are a couple of problematic patches; they're based on SLES kernels, but different enough to be problematic
+    * Pablo will help once the problem is set
+
+## Discussions
+
+* Testing - Warewulf, others
+    * Sherif and Brian Phan synced on Warewulf testing
+    * Not *just* installability, upgrade path, etc
+    * What can we use? Multiple things, probably
+        * OpenQA? TMT? Zuul? Whatever OpenHPC uses?
+    * Testing team would also love to get more people involved and participating in building tests
+    * Example tests:
+        * Provision cluster
+        * nodes communicate
+        * etc
+    * Want: have full end to end testing of all components
+    * What tests do we want?
+        * Functional
+            * Create cluster
+            * Create user
+            * Submit job as user
+        * Future:
+            * Slurm accounting/dbd, others
+* Package tracking - PoI tracker
+    * Neil is looking into how we can integrate this
+* Wiki Updates - Neil and Sherif will work on this at FOSDEM/CentOS Connect
+    * This didn't really happen specifically, but discussions about ensuring Wikis are up to date did happen
+
+### Open Floor
+
+* N/A
+
+### Action Items
+
+* Sherif to build and release Intel driver
+* Sherif and Brian to work on defining tests that we want to run
+* Neil to work on package update notifications