From 3ccb7cc9ef8b85cadf4e6d3189dce8e83848ee3c Mon Sep 17 00:00:00 2001 From: Neil Hanlon Date: Thu, 14 Dec 2023 16:38:35 -0500 Subject: [PATCH] meeting notes 2023-12-14 --- docs/events/meeting-notes/2023-12-14.md | 161 ++++++++++++++++++++++++ 1 file changed, 161 insertions(+) create mode 100644 docs/events/meeting-notes/2023-12-14.md diff --git a/docs/events/meeting-notes/2023-12-14.md b/docs/events/meeting-notes/2023-12-14.md new file mode 100644 index 0000000..ba8c891 --- /dev/null +++ b/docs/events/meeting-notes/2023-12-14.md @@ -0,0 +1,161 @@ +# SIG/HPC meeting 2023-12-14 + +## Attendees: + * Sherif Nagy + * Neil Hanlon + * Matt Bidwell + * Rich Adams + * Chris Simmons + * Jeremy Siadal + +## Follow ups + +* No movement on wiki yet, maybe over break +* Cnode Kernel - no movement +* Compare spec files for Warewulf vs OpenHPC - Done! + * Thank you Sherif + * Building warewulf4 for rocky 8 and rocky 9 + * Can we keep this name w/ openhpc? + * Chris - next release will also rename warewulf to warewulf3 to distinguish +* Request SB Certs for HPC cnode kernel from Security - Requested + * Requested - No update +* Investigate what resources are required for testing - Neil + * Not done yet +* Create POC script to create tickets when new slurm is available - Neil + * No movement +* Sherif waiting to hear from Jeremy about Intel GPU drivers + * Should have heard from them. Jeremy will follow up with them to see what happened + +## Discussions + +* last meeting of 2023; skip Dec 28th. next meeting will be Jan 11 -- needs announcing + * Happy holidays! +* slurm naming - slurm22 / slurm23 / slurm24 + * slurm24 is coming out soon + * plan to support whatever schedmd is supporting -- two most recent releases +* Testing resources for SIG/HPC + * NVidia V100s - OK? + * Cannot test MIG (multi-instance GPU) with that device + * Sherif can take some of these after they decomm their current HPC, but not sure on timeframe + * Need a place to host these, maybe RESF can do something + * Neil is tracking this, to have better update in January +* chris is working on testing for different schedulers +* Warewulf -- OpenHPC needs to make naming more consistent + * will remove warewulf4 from builds once it's in Rocky and other openhpc distros + * Rocky not worrying about v3, openhpc will continue providing that +* Slurm / pmix support + * on for rocky 9 branch + * there is a pmix5, but ... it's broken. Chris is looking at this over holiday break + * rocky only has pmix 3.2, so if we need new features we may need to build and release in the SIG + * newer versions (4) are backwards compatible, or, are supposed to be + +## Action items + +* Sherif to finish/complete work on the wiki +* Decide what is being put into cnode kernel, what is being removed - Jeremy +* Request SB Certs for HPC cnode kernel from Security - Requested + * Requested +* Create POC script to create tickets when new slurm is available - Neil +* Change warewulf -> warewulf3 in next openhpc release - Chris +* Announce meeting cancelations for December - Neil/Sherif +* Look into building pmix4 for rocky and building slurm23.11 w/ pmix support - Sherif +* Follow up with Intel Driver team - Jeremy + +## Old business + +### 2023-11-30 + +* Sherif to finish/complete work on the wiki + * Not done +* Decide what is being put into cnode kernel, what is being removed - Jeremy + * No updates +* Request SB Certs for HPC cnode kernel from Security - Requested + * Requested +* Compare spec files for Warewulf vs OpenHPC - Sherif + * Not done yet +* Investigate what resources are required for testing - Neil + * Not done yet +* Create POC script to create tickets when new slurm is available - Neil + +### 2023-11-15 + +* Sherif to finish/complete work on the wiki + * Not done +* Sherif to add Jeremy and Chris to gitusers and sig_hpc - Done +* Decide what is being put into cnode kernel, what is being removed - Jeremy + * No updates +* Request SB Certs for HPC cnode kernel from Security - Requested + * Requested +* Compare spec files for Warewulf vs OpenHPC - Sherif + * Not done yet +* Investigate what resources are required for testing - Neil + * Not done yet + +### 2023-11-02 + * Sherif to work on abit on the wiki - Not done + * Sherif to add Jeremy and Chris to the git user groups + +### 2023-10-19: + * Sherif to create kernel repo for kernel HPC, kernel-hpc-node, called now kernel-cnode - Done - + * Jeermy, to get the ball rolling with intel GPU driver + * Stack, Fix the slurm rest daemon and integrated it with openQA + +### 2023-10-05: + * None for this meeting, however we should be working on old business action items + +### 2023-09-21: + * Sherif: Get the SIG for drivers + * Sherif: Check the names of nvidia drivers "open , dkms and closed source" + * Chris: Bench mark nvidia open vs closed source + +### 2023-09-07: + * Sherif: Reaching out to AI SIG to check on hosting nvida that drivers that CIQ would like to contribute - Done and waiting to hear from them - + +### 2023-08-24: + * Sherif: To push the testing repo file to release package + * Sherif: testing / merging the_real_swa scripts + +### 2023-08-10: + * Sherif: Looking into the openQA testing - Pending + +### 2023-07-27: + * Sherif: Reach out to jose-d about pmix - Done, no feedback yet - + * Greg: to reach out to openPBS and cloud charly + * Sherif: To update slurm23 to latest - Done - + +### 2023-07-13: + * Sherif needs to update the wiki - Done + * Sherif to look into MPI stack + * Chris will send Sherif a link with intro + +### 2023-06-29: + * Sherif release slurm23 sources - Done + * Stack and Sherif working on the HPC list + * Sherif email Jeremy, the slurm23 source URL - Done + +### 2023-06-15: + * Sherif to look int openHPC slurm spec file - Pending on Sherif + * We need to get lists of centres and HPC that are moving to Rocky to make a blog post and PR + +### 2023-06-01: + * Get a list of packages from Jeremy to pick up from openHPC - Done + * Greg / Sherif talk in Rocky / RESF about generic SIG for common packages such as chaintools + * Plan the openHPC demo Chris / Sherif - Done + * Finlise the slurm package with naming / configuration - Done + +### 2023-05-18: + * Get a demo / technical talk after 4 weeks "Sherif can arrange that with Chris" - Done + * Getting a list of packages that openHPC would like to move to distros "Jeremy will be point of contact if we need those in couple of weeks" - Done + +### 2023-05-04 + * Start building slurm - On going, a bit slowing down with R9.2 and R8.8 releases, however packages are built, some minor configurations needs to be fixed - + * Start building apptainer - on hold - + * Start building singulartiry - on hold - + * Start building warewulf - on hold - + * Sherif: check about forums - done, we can have our own section if we want, can be discussed over the chat - + +### 2023-04-20 + * Reach out to other communities “Greg” - on going - + * Reaching out for different sites that uses Rocky for HPC “Stack will ping few of them and others as well -Group effort-” + * Reaching out to hardware vendors - nothing done yet - + * Statistic / public registry for sites / HPC to add themselves if they want - nothing done yet -