meeting notes 2024-01-11 #23

Merged
neil merged 1 commits from meeting/2024-01-11 into main 2024-01-11 21:32:23 +00:00

@ -0,0 +1,170 @@
# SIG/HPC meeting 2024-01-11
## Attendees:
* Sherif Nagy
* Neil Hanlon
* Matt Bidwell
* Brian Phan
* Forrest Burt
## Follow ups
* Sherif to finish/complete work on the wiki
  * still working
* Decide what is being put into cnode kernel, what is being removed - Jeremy
  * still working
* Request SB Certs for HPC cnode kernel from Security - Requested
  * Requested
* Create POC script to create tickets when new slurm is available - Neil
  * Neil will work on it this month
* Change warewulf -> warewulf3 in next openhpc release - Chris
  * no update
* Announce meeting cancelations for December - Neil/Sherif
  * done
* Look into building pmix4 for rocky and building slurm23.11 w/ pmix support - Sherif
* Follow up with Intel Driver team - Jeremy
  * no updates on intel drivers
## Discussions
* slurm23 will have two packages for the different versions
  * slurm22 will probably be EoL by upstream
  * will create slurm23.11 package to differentiate from slurm, as slurm23.05 is stable
* Nvidia drivers - check on status of open vs closed ones, what can we distribute?
  * Intersection with SIG/AI
* Testing of HPC packages - work with Testing team
  * smoke tests, ensure clusters work, etc
  * Reach out to Sherif if interested in volunteering to work on this
* slurm / pmix5
  * there is interest in building against latest PMIX, but the latest (version 5) is broken
  * sounding like this is pretty widespread
  * no update on this just yet
## Action items
* Update wiki
  * Refine package list of what the SIG publishes, how to use them
  * Some packages are up for grabs, recruit folks to contribute
    * Maybe make tickets for these so people can claim them?
* Cnode Kernel - no movement
* Request SB Certs for HPC cnode kernel from Security - Requested
* Investigate what resources are required for testing - Neil
* Create POC script to create tickets when new slurm is available - Neil
* Sherif waiting to hear from Jeremy about Intel GPU drivers
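
As a starting point for the "create tickets when new slurm is available" POC, the core check could look roughly like the sketch below. Everything here is an assumption for illustration: the function names, the ticket title/body wording, and the version strings are hypothetical, and how upstream versions are fetched or how the ticket is actually filed is deliberately left out.

```python
import re


def parse_version(text):
    """Turn a string such as 'slurm-23.11.1' into a tuple like (23, 11, 1)."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    # A missing patch level (e.g. '23.05') is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def newer_releases(packaged, upstream):
    """Return the upstream version strings newer than the packaged one, oldest first."""
    current = parse_version(packaged)
    found = {parse_version(version): version for version in upstream}
    return [found[key] for key in sorted(found) if key > current]


def ticket_for(version):
    """Build a hypothetical ticket payload; the field names are placeholders."""
    return {
        "title": f"Package slurm {version}",
        "body": f"Upstream slurm {version} is available; update the SIG/HPC package.",
    }


if __name__ == "__main__":
    # Hypothetical inputs: what the SIG currently ships vs. what upstream lists.
    for version in newer_releases("23.05.2", ["23.02.7", "23.05.2", "23.11.1"]):
        print(ticket_for(version)["title"])
```

Keeping the comparison separate from fetching and ticket creation means the logic can be tested without network access, whichever ticket system ends up being used.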
## Old business
### 2023-12-14
* No movement on wiki yet, maybe over break
* Cnode Kernel - no movement
* Compare spec files for Warewulf vs OpenHPC - Done!
  * Thank you Sherif
* Building warewulf4 for rocky 8 and rocky 9
  * Can we keep this name w/ openhpc?
  * Chris - next release will also rename warewulf to warewulf3 to distinguish
* Request SB Certs for HPC cnode kernel from Security - Requested
  * Requested - No update
* Investigate what resources are required for testing - Neil
  * Not done yet
* Create POC script to create tickets when new slurm is available - Neil
  * No movement
* Sherif waiting to hear from Jeremy about Intel GPU drivers
  * Should have heard from them. Jeremy will follow up with them to see what happened
### 2023-11-30
* Sherif to finish/complete work on the wiki
  * Not done
* Decide what is being put into cnode kernel, what is being removed - Jeremy
  * No updates
* Request SB Certs for HPC cnode kernel from Security - Requested
  * Requested
* Compare spec files for Warewulf vs OpenHPC - Sherif
  * Not done yet
* Investigate what resources are required for testing - Neil
  * Not done yet
* Create POC script to create tickets when new slurm is available - Neil
### 2023-11-15
* Sherif to finish/complete work on the wiki
  * Not done
* Sherif to add Jeremy and Chris to gitusers and sig_hpc - Done
* Decide what is being put into cnode kernel, what is being removed - Jeremy
  * No updates
* Request SB Certs for HPC cnode kernel from Security - Requested
  * Requested
* Compare spec files for Warewulf vs OpenHPC - Sherif
  * Not done yet
* Investigate what resources are required for testing - Neil
  * Not done yet
### 2023-11-02
* Sherif to work a bit on the wiki - Not done
* Sherif to add Jeremy and Chris to the git user groups
### 2023-10-19:
* Sherif to create a repo for the HPC kernel (kernel-hpc-node, now called kernel-cnode) - Done -
* Jeremy to get the ball rolling with the Intel GPU driver
* Stack to fix the slurm REST daemon and integrate it with openQA
### 2023-10-05:
* None for this meeting, however we should be working on old business action items
### 2023-09-21:
* Sherif: Get the SIG for drivers
* Sherif: Check the names of nvidia drivers "open , dkms and closed source"
* Chris: Benchmark nvidia open vs closed source
### 2023-09-07:
* Sherif: Reaching out to AI SIG to check on hosting the nvidia drivers that CIQ would like to contribute - Done and waiting to hear from them -
### 2023-08-24:
* Sherif: To push the testing repo file to release package
* Sherif: testing / merging the_real_swa scripts
### 2023-08-10:
* Sherif: Looking into the openQA testing - Pending
### 2023-07-27:
* Sherif: Reach out to jose-d about pmix - Done, no feedback yet -
* Greg: to reach out to openPBS and cloud charly
* Sherif: To update slurm23 to latest - Done -
### 2023-07-13:
* Sherif needs to update the wiki - Done
* Sherif to look into MPI stack
* Chris will send Sherif a link with intro
### 2023-06-29:
* Sherif release slurm23 sources - Done
* Stack and Sherif working on the HPC list
* Sherif email Jeremy, the slurm23 source URL - Done
### 2023-06-15:
* Sherif to look into openHPC slurm spec file - Pending on Sherif
* We need to get lists of centres and HPC that are moving to Rocky to make a blog post and PR
### 2023-06-01:
* Get a list of packages from Jeremy to pick up from openHPC - Done
* Greg / Sherif talk in Rocky / RESF about generic SIG for common packages such as chaintools
* Plan the openHPC demo Chris / Sherif - Done
* Finalise the slurm package with naming / configuration - Done
### 2023-05-18:
* Get a demo / technical talk after 4 weeks "Sherif can arrange that with Chris" - Done
* Getting a list of packages that openHPC would like to move to distros "Jeremy will be the point of contact if we need those in a couple of weeks" - Done
### 2023-05-04
* Start building slurm - Ongoing, slowed down a bit by the R9.2 and R8.8 releases; packages are built, but some minor configuration needs to be fixed -
* Start building apptainer - on hold -
* Start building singularity - on hold -
* Start building warewulf - on hold -
* Sherif: check about forums - done, we can have our own section if we want, can be discussed over the chat -
### 2023-04-20
* Reach out to other communities “Greg” - on going -
* Reaching out to different sites that use Rocky for HPC “Stack will ping a few of them and others as well -Group effort-”
* Reaching out to hardware vendors - nothing done yet -
* Statistic / public registry for sites / HPC to add themselves if they want - nothing done yet -