Merge pull request 'add notes for 2023-11-15' (#20) from meeting-nov15 into main
All checks were successful
mkdocs build / build (push) Successful in 29s

Reviewed-on: #20
This commit is contained in:
Neil Hanlon 2023-11-15 21:49:36 +00:00
commit 3cd70d1bb6

View File

@ -0,0 +1,125 @@
# SIG/HPC meeting 2023-11-15
## Attendees:
* Sherif Nagy
* Neil Hanlon
* Chris McGuire
* Jeremy Siadal - Intel
* Filip Hans Polbratt - NSC
* Dr. Chris Simmons - MGHPCC
## Discussions:
* Jeremy - Intel group quite interested in integrating with Rocky
* Likely internal team will provide drivers, but Jeremy will handle the spec file
* Kernel cnode - repo is created and should be ready
* Should be based off Rocky's base kernel, not upstream ML/LTS
* MOS - Multi Operating System
* LWK - Lightweight Kernel Project
* Will need to patch into the scheduler
* also will strip out everything that's not needed
* what could be put directly in the kernel instead of modules - could also eliminate initrd if needed
* Secureboot?
* This should be OK to do, but we need to make sure it's still compliant with shim
* we are now separating the SB certs by SIG, so we will need to request certs from Security
* MOS - some patches in the scheduler are not applying cleanly due to changes in scheduler code
* CIQ might be able to help investigate the 6 patches
* Intel GPU Driver - Neil didn't catch this
* Warewulf
* Spec files exist in the github for warewulf
* Sherif to investigate
* Testing
* Currently - Testing team handles this manually at this time
* Jeremy would like to have our own test harness
* Suggest TMT for maximum cross polination - Testing team should be working on that
* Neil can work to provide infrastructure for testing
* Must be at least 1 hardware system (x86) with at least two GPUs (Intel)
* Need x86 AND arm for Nvidia - A30, etc - 4x cards
* need infiniband (and SR/IOV support, and cross talk)
* Links
* https://tmt.readthedocs.io/en/latest/spec.html
* https://docs.fedoraproject.org/en-US/ci/tmt/
* Cloud RDMA - GCP
* Send this to Kernel SIG :)
## Action items:
* Sherif to finish/complete work on the wiki
* Sherif to add Jeremy and Chris to gitusers and sig_hpc
* Neil added ohpcsim and jcsiadal to gitusers. Sherif will add to gitusers
* Decide what is being put into cnode kernel, what is being removed - Jeremy
* Request SB Certs for HPC cnode kernel from Security
* Compare spec files for Warewulf vs OpenHPC - Sherif
* Investigate what resources are required for testing - Neil
## Old business:
## 2023-11-02
* Sherif to work on abit on the wiki - Not done
* Sherif to add Jeremy and Chris to the git user groups
## 2023-10-19:
* Sherif to create kernel repo for kernel HPC, kernel-hpc-node, called now kernel-cnode - Done -
* Jeermy, to get the ball rolling with intel GPU driver
* Stack, Fix the slurm rest daemon and integrated it with openQA
## 2023-10-05:
* None for this meeting, however we should be working on old business action items
## 2023-09-21:
* Sherif: Get the SIG for drivers
* Sherif: Check the names of nvidia drivers "open , dkms and closed source"
* Chris: Bench mark nvidia open vs closed source
## 2023-09-07:
* Sherif: Reaching out to AI SIG to check on hosting nvida that drivers that CIQ would like to contribute - Done and waiting to hear from them -
## 2023-08-24:
* Sherif: To push the testing repo file to release package
* Sherif: testing / merging the_real_swa scripts
## 2023-08-10:
* Sherif: Looking into the openQA testing - Pending
## 2023-07-27:
* Sherif: Reach out to jose-d about pmix - Done, no feedback yet -
* Greg: to reach out to openPBS and cloud charly
* Sherif: To update slurm23 to latest - Done -
## 2023-07-13:
* Sherif needs to update the wiki - Done
* Sherif to look into MPI stack
* Chris will send Sherif a link with intro
## 2023-06-29:
* Sherif release slurm23 sources - Done
* Stack and Sherif working on the HPC list
* Sherif email Jeremy, the slurm23 source URL - Done
## 2023-06-15:
* Sherif to look int openHPC slurm spec file - Pending on Sherif
* We need to get lists of centres and HPC that are moving to Rocky to make a blog post and PR
## 2023-06-01:
* Get a list of packages from Jeremy to pick up from openHPC - Done
* Greg / Sherif talk in Rocky / RESF about generic SIG for common packages such as chaintools
* Plan the openHPC demo Chris / Sherif - Done
* Finlise the slurm package with naming / configuration - Done
## 2023-05-18:
* Get a demo / technical talk after 4 weeks "Sherif can arrange that with Chris" - Done
* Getting a list of packages that openHPC would like to move to distros "Jeremy will be point of contact if we need those in couple of weeks" - Done
## 2023-05-04
* Start building slurm - On going, a bit slowing down with R9.2 and R8.8 releases, however packages are built, some minor configurations needs to be fixed -
* Start building apptainer - on hold -
* Start building singulartiry - on hold -
* Start building warewulf - on hold -
* Sherif: check about forums - done, we can have our own section if we want, can be discussed over the chat -
## 2023-04-20
* Reach out to other communities “Greg” - on going -
* Reaching out for different sites that uses Rocky for HPC “Stack will ping few of them and others as well -Group effort-”
* Reaching out to hardware vendors - nothing done yet -
* Statistic / public registry for sites / HPC to add themselves if they want - nothing done yet -