From cfa0e4d7145e83354dd429331c56866e27a3aadb Mon Sep 17 00:00:00 2001 From: Neil Hanlon Date: Wed, 15 Nov 2023 14:48:12 -0700 Subject: [PATCH] add notes for 2023-11-15 --- docs/events/meeting-notes/2023-11-15.md | 125 ++++++++++++++++++++++++ 1 file changed, 125 insertions(+) create mode 100644 docs/events/meeting-notes/2023-11-15.md diff --git a/docs/events/meeting-notes/2023-11-15.md b/docs/events/meeting-notes/2023-11-15.md new file mode 100644 index 0000000..7b7d3f3 --- /dev/null +++ b/docs/events/meeting-notes/2023-11-15.md @@ -0,0 +1,125 @@ +# SIG/HPC meeting 2023-11-15 + +## Attendees: + * Sherif Nagy + * Neil Hanlon + * Chris McGuire + * Jeremy Siadal - Intel + * Filip Hans Polbratt - NSC + * Dr. Chris Simmons - MGHPCC + +## Discussions: + +* Jeremy - Intel group quite interested in integrating with Rocky + * Likely internal team will provide drivers, but Jeremy will handle the spec file + +* Kernel cnode - repo is created and should be ready + * Should be based off Rocky's base kernel, not upstream ML/LTS + * MOS - Multi Operating System + * LWK - Lightweight Kernel Project + * Will need to patch into the scheduler + * also will strip out everything that's not needed + * what could be put directly in the kernel instead of modules - could also eliminate initrd if needed + * Secureboot? + * This should be OK to do, but we need to make sure it's still compliant with shim + * we are now separating the SB certs by SIG, so we will need to request certs from Security + * MOS - some patches in the scheduler are not applying cleanly due to changes in scheduler code + * CIQ might be able to help investigate the 6 patches +* Intel GPU Driver - Neil didn't catch this +* Warewulf + * Spec files exist in the github for warewulf + * Sherif to investigate +* Testing + * Currently - Testing team handles this manually at this time + * Jeremy would like to have our own test harness + * Suggest TMT for maximum cross polination - Testing team should be working on that + * Neil can work to provide infrastructure for testing + * Must be at least 1 hardware system (x86) with at least two GPUs (Intel) + * Need x86 AND arm for Nvidia - A30, etc - 4x cards + * need infiniband (and SR/IOV support, and cross talk) + * Links + * https://tmt.readthedocs.io/en/latest/spec.html + * https://docs.fedoraproject.org/en-US/ci/tmt/ +* Cloud RDMA - GCP + * Send this to Kernel SIG :) + +## Action items: + +* Sherif to finish/complete work on the wiki +* Sherif to add Jeremy and Chris to gitusers and sig_hpc + * Neil added ohpcsim and jcsiadal to gitusers. Sherif will add to gitusers +* Decide what is being put into cnode kernel, what is being removed - Jeremy +* Request SB Certs for HPC cnode kernel from Security +* Compare spec files for Warewulf vs OpenHPC - Sherif +* Investigate what resources are required for testing - Neil + +## Old business: + +## 2023-11-02 + * Sherif to work on abit on the wiki - Not done + * Sherif to add Jeremy and Chris to the git user groups + +## 2023-10-19: + * Sherif to create kernel repo for kernel HPC, kernel-hpc-node, called now kernel-cnode - Done - + * Jeermy, to get the ball rolling with intel GPU driver + * Stack, Fix the slurm rest daemon and integrated it with openQA + +## 2023-10-05: + * None for this meeting, however we should be working on old business action items + +## 2023-09-21: + * Sherif: Get the SIG for drivers + * Sherif: Check the names of nvidia drivers "open , dkms and closed source" + * Chris: Bench mark nvidia open vs closed source + +## 2023-09-07: + * Sherif: Reaching out to AI SIG to check on hosting nvida that drivers that CIQ would like to contribute - Done and waiting to hear from them - + +## 2023-08-24: + * Sherif: To push the testing repo file to release package + * Sherif: testing / merging the_real_swa scripts + +## 2023-08-10: + * Sherif: Looking into the openQA testing - Pending + +## 2023-07-27: + * Sherif: Reach out to jose-d about pmix - Done, no feedback yet - + * Greg: to reach out to openPBS and cloud charly + * Sherif: To update slurm23 to latest - Done - + +## 2023-07-13: + * Sherif needs to update the wiki - Done + * Sherif to look into MPI stack + * Chris will send Sherif a link with intro + +## 2023-06-29: + * Sherif release slurm23 sources - Done + * Stack and Sherif working on the HPC list + * Sherif email Jeremy, the slurm23 source URL - Done + +## 2023-06-15: + * Sherif to look int openHPC slurm spec file - Pending on Sherif + * We need to get lists of centres and HPC that are moving to Rocky to make a blog post and PR + +## 2023-06-01: + * Get a list of packages from Jeremy to pick up from openHPC - Done + * Greg / Sherif talk in Rocky / RESF about generic SIG for common packages such as chaintools + * Plan the openHPC demo Chris / Sherif - Done + * Finlise the slurm package with naming / configuration - Done + +## 2023-05-18: + * Get a demo / technical talk after 4 weeks "Sherif can arrange that with Chris" - Done + * Getting a list of packages that openHPC would like to move to distros "Jeremy will be point of contact if we need those in couple of weeks" - Done + +## 2023-05-04 + * Start building slurm - On going, a bit slowing down with R9.2 and R8.8 releases, however packages are built, some minor configurations needs to be fixed - + * Start building apptainer - on hold - + * Start building singulartiry - on hold - + * Start building warewulf - on hold - + * Sherif: check about forums - done, we can have our own section if we want, can be discussed over the chat - + +## 2023-04-20 + * Reach out to other communities “Greg” - on going - + * Reaching out for different sites that uses Rocky for HPC “Stack will ping few of them and others as well -Group effort-” + * Reaching out to hardware vendors - nothing done yet - + * Statistic / public registry for sites / HPC to add themselves if they want - nothing done yet -