# SIG/HPC meeting 2023-11-15
## Attendees:
* Sherif Nagy
* Neil Hanlon
* Chris McGuire
* Jeremy Siadal - Intel
* Filip Hans Polbratt - NSC
* Dr. Chris Simmons - MGHPCC
## Discussions:
* Jeremy - Intel group quite interested in integrating with Rocky
* Likely internal team will provide drivers, but Jeremy will handle the spec file
* Kernel cnode - repo is created and should be ready
* Should be based off Rocky's base kernel, not upstream ML/LTS
* MOS - Multi Operating System
* LWK - Lightweight Kernel Project
* Will need to patch into the scheduler
* also will strip out everything that's not needed
* what could be put directly in the kernel instead of modules - could also eliminate initrd if needed
* Secureboot?
* This should be OK to do, but we need to make sure it's still compliant with shim
* we are now separating the SB certs by SIG, so we will need to request certs from Security
* MOS - some patches in the scheduler are not applying cleanly due to changes in scheduler code
* CIQ might be able to help investigate the 6 patches
* Intel GPU Driver - Neil didn't catch this
* Warewulf
* Spec files exist in the github for warewulf
* Sherif to investigate
* Testing
* Currently, the Testing team handles this manually
* Jeremy would like to have our own test harness
* Suggest TMT for maximum cross-pollination - Testing team should be working on that
* Neil can work to provide infrastructure for testing
* Must be at least 1 hardware system (x86) with at least two GPUs (Intel)
* Need x86 AND arm for Nvidia - A30, etc - 4x cards
* Need InfiniBand (and SR-IOV support, and cross talk)
* Links
* https://tmt.readthedocs.io/en/latest/spec.html
* https://docs.fedoraproject.org/en-US/ci/tmt/
* Cloud RDMA - GCP
* Send this to Kernel SIG :)
## Action items:
* Sherif to finish/complete work on the wiki
* Sherif to add Jeremy and Chris to gitusers and sig_hpc
* Neil added ohpcsim and jcsiadal to gitusers. Sherif will add to gitusers
* Decide what is being put into cnode kernel, what is being removed - Jeremy
* Request SB Certs for HPC cnode kernel from Security
* Compare spec files for Warewulf vs OpenHPC - Sherif
* Investigate what resources are required for testing - Neil
## Old business:
## 2023-11-02
* Sherif to work a bit on the wiki - Not done
* Sherif to add Jeremy and Chris to the git user groups
## 2023-10-19:
* Sherif to create kernel repo for kernel HPC, kernel-hpc-node, now called kernel-cnode - Done -
* Jeremy to get the ball rolling with the Intel GPU driver
* Stack to fix the slurm REST daemon and integrate it with openQA
## 2023-10-05:
* None for this meeting, however we should be working on old business action items
## 2023-09-21:
* Sherif: Get the SIG for drivers
* Sherif: Check the names of nvidia drivers "open, dkms and closed source"
* Chris: Benchmark nvidia open vs closed source
## 2023-09-07:
* Sherif: Reaching out to AI SIG to check on hosting the nvidia drivers that CIQ would like to contribute - Done and waiting to hear from them -
## 2023-08-24:
* Sherif: To push the testing repo file to release package
* Sherif: testing / merging the_real_swa scripts
## 2023-08-10:
* Sherif: Looking into the openQA testing - Pending
## 2023-07-27:
* Sherif: Reach out to jose-d about pmix - Done, no feedback yet -
* Greg: to reach out to openPBS and cloud charly
* Sherif: To update slurm23 to latest - Done -
## 2023-07-13:
* Sherif needs to update the wiki - Done
* Sherif to look into MPI stack
* Chris will send Sherif a link with intro
## 2023-06-29:
* Sherif release slurm23 sources - Done
* Stack and Sherif working on the HPC list
* Sherif email Jeremy, the slurm23 source URL - Done
## 2023-06-15:
* Sherif to look into openHPC slurm spec file - Pending on Sherif
* We need to get lists of centres and HPC that are moving to Rocky to make a blog post and PR
## 2023-06-01:
* Get a list of packages from Jeremy to pick up from openHPC - Done
* Greg / Sherif talk in Rocky / RESF about generic SIG for common packages such as chaintools
* Plan the openHPC demo Chris / Sherif - Done
* Finalise the slurm package with naming / configuration - Done
## 2023-05-18:
* Get a demo / technical talk after 4 weeks "Sherif can arrange that with Chris" - Done
* Getting a list of packages that openHPC would like to move to distros "Jeremy will be point of contact if we need those in couple of weeks" - Done
## 2023-05-04
* Start building slurm - Ongoing, slowed down a bit by the R9.2 and R8.8 releases; however, packages are built and some minor configuration needs to be fixed -
* Start building apptainer - on hold -
* Start building singularity - on hold -
* Start building warewulf - on hold -
* Sherif: check about forums - done, we can have our own section if we want, can be discussed over the chat -
## 2023-04-20
* Reach out to other communities “Greg” - ongoing -
* Reaching out to different sites that use Rocky for HPC “Stack will ping a few of them and others as well - group effort”
* Reaching out to hardware vendors - nothing done yet -
* Statistic / public registry for sites / HPC to add themselves if they want - nothing done yet -

# SIG/HPC meeting 2023-11-30
## Attendees:
* Sherif Nagy
* Neil Hanlon
* Matt Bidwell
## Discussions:
* Testing infrastructure
* We can get a couple of graphics cards donated from ICHEC; needs to wait for decommissioning
* Neil will follow up and start creating a plan
* Automation for slurm and other packages
* Create a script to create a ticket when an update is available for a package
* Query the release-monitoring.org API for available updates (see the sketch after this list)
* No updates on cnode kernel yet
* Question on EoL/unsupported artifacts -- would we remove RPM/sources which we know have security vulnerabilities?
* SchedMD does pull the source and RPMs for bad versions, for example
* A: we don't really have any obligation to remove old artifacts that might have very vulnerable code, as we can't control anything about the user's system beyond providing the latest, fixed artifacts
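
A minimal sketch of the update-check script discussed above, assuming the release-monitoring.org (Anitya) v2 API: query a project by name, compare the reported version against what the SIG ships, and flag a mismatch. The endpoint, the `items`/`version` field names, and the packaged version are assumptions to verify; a real proof of concept would open a ticket instead of printing.

```python
#!/usr/bin/env python3
"""Rough sketch: check release-monitoring.org (Anitya) for a newer slurm release."""
import json
import urllib.request

# Endpoint and field names reflect the Anitya v2 API as understood here; verify before use.
ANITYA_URL = "https://release-monitoring.org/api/v2/projects/?name={name}"
PACKAGED = {"slurm": "23.02.6"}  # placeholder: version currently shipped by the SIG


def latest_upstream(name: str):
    """Return the latest upstream version Anitya reports for a project, or None."""
    with urllib.request.urlopen(ANITYA_URL.format(name=name)) as resp:
        data = json.load(resp)
    items = data.get("items", [])
    return items[0].get("version") if items else None


def main():
    for name, packaged in PACKAGED.items():
        upstream = latest_upstream(name)
        if upstream and upstream != packaged:
            # A real script would file a ticket here; printing stands in for that step.
            print(f"{name}: packaged {packaged}, upstream {upstream} -> file a ticket")


if __name__ == "__main__":
    main()
```
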
## Action items:
* Sherif to finish/complete work on the wiki
* Not done
* Decide what is being put into cnode kernel, what is being removed - Jeremy
* No updates
* Request SB Certs for HPC cnode kernel from Security - Requested
* Requested
* Compare spec files for Warewulf vs OpenHPC - Sherif
* Not done yet
* Investigate what resources are required for testing - Neil
* Not done yet
* Create POC script to create tickets when new slurm is available - Neil
## Old business:
## 2023-11-15
* Sherif to finish/complete work on the wiki
* Not done
* Sherif to add Jeremy and Chris to gitusers and sig_hpc - Done
* Decide what is being put into cnode kernel, what is being removed - Jeremy
* No updates
* Request SB Certs for HPC cnode kernel from Security - Requested
* Requested
* Compare spec files for Warewulf vs OpenHPC - Sherif
* Not done yet
* Investigate what resources are required for testing - Neil
* Not done yet
## 2023-11-02
* Sherif to work a bit on the wiki - Not done
* Sherif to add Jeremy and Chris to the git user groups
## 2023-10-19:
* Sherif to create kernel repo for kernel HPC, kernel-hpc-node, now called kernel-cnode - Done -
* Jeremy to get the ball rolling with the Intel GPU driver
* Stack to fix the slurm REST daemon and integrate it with openQA
## 2023-10-05:
* None for this meeting, however we should be working on old business action items
## 2023-09-21:
* Sherif: Get the SIG for drivers
* Sherif: Check the names of nvidia drivers "open, dkms and closed source"
* Chris: Benchmark nvidia open vs closed source
## 2023-09-07:
* Sherif: Reaching out to AI SIG to check on hosting the nvidia drivers that CIQ would like to contribute - Done and waiting to hear from them -
## 2023-08-24:
* Sherif: To push the testing repo file to release package
* Sherif: testing / merging the_real_swa scripts
## 2023-08-10:
* Sherif: Looking into the openQA testing - Pending
## 2023-07-27:
* Sherif: Reach out to jose-d about pmix - Done, no feedback yet -
* Greg: to reach out to openPBS and cloud charly
* Sherif: To update slurm23 to latest - Done -
## 2023-07-13:
* Sherif needs to update the wiki - Done
* Sherif to look into MPI stack
* Chris will send Sherif a link with intro
## 2023-06-29:
* Sherif release slurm23 sources - Done
* Stack and Sherif working on the HPC list
* Sherif email Jeremy, the slurm23 source URL - Done
## 2023-06-15:
* Sherif to look into openHPC slurm spec file - Pending on Sherif
* We need to get lists of centres and HPC that are moving to Rocky to make a blog post and PR
## 2023-06-01:
* Get a list of packages from Jeremy to pick up from openHPC - Done
* Greg / Sherif talk in Rocky / RESF about generic SIG for common packages such as chaintools
* Plan the openHPC demo Chris / Sherif - Done
* Finalise the slurm package with naming / configuration - Done
## 2023-05-18:
* Get a demo / technical talk after 4 weeks "Sherif can arrange that with Chris" - Done
* Getting a list of packages that openHPC would like to move to distros "Jeremy will be point of contact if we need those in couple of weeks" - Done
## 2023-05-04
* Start building slurm - Ongoing, slowed down a bit by the R9.2 and R8.8 releases; however, packages are built and some minor configuration needs to be fixed -
* Start building apptainer - on hold -
* Start building singularity - on hold -
* Start building warewulf - on hold -
* Sherif: check about forums - done, we can have our own section if we want, can be discussed over the chat -
## 2023-04-20
* Reach out to other communities “Greg” - ongoing -
* Reaching out to different sites that use Rocky for HPC “Stack will ping a few of them and others as well - group effort”
* Reaching out to hardware vendors - nothing done yet -
* Statistic / public registry for sites / HPC to add themselves if they want - nothing done yet -

# SIG/HPC meeting 2023-12-14
## Attendees:
* Sherif Nagy
* Neil Hanlon
* Matt Bidwell
* Rich Adams
* Chris Simmons
* Jeremy Siadal
## Follow ups
* No movement on wiki yet, maybe over break
* Cnode Kernel - no movement
* Compare spec files for Warewulf vs OpenHPC - Done!
* Thank you Sherif
* Building warewulf4 for rocky 8 and rocky 9
* Can we keep this name w/ openhpc?
* Chris - next release will also rename warewulf to warewulf3 to distinguish
* Request SB Certs for HPC cnode kernel from Security - Requested
* Requested - No update
* Investigate what resources are required for testing - Neil
* Not done yet
* Create POC script to create tickets when new slurm is available - Neil
* No movement
* Sherif waiting to hear from Jeremy about Intel GPU drivers
* Should have heard from them. Jeremy will follow up with them to see what happened
## Discussions
* last meeting of 2023; skip Dec 28th. next meeting will be Jan 11 -- needs announcing
* Happy holidays!
* slurm naming - slurm22 / slurm23 / slurm24
* slurm24 is coming out soon
* plan to support whatever schedmd is supporting -- two most recent releases
* Testing resources for SIG/HPC
* NVidia V100s - OK?
* Cannot test MIG (multi-instance GPU) with that device
* Sherif can take some of these after they decomm their current HPC, but not sure on timeframe
* Need a place to host these, maybe RESF can do something
* Neil is tracking this, to have better update in January
* chris is working on testing for different schedulers
* Warewulf -- OpenHPC needs to make naming more consistent
* will remove warewulf4 from builds once it's in Rocky and other openhpc distros
* Rocky not worrying about v3, openhpc will continue providing that
* Slurm / pmix support
* on for rocky 9 branch
* there is a pmix5, but ... it's broken. Chris is looking at this over holiday break
* rocky only has pmix 3.2, so if we need new features we may need to build and release in the SIG
* newer versions (4) are backwards compatible, or, are supposed to be
## Action items
* Sherif to finish/complete work on the wiki
* Decide what is being put into cnode kernel, what is being removed - Jeremy
* Request SB Certs for HPC cnode kernel from Security - Requested
* Requested
* Create POC script to create tickets when new slurm is available - Neil
* Change warewulf -> warewulf3 in next openhpc release - Chris
* Announce meeting cancellations for December - Neil/Sherif
* Look into building pmix4 for rocky and building slurm23.11 w/ pmix support - Sherif
* Follow up with Intel Driver team - Jeremy
## Old business
### 2023-11-30
* Sherif to finish/complete work on the wiki
* Not done
* Decide what is being put into cnode kernel, what is being removed - Jeremy
* No updates
* Request SB Certs for HPC cnode kernel from Security - Requested
* Requested
* Compare spec files for Warewulf vs OpenHPC - Sherif
* Not done yet
* Investigate what resources are required for testing - Neil
* Not done yet
* Create POC script to create tickets when new slurm is available - Neil
### 2023-11-15
* Sherif to finish/complete work on the wiki
* Not done
* Sherif to add Jeremy and Chris to gitusers and sig_hpc - Done
* Decide what is being put into cnode kernel, what is being removed - Jeremy
* No updates
* Request SB Certs for HPC cnode kernel from Security - Requested
* Requested
* Compare spec files for Warewulf vs OpenHPC - Sherif
* Not done yet
* Investigate what resources are required for testing - Neil
* Not done yet
### 2023-11-02
* Sherif to work a bit on the wiki - Not done
* Sherif to add Jeremy and Chris to the git user groups
### 2023-10-19:
* Sherif to create kernel repo for kernel HPC, kernel-hpc-node, now called kernel-cnode - Done -
* Jeremy to get the ball rolling with the Intel GPU driver
* Stack to fix the slurm REST daemon and integrate it with openQA
### 2023-10-05:
* None for this meeting, however we should be working on old business action items
### 2023-09-21:
* Sherif: Get the SIG for drivers
* Sherif: Check the names of nvidia drivers "open, dkms and closed source"
* Chris: Benchmark nvidia open vs closed source
### 2023-09-07:
* Sherif: Reaching out to AI SIG to check on hosting the nvidia drivers that CIQ would like to contribute - Done and waiting to hear from them -
### 2023-08-24:
* Sherif: To push the testing repo file to release package
* Sherif: testing / merging the_real_swa scripts
### 2023-08-10:
* Sherif: Looking into the openQA testing - Pending
### 2023-07-27:
* Sherif: Reach out to jose-d about pmix - Done, no feedback yet -
* Greg: to reach out to openPBS and cloud charly
* Sherif: To update slurm23 to latest - Done -
### 2023-07-13:
* Sherif needs to update the wiki - Done
* Sherif to look into MPI stack
* Chris will send Sherif a link with intro
### 2023-06-29:
* Sherif release slurm23 sources - Done
* Stack and Sherif working on the HPC list
* Sherif email Jeremy, the slurm23 source URL - Done
### 2023-06-15:
* Sherif to look into openHPC slurm spec file - Pending on Sherif
* We need to get lists of centres and HPC that are moving to Rocky to make a blog post and PR
### 2023-06-01:
* Get a list of packages from Jeremy to pick up from openHPC - Done
* Greg / Sherif talk in Rocky / RESF about generic SIG for common packages such as chaintools
* Plan the openHPC demo Chris / Sherif - Done
* Finalise the slurm package with naming / configuration - Done
### 2023-05-18:
* Get a demo / technical talk after 4 weeks "Sherif can arrange that with Chris" - Done
* Getting a list of packages that openHPC would like to move to distros "Jeremy will be point of contact if we need those in couple of weeks" - Done
### 2023-05-04
* Start building slurm - Ongoing, slowed down a bit by the R9.2 and R8.8 releases; however, packages are built and some minor configuration needs to be fixed -
* Start building apptainer - on hold -
* Start building singularity - on hold -
* Start building warewulf - on hold -
* Sherif: check about forums - done, we can have our own section if we want, can be discussed over the chat -
### 2023-04-20
* Reach out to other communities “Greg” - ongoing -
* Reaching out to different sites that use Rocky for HPC “Stack will ping a few of them and others as well - group effort”
* Reaching out to hardware vendors - nothing done yet -
* Statistic / public registry for sites / HPC to add themselves if they want - nothing done yet -

# SIG/HPC meeting 2024-01-11
## Attendees:
* Sherif Nagy
* Neil Hanlon
* Matt Bidwell
* Brian Phan
* Forrest Burt
## Follow ups
* Sherif to finish/complete work on the wiki
* still working
* Decide what is being put into cnode kernel, what is being removed - Jeremy
* still working
* Request SB Certs for HPC cnode kernel from Security - Requested
* Requested
* Create POC script to create tickets when new slurm is available - Neil
* Neil will work on it this month
* Change warewulf -> warewulf3 in next openhpc release - Chris
* no update
* Announce meeting cancellations for December - Neil/Sherif
* done
* Look into building pmix4 for rocky and building slurm23.11 w/ pmix support - Sherif
* Follow up with Intel Driver team - Jeremy
* no updates on intel drivers
## Discussions
* slurm23 will have two packages for the different versions
* slurm22 will probably be EoL by upstream
* will create a slurm23.11 package to differentiate from slurm, as slurm23.05 is stable
* Nvidia drivers - check on status of open vs closed ones, what can we distribute?
* Intersection with SIG/AI
* Testing of HPC packages - work with Testing team
* smoke tests, ensure clusters work, etc
* Reach out to Sherif if interested in volunteering to work on this
* slurm / pmix5
* there is interest in building against latest PMIX, but the latest (version 5) is broken
* sounding like this is pretty widespread
* no update on this just yet
## Action items
* Update wiki
* Refine package list of what the SIG publishes, how to use them
* Some packages are up for grabs, recruit folks to contribute
* Maybe make tickets for these so people can claim them?
* Cnode Kernel - no movement
* Request SB Certs for HPC cnode kernel from Security - Requested
* Investigate what resources are required for testing - Neil
* Create POC script to create tickets when new slurm is available - Neil
* Sherif waiting to hear from Jeremy about Intel GPU drivers
## Old business
### 2023-12-14
* No movement on wiki yet, maybe over break
* Cnode Kernel - no movement
* Compare spec files for Warewulf vs OpenHPC - Done!
* Thank you Sherif
* Building warewulf4 for rocky 8 and rocky 9
* Can we keep this name w/ openhpc?
* Chris - next release will also rename warewulf to warewulf3 to distinguish
* Request SB Certs for HPC cnode kernel from Security - Requested
* Requested - No update
* Investigate what resources are required for testing - Neil
* Not done yet
* Create POC script to create tickets when new slurm is available - Neil
* No movement
* Sherif waiting to hear from Jeremy about Intel GPU drivers
* Should have heard from them. Jeremy will follow up with them to see what happened
### 2023-11-30
* Sherif to finish/complete work on the wiki
* Not done
* Decide what is being put into cnode kernel, what is being removed - Jeremy
* No updates
* Request SB Certs for HPC cnode kernel from Security - Requested
* Requested
* Compare spec files for Warewulf vs OpenHPC - Sherif
* Not done yet
* Investigate what resources are required for testing - Neil
* Not done yet
* Create POC script to create tickets when new slurm is available - Neil
### 2023-11-15
* Sherif to finish/complete work on the wiki
* Not done
* Sherif to add Jeremy and Chris to gitusers and sig_hpc - Done
* Decide what is being put into cnode kernel, what is being removed - Jeremy
* No updates
* Request SB Certs for HPC cnode kernel from Security - Requested
* Requested
* Compare spec files for Warewulf vs OpenHPC - Sherif
* Not done yet
* Investigate what resources are required for testing - Neil
* Not done yet
### 2023-11-02
* Sherif to work a bit on the wiki - Not done
* Sherif to add Jeremy and Chris to the git user groups
### 2023-10-19:
* Sherif to create kernel repo for kernel HPC, kernel-hpc-node, now called kernel-cnode - Done -
* Jeremy to get the ball rolling with the Intel GPU driver
* Stack to fix the slurm REST daemon and integrate it with openQA
### 2023-10-05:
* None for this meeting, however we should be working on old business action items
### 2023-09-21:
* Sherif: Get the SIG for drivers
* Sherif: Check the names of nvidia drivers "open, dkms and closed source"
* Chris: Benchmark nvidia open vs closed source
### 2023-09-07:
* Sherif: Reaching out to AI SIG to check on hosting the nvidia drivers that CIQ would like to contribute - Done and waiting to hear from them -
### 2023-08-24:
* Sherif: To push the testing repo file to release package
* Sherif: testing / merging the_real_swa scripts
### 2023-08-10:
* Sherif: Looking into the openQA testing - Pending
### 2023-07-27:
* Sherif: Reach out to jose-d about pmix - Done, no feedback yet -
* Greg: to reach out to openPBS and cloud charly
* Sherif: To update slurm23 to latest - Done -
### 2023-07-13:
* Sherif needs to update the wiki - Done
* Sherif to look into MPI stack
* Chris will send Sherif a link with intro
### 2023-06-29:
* Sherif release slurm23 sources - Done
* Stack and Sherif working on the HPC list
* Sherif email Jeremy, the slurm23 source URL - Done
### 2023-06-15:
* Sherif to look into openHPC slurm spec file - Pending on Sherif
* We need to get lists of centres and HPC that are moving to Rocky to make a blog post and PR
### 2023-06-01:
* Get a list of packages from Jeremy to pick up from openHPC - Done
* Greg / Sherif talk in Rocky / RESF about generic SIG for common packages such as chaintools
* Plan the openHPC demo Chris / Sherif - Done
* Finalise the slurm package with naming / configuration - Done
### 2023-05-18:
* Get a demo / technical talk after 4 weeks "Sherif can arrange that with Chris" - Done
* Getting a list of packages that openHPC would like to move to distros "Jeremy will be point of contact if we need those in couple of weeks" - Done
### 2023-05-04
* Start building slurm - Ongoing, slowed down a bit by the R9.2 and R8.8 releases; however, packages are built and some minor configuration needs to be fixed -
* Start building apptainer - on hold -
* Start building singularity - on hold -
* Start building warewulf - on hold -
* Sherif: check about forums - done, we can have our own section if we want, can be discussed over the chat -
### 2023-04-20
* Reach out to other communities “Greg” - ongoing -
* Reaching out to different sites that use Rocky for HPC “Stack will ping a few of them and others as well - group effort”
* Reaching out to hardware vendors - nothing done yet -
* Statistic / public registry for sites / HPC to add themselves if they want - nothing done yet -

# SIG/HPC Meeting 2024-01-25
## Attendees
* Sherif Nagy
* Neil Hanlon
* Forrest Burt
* Chris Simmons
## Follow Ups
* Packages
* Slurm23.11 - In staging, needs testing
* This gives us slurm22, slurm23 (which is 23.05), and slurm23.11
* Built with UCX on all except s390x (as UCX is not built for s390x)
* Warewulf4 - published
* Thank you Brian Phan for testing this!
* Lustre - Sherif investigating
* PMIX / slurm23
* Bug reported upstream a few months back, fix available, seems to be working in OpenHPC
* [ ] Chris to track down slurm/pmix on Rocky 8 and see if it's working or not for next meeting
* cNode Kernel
* No updates yet
* SecureBoot Certs - Requested
* [ ] Notification for package updates upstream
* Wiki Updates - Neil and Sherif will work on this at FOSDEM/CentOS Connect
## Discussions
* Next meeting (8 Feb)
* Neil and Sherif traveling back from conferences
* FOSDEM and CentOS Connect
* Forrest and Brian Phan giving presentations on Apptainer/Warewulf
* adrianreber from OpenHPC team will be at FOSDEM
* Neil wants to nag him about a Mirrormanager bug
* Package list - Update
* [ ] Neil to create tickets for documentation on packages we've added, update list of what is yet to come
* Testing
* Brainstorm test scenarios we want to create for slurm, warewulf
* Stack is AWOL due to 👶, so we have some time to decide what we want and have a clear ask for Testing
### Open Floor
* N/A
### Action Items
* [ ] Chris to track down slurm/pmix on Rocky 8 and see if it's working or not for next meeting
* [ ] Neil to create tickets for documentation on packages we've added, update list of what is yet to come
* [ ] Notification for package updates upstream

# SIG/HPC Meeting 2024-02-08
## Attendees
* Sherif Nagy
* Neil Hanlon
* Chris Simmons
(Neil forgot to take attendance)
## Follow Ups
* Slurm 23.11.5 in production
* Adjust conflicts and provides for older packages
* Meeting with Intel on Monday re: GPU drivers; need insight on testing
* Monday @ 4PM Eastern (?) - Chris will invite Neil
* Secureboot support?
* Driver is fully open source
* no update from chris on PMIX
* no movement on Lustre filesystem yet
* Neil to actually put in [tickets](https://git.resf.org/sig_hpc/meta/issues) for [[Meeting/2024-01-25/Rocky/SIG/HPC|Last Meeting]]
* Brian Phan and Forrest Burt gave talks on Warewulf/Apptainer
* Sherif and Brian met up at FOSDEM and discussed testing for WW, and what we can/should test
## Discussions
* (Neil had to leave early)
### Open Floor
* N/A
### Action Items
* N/A

# SIG/HPC Meeting 2024-02-22
## Attendees
* Sherif Nagy
* Neil Hanlon
* Alan Marshall
* Brian Peters
* Chris Simmons
* Brian Phan
* Forrest Burt
## Follow Ups
* NVIDIA GPU driver Testing - Chris
* https://github.com/mghpcsim/gpu-testing/tree/master
* documented process for configuring instance, installing drivers (open source or proprietary), setting up container runtimes, nvidia container toolkit
* Benchmarks using forked toolkit from Lambda labs with Rocky customizations
* initial control benchmark (pytorch):
* closed drivers slightly (4s) faster
* Plan: run benchmarks on progressively newer instances and collect results (see the illustrative sketch after this list)
* Publish results on Wiki
* Intel driver - Met with them, went well
* Can build this driver into signed kernel modules, add to testing Chris is doing
* This will live in SIG/Kernel because it's a kernel module
* driver toolkit pieces probably will end up in HPC SIG
* Kernel Cnode (for MoS)
* Sherif synced with Jeremy
* Lots of progress has been made, almost all patches backported
* there are a couple of problematic patches -- they're based on SLES kernels, but just different enough to be problematic
* Pablo will help once the problem is set
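
Purely as an illustration of the control benchmark described above (not the forked Lambda Labs toolkit): time a fixed batch of GPU matrix multiplications in PyTorch, run the same script once per driver (open vs. proprietary) on the same instance, and compare the totals. Matrix size and iteration count are arbitrary assumptions.

```python
import time

import torch


def matmul_benchmark(size: int = 4096, iters: int = 50) -> float:
    """Time a batch of large GPU matrix multiplications; return elapsed seconds."""
    device = torch.device("cuda")
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    torch.cuda.synchronize()  # make sure setup work is finished before timing
    start = time.perf_counter()
    for _ in range(iters):
        _ = a @ b
    torch.cuda.synchronize()  # wait for all queued kernels to complete
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"{matmul_benchmark():.2f} s")
```
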
## Discussions
* Testing - Warewulf, others
* Sherif and Brian Phan synced on warewulf testing
* Not *just* installability, upgrade path, etc
* What can we use? Multiple things, probably
* OpenQA? TMT? Zuul? Whatever OpenHPC uses?
* Testing team would also love to get more people involved and participating in building tests
* Example tests:
* Provision cluster
* nodes communicate
* etc
* Want: have full end to end testing of all components
* What tests do we want? (see the sketch after this list)
* Functional
* Create cluster
* Create user
* Submit job as user
* Future:
* Slurm accounting/dbd, others
* Package tracking - PoI tracker
* Neil is looking how we can integrate this
* Wiki Updates - Neil and Sherif will work on this at FOSDEM/CentOS Connect
* This didn't really happen specifically, but discussions about ensuring Wikis are up to date did happen
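
A rough sketch of the functional checks listed above (cluster is up, a job can be submitted), driven through the standard Slurm CLI tools. It assumes an already provisioned cluster with `sinfo`/`sbatch` on PATH; user creation is omitted because it needs privileged setup, and a real harness would express these as TMT or OpenQA cases rather than bare asserts.

```python
import subprocess


def run(cmd: list[str]) -> str:
    """Run a command, fail loudly on a non-zero exit, and return stdout."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def test_nodes_are_usable() -> None:
    # "sinfo -h -o %T" prints one node state per partition/state group, e.g. "idle"
    states = run(["sinfo", "-h", "-o", "%T"]).split()
    assert states and all(s in ("idle", "mixed", "allocated") for s in states)


def test_submit_job() -> None:
    # "--wrap" submits a trivial command without needing a batch script on disk
    out = run(["sbatch", "--wrap", "hostname"])
    assert out.startswith("Submitted batch job")


if __name__ == "__main__":
    test_nodes_are_usable()
    test_submit_job()
    print("basic functional checks passed")
```
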
### Open Floor
* N/A
### Action Items
* Sherif to build and release intel driver
* Sherif and Brian to work on defining tests that we want to run
* Neil to work on package update notifications

# SIG/HPC Meeting 2024-03-07
## Attendees
* Forrest Burt
* Brian Phan
* Sherif Nagy
* Enrico Billi
* Neil Hanlon
* Jeremy Siadal
* Chris Stackpole
## Old Business
* Intel Driver -
* Sherif is working on this, has a prototype, needs DKMS
* Used `make spec` script in the branch to create spec, and import from there
* We think that upstream should adopt a different format/packaging methodology
* Perhaps [packit](https://packit.dev) could be helpful?
* What branch/version to use?
* rhel-specific branches say not to use them; use the 'backports' branches instead
* sherif appears to be in the right place
* Next steps:
* Neil to bring dkms from epel into projects
* Sherif to upload to public location for review and testing
* Jeremy to work on testing with some latest hardware
* AI SIG
* where will userspace tools live? HPC? AI? Both?
* Neil: it should be reasonable for us to have the ability to easily release a package in multiple SIGs
* NVidia GPU driver Testing -
* Did not get time to review [Chris's work](https://github.com/mghpcsim/gpu-testing/tree/master) - will try to review this cycle
* Kernel Cnode / MoS
* re-actioning - Jeremy to work on once he has some time
## New Business
* Testing Warewulf - Brian
* Current plan: put the tests upstream into Warewulf repo, Testing team can pull from / engage with upstream
* What precisely are we going to test?
* Functional/E2E tests -- provision a small cluster, etc (see the previous meeting's [discussions](https://sig-hpc.rocky.page/events/meeting-notes/2024-02-22/#discussions))
* Future work can include e.g. slurm
* Chris to check on status of slurm
* Packages to bring in
* [List](https://sig-hpc.rocky.page/packages/) on the wiki; needs updating (along with the rest of the wiki)
* if anyone wants to bring something in, has questions, etc. Please ask/get in touch!
* Neil to update the wiki
## Open Floor
* Vulnerability in [lustre](http://lists.lustre.org/pipermail/lustre-announce-lustre.org/2024/000270.html) - related to user namespaces
* Sherif was working on lustre-server, but it's a beast
* DDN already builds RPMS, but... is it worth it to rebuild vs just use upstream?
* Sherif: thinks it makes sense to rebuild against our specific user/kernel space
* there are lustre-server packages for 8, but not 9, it appears... why?
* documentation supports this, but again... why?
* Sherif to look into why lustre-server exists for 8 but not 9
* Next meeting in two weeks on Thursday, March 21
## Action Items
* [ ] Chris to check on status of slurm
* [ ] Neil to update the wiki
* [ ] Sherif to look into why lustre-server exists for 8 but not 9
* [ ] Neil to bring dkms from epel into projects
* [ ] Sherif to upload to public location for review and testing
* [ ] Jeremy to work on testing with some latest hardware

# SIG/HPC Meeting 2024-03-21
## Attendees
* Neil Hanlon
* Sherif Nagy
* Brian Phan
* Forrest Burt
## Follow Ups
* Intel GPU driver imported and built in SIG/Kernel 'kernel-drivers' repo.
* https://dl.rockylinux.org/stg/sig/9/kernel/x86_64/kernel-common/Packages/i/intel-i915-dkms-1.23.6.42.230425.56-1.x86_64.rpm
* Warewulf 4.5 released upstream
* Sherif looking into bringing update to SIG
* Running into issue on Rocky 9
* Testing - CIQ will be upstreaming a test suite
* Nvidia GPU driver benchmarking - re-actioning the review of this work
* Did not get time to review [Chris's work](https://github.com/mghpcsim/gpu-testing/tree/master) - will try to review this cycle
* Lustre server
* re-actioning; Sherif has not looked into it yet
* Wiki Content - still need to populate this. Can people from the SIG help?
* Packages - have some 'easy' ones
## Open Floor
* n/a
## Action Items
* [ ] Neil to bring dkms into the kernel-drivers repo in SIG/Kernel
* [ ] See if Alan would be willing to work on this
* [ ] Neil to look into resourcing some people to work on this
* [ ] Neil to make tickets for all packages we are looking to bring in, rank priority and ease