# SIG/HPC Meeting 2024-02-22 ## Attendees * Sherif Nagy * Neil Hanlon * Alan Marshall * Brian Peters * Chris Simmons * Brian Phan * Forrest Burt ## Follow Ups * NVIDIA GPU driver Testing - Chris * https://github.com/mghpcsim/gpu-testing/tree/master * documented process for configuring instance, installing drivers (open source or proprietary), setting up container runtimes, nvidia container toolkit * Benchmarks using forked toolkit from Lambda labs with Rocky customizations * initial control benchmark (pytorch): * closed drivers slightly (4s) faster * Plan: run benchmarks on progressively newer instances and collect results * Publish results on Wiki * Intel driver - Met with them, went well * Can build this driver into signed kernel modules, add to testing Chris is doing * This will live in SIG/Kernel because it's a kernel module * driver toolkit pieces probably will end up in HPC SIG * Kernel Cnode (for MoS) * Sherif synced with Jeremy * Lots of progress has been made, almost all patches backported * there are couple problematic patches--they're based on SLES kernels, but a bit different enough to be problematic * Pablo will help once the problem is set ## Discussions * Testing - Warewulf, others * Sherif and Brian Phan synced on warewulf testing * Not *just* installibility, upgrade path, etc * What can we use? Multiple things, probably * OpenQA? TMT? Zuul? Whatever OpenHPC uses? * Testing team would also love to get more people involved and participating in building tests * Example tests: * Provision cluster * nodes communicate * etc * Want: have full end to end testing of all components * What tests do we want? * Functional * Create cluster * Create user * Submit job as user * Future: * Slurm accounting/dbd, others * Package tracking - PoI tracker * Neil is looking how we can integrate this * Wiki Updates - Neil and Sherif will work on this at FOSDEM/CentOS Connect * This didn't really happen specifically, but discussions about ensuring Wikis are up to date did happen ### Open Floor * N/A ### Action Items * Sherif to build and release intel driver * Sherif and Brian to work on defining tests that we want to run * Neil to work on package update notifications