Notes |
---|
JC : How a Protocol could be launched on a node that has multiple GPUs available, so that it could launch multiple processes (e.g. via MPI) to reduce wall-clock time. The Protocol would need to say how many processes can be run, and the alchemiscale worker would need to be able to hear this and make a decision. This isn’t a huge priority but it would give us big speedups.

- DD – Potential speedup? Single node?
- JC – Linear with nGPUs. Single node.
- DD – This would require a compute service that can sit on top of a node and count how many GPUs it has access to.
- JC – It could be a command-line argument, like “use this many processes per task”.
- DD – How would it spread itself across multiple GPUs?
- JC – We could make it so each process gets a different CUDA_VISIBLE_DEVICES mask.
- IA – There’s a lot of stuff in OMMTools that calls mpiplus. Is this already baked into OMMTools?
- JC – Yes, you’d initiate the run inside of the MPI process (like inside mpirun).
- IA – Could I have a look at an example?
- JC – Yes, right now this just uses GPU device 0. There’s an option to use a config file. It’s complicated because it’s meant to run on nodes with different numbers of available GPUs. We can also simplify this in the future using something like Dask.
- IA – From an OpenFE perspective, we’d be keen for this support. Our current compute resources
- JW – Can we run single-GPU jobs currently? And just do multiple of them on the same node, with each using one GPU?
- DD – Yes. But there’s an advantage here where some protocols can oversubscribe GPUs (like, submit multiple sims to one GPU and …) – the OpenFE repex protocol consumes all of whatever GPU it’s given. The noneqcycling protocol in perses doesn’t fully utilize a GPU, so it can oversubscribe successfully. I think what JC’s asking for is “how can we take advantage of multiple GPUs on one node?”
- IP – For example, in IZ’s work on protein mutations, we’re using two GPUs, and we want to fully utilize those. So she was running two replicas at the same time on two GPUs.
- DD – OK, so that wouldn’t work currently. But what the alchemiscale architecture would let us do is define a new type of compute service that sits on an entire node (or multiple GPUs) and does things cleverly. We’d discussed this before but hadn’t made plans to take action on this.
- IP – …
- IA – Could we move this to a power hour? I’m not sure where this would live.
- DD – Agree, I’d like to determine where the complexity would live. RG is out this week and DS is out next week?
- IA – When would we be moving on this?
- JW – I wouldn’t really approve this before the F@H work completes, so this isn’t very urgent.
- IA – I’ll bring this up as a discussion topic in the Aug 17 OpenFE power hour, once everyone’s back.
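
For reference ahead of the power hour, the "different CUDA_VISIBLE_DEVICES mask per process" idea could look roughly like the sketch below. This is an illustration only, not alchemiscale or OMMTools code: the function names (`visible_gpu_ids`, `per_process_envs`, `launch`) are hypothetical, and it uses plain `subprocess` launches rather than `mpirun`/mpiplus to keep the example self-contained. It assumes the node-level service is handed the GPU ids it may use (here read from its own `CUDA_VISIBLE_DEVICES`) and a "processes per task" count, and it assigns GPUs round-robin so that asking for more processes than GPUs oversubscribes them.

```python
import os
import subprocess
import sys


def visible_gpu_ids(env=None):
    """Return the GPU ids explicitly listed in CUDA_VISIBLE_DEVICES.

    Returns [] if the variable is unset or empty; in that case the caller
    has to supply the ids some other way (this sketch does not query the driver).
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


def per_process_envs(gpu_ids, n_processes, base_env=None):
    """Build one environment per process, each masked to a single GPU.

    GPUs are assigned round-robin, so n_processes > len(gpu_ids)
    places several processes on the same GPU (oversubscription).
    """
    if not gpu_ids:
        raise ValueError("no GPU ids given")
    if n_processes < 1:
        raise ValueError("n_processes must be at least 1")
    base_env = dict(os.environ if base_env is None else base_env)
    envs = []
    for rank in range(n_processes):
        env = dict(base_env)
        env["CUDA_VISIBLE_DEVICES"] = gpu_ids[rank % len(gpu_ids)]
        envs.append(env)
    return envs


def launch(command, envs):
    """Start one copy of `command` per environment and wait for all of them.

    Returns the exit codes in launch order.
    """
    procs = [subprocess.Popen(command, env=env) for env in envs]
    return [proc.wait() for proc in procs]
```

With two GPUs and two processes, `per_process_envs(["0", "1"], 2)` gives the first process a mask of `0` and the second a mask of `1`; each child then sees exactly one device, which it addresses as device 0. Where the process count comes from (the Protocol, a command-line argument, or the compute service) is the open question from the discussion and is left out of the sketch.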