alchemiscale.org
compute resources status
DD – Plenty of capacity available. Compute resources are idling; nothing running currently.
JS – I'll submit lots of stuff later this week. Will be public.
current stack versions:
JS – Charge setting through OpenFE
JS – Core issue about charging is here: https://openforcefieldgroup.slack.com/archives/C02GG1D331P/p1716285193420149
JS – This seems to primarily be an issue on NRP. HMO expects that this will happen as much, if not more, on F@H. From a user perspective, about 20-30% of our tasks fail on NRP. There's plenty of compute, but each time 20-30% of calcs fail, so I'm having to repeatedly restart jobs to approach completion. My proposal would be a metadata file that maps errors to resolutions; for example, hitting the AMBER error would cause a compute node to be blacklisted. Are there plans for this?
DD – We do indeed have some open issues on this. I think I've triaged them for an upcoming release - not the OpenFE/GUFE 1.0-supporting release, but the one after it (0.6). Alchemiscale #258 is to track compute servers that consistently fail.
JS – I think this is slightly different - I'd propose explicitly checking the contents of an error and mapping it to a resolution.
DD – I have another issue for that (mapping regex error matches to retries). It's hard to track the identity of compute nodes - right now they're UUID-based, which has an element of randomness. We could make them deterministic, but that would be tricky on container orchestrators like NRP, where the pods themselves don't really know which machine they're running on.
JS – It's good that this batch of issues will land at about the same time as the F@H deployment - there will be new issues on F@H, so this gives us flexibility to handle them as they come up.
DD – This conversation is giving me ideas on how to expand on alchemiscale #211 / create a new issue (will probably be alchemiscale #277).
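A minimal sketch of the error-to-resolution mapping JS proposed, assuming a table of regex patterns checked against a failed task's traceback. The names (`ERROR_RESOLUTIONS`, `resolve_error`, the pattern strings) are illustrative, not an existing alchemiscale API:

```python
import re

# Hypothetical mapping from error signatures (regex) to resolutions.
# This could live in a metadata file shipped alongside the compute service.
ERROR_RESOLUTIONS = [
    (re.compile(r"AMBER", re.IGNORECASE), "blacklist_node"),
    (re.compile(r"CUDA out of memory"), "retry"),
]


def resolve_error(traceback_text: str) -> str:
    """Return the configured resolution for a failure, or 'escalate'
    when no pattern matches."""
    for pattern, action in ERROR_RESOLUTIONS:
        if pattern.search(traceback_text):
            return action
    return "escalate"
```

Matching on error contents rather than node identity sidesteps the problem DD raised: UUID-based compute node identities are random per pod, so a resolution keyed on the error text works even when the underlying machine can't be identified.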
IA – Re: charge setting
DD – GUFE #322 discusses cases where pickled RDMols don't keep charges by default. It also presents a resolution where we tell RDKit to include charges in pickling.
IA – What confuses me here is that we've already validated user charges - this is how we did the NAGL benchmarks.
DD – In alchemiscale-fah, we use multiprocessing, which requires things to get pickled to be distributed to workers, whereas in standard operations we don't pickle. JH did something different.
JH – We just do multiprocessing in the alchemiscale client… I don't think this affects the calcs, but it DOES …
JS – The multiprocessing in the alchemiscale client was introduced AFTER the OpenFE NAGL benchmark.
… (See recording, ~26 mins) … (resuming at ~30 mins)
IA – This seems like a reasonable fix. Does anyone see issues?
JW – I'm moderately experienced with RDKit/molecule representations and don't see any obvious issues with this approach.
IA – Any suggestions for how to test this?
IA – Timeline?
DD – I have an immediate workaround (the code block in GUFE #322), so this isn't super urgent.
IA – I'll plan to put this in our next GUFE release. (DD made a milestone for GUFE 1.1 and added this.)
IA – We should also get JH's other fix (atomic properties being lost) into this release.
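A small illustration of the behavior under discussion, assuming charges are stored as RDKit atom properties (the property name `PartialCharge` here is illustrative, not necessarily what GUFE uses): by default RDKit drops atom properties on pickling, and `Chem.SetDefaultPickleProperties` is the kind of fix GUFE #322 describes.

```python
import pickle

from rdkit import Chem

# Build a toy molecule and attach per-atom charges as atom properties
mol = Chem.MolFromSmiles("CO")
for atom in mol.GetAtoms():
    atom.SetDoubleProp("PartialCharge", 0.1 * (atom.GetIdx() + 1))

# Default pickle round-trip: atom properties are silently lost
lost = pickle.loads(pickle.dumps(mol)).GetAtomWithIdx(0).HasProp("PartialCharge")

# The resolution: tell RDKit to include all properties when pickling
Chem.SetDefaultPickleProperties(Chem.PropertyPickleOptions.AllProps)
restored = pickle.loads(pickle.dumps(mol)).GetAtomWithIdx(0).GetDoubleProp("PartialCharge")
```

This also explains why the NAGL benchmarks didn't surface the problem: the pickling path (multiprocessing in the alchemiscale client) was introduced after that benchmark ran.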
JS – There also seems to be a kind of issue with OpenFE overwriting/losing some user info.
DD – Working with IK on finishing out alchemiscale release 0.5.0.
DD – Working on testing MIG splitting on a Lilac A100 - haven't been able to spend a ton of time on this, so not much to report.