...
I have finally worked out a similar test on the UCI cluster, where the original errors manifested. Running the same test produced 9/10000 crashes (0.09%). Of course, this number depends on a lot of things: the failure seems to be thread-related, and is therefore at the mercy of the OS scheduler among other factors. Regardless, the segfault has been reproduced on the cluster.
Next is to test the memory across the UCI cluster. These jobs will run for 8+ hours and are therefore much more expensive. In the 10k segfault test I killed the jobs after 60 s, so throughput was high. Here I will run a few (10-100) and see what the memory profiling comes out to. Note that, because I have shown that the default settings will regularly go above the memory limit, I am going to set the safety factor to 50% (see the sketch below). Under this hypothesis, I expect all jobs to succeed aside from the random crashes we get at the start. There will also be some failures because I am running on a pre-emptable queue, which I will take into account.
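As a rough sketch of what the 50% safety factor means in practice (the numbers and names below are illustrative, not the actual worker configuration):

    # Illustrative only: the memory cap handed to Psi4 is a fraction of the
    # worker's scheduler allocation, leaving headroom for transient overshoot.
    WORKER_MEM_GB = 16      # example per-worker allocation from Slurm
    SAFETY_FACTOR = 0.50    # fraction of that allocation Psi4 may use

    psi4_cap_bytes = int(WORKER_MEM_GB * 1e9 * SAFETY_FACTOR)
    print(f"Psi4 cap: {psi4_cap_bytes / 1e9:.0f} GB of a {WORKER_MEM_GB} GB allocation")
    # e.g. psi4.set_memory(psi4_cap_bytes) would then enforce this cap.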
All but one of the jobs were pre-empted; the one that survived finished after 152 iterations:
...
So as far as I am concerned, this is fixed if we reduce the safety factor. Just to confirm, I will submit two jobs with the default safety factor using 16 GB and 32 GB workers, since these are the settings that were in use when the crashes occurred.
Running with 16 GB, the job dies shortly after starting. A nice thing I found: since I am running these interactively, I get this after ending the session:
    srun: error: hpc3-15-26: task 0: Out Of Memory
which is definitely telling. Here is the memory profile:
...
(noting here that I think the Psi4 memory cap is in units of decimal GB, so I need to scale down by (1000/1024)^3 to compare it against the GiB-based limits)
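To make that scaling concrete, here is a quick illustrative check (assuming the cap really is in decimal GB while the profiler/cgroup limits are in GiB):

    # Illustrative unit check: decimal GB (10^9 bytes) vs binary GiB (2^30 bytes).
    gb_to_gib = (1000 / 1024) ** 3   # ~0.931

    for cap_gb in (16, 32):
        print(f"{cap_gb} GB = {cap_gb * gb_to_gib:.2f} GiB")
    # 16 GB = 14.90 GiB
    # 32 GB = 29.80 GiB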
And here is the memory profile when I set the Psi4 memory to 32 GB:
    srun: error: hpc3-15-09: task 0: Out Of Memory
...