QCE/Psi4 notes
Intro
We keep getting weird issues with Psi4, which are only magnified by bigger molecules. Here, I examine how psi4 is utilizing memory to see if there are any glaring issues.
These experiments are done using an optimization that consistently fails using normal procedures. The optimization in QCArchive is 34754174. The “molecule” in question is:
Procedures/Scripts
Using psi4 1.4a3.dev63+afa0c21
I run the following in two terminals, one is the main psi4 running:
bash run-local.sh 34754174
and the other processes the output and generates a plot:
bash makeplot.sh 34754174
This will generate everything in a folder named 34754174. For the impatient, my plotting terminal actually runs the disk checker (see below) and the plotter periodically:
while sleep 60 ; do bash makeplot.sh 34754174 ; done &
while sleep 1; do
if [ ! -e 34754174/*_psi_scratch ]; then
continue;
fi
echo -n "$(date +%s ) "; du -s 34754174/*_psi_scratch | awk '{print $1}';
done >> 34754174/disk.dat &
In effect, running the first and last boxes in separate terminals should get you the plots I show below. The plot is 34754174/mem.png.
Running
Using this modified script from David Dotson to run a task (name it run-local.py):
Note that I added local_options={"memory": "16.0", "ncores": 8, "scratch_directory": os.getcwd() } to be able to control the CPU/memory given to psi4 and the scratch directory, since my local machine uses a tmpfs as /tmp. The script also keeps some record of which settings I was testing.
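As a rough illustration of keeping such a record, here is a minimal sketch. The local_options values are the ones quoted above. The keywords dictionary and the describe helper are hypothetical and not part of run-local.py; the keyword names are Psi4 options discussed later in these notes, with example values only.

```python
import json
import os

# Resource limits, copied from the local_options quoted above.
local_options = {
    "memory": "16.0",                   # memory limit passed along for psi4
    "ncores": 8,
    "scratch_directory": os.getcwd(),   # avoid a tmpfs-backed /tmp
}

# Hypothetical keyword overrides of the kind tried in these notes.
keywords = {
    "scf_mem_safety_factor": 0.5,
    "df_ints_num_threads": 1,
}


def describe(local_options, keywords):
    """Return a one-line JSON record of the settings used for a run."""
    record = {"local_options": local_options, "keywords": keywords}
    return json.dumps(record, sort_keys=True)


print(describe(local_options, keywords))
```

Appending one such line per run to a file in the job folder makes it easier to match a plot to the settings that produced it.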
I drive this python script with the following bash script, which does the initialization and memory recording (name this run-local.sh):
Ideally the traps are there to stop pidstat once the background psi4 job finishes, but it’s a bit finicky (and I haven’t had too many jobs run until completion). In any case, the pidstat command will create a memory log of the main process (i.e. the qcengine/geomeTRIC process) and the psi4 subprocess calls.
Plotting
I then have a similar setup for driving the plots. Here is the bash driver (name this makeplot.sh):
I do the head -n -1 funny business in case buffering produced an incomplete last line, which would make the numpy loader bork.
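The same guard could also live on the Python side. This is a minimal sketch, assuming whitespace-separated numeric columns; the function name load_complete_rows and the two-column sample are illustrative and not taken from plot.py.

```python
def load_complete_rows(text, ncols):
    """Parse whitespace-separated numeric rows, skipping incomplete ones.

    A row is kept only if it has exactly `ncols` fields and every field
    parses as a float, so a half-written final line is ignored.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != ncols:
            continue  # truncated (or blank) line
        try:
            rows.append([float(f) for f in fields])
        except ValueError:
            continue  # a field was cut off mid-number, e.g. "1.2e"
    return rows


# The last line was cut off by buffering and has only one field.
sample = "1600000000 1024\n1600000001 2048\n16000000"
print(load_complete_rows(sample, ncols=2))
```

A line truncated in the middle of its last number still has the right column count and would slip through, so dropping the final line unconditionally, as head -n -1 does, is the more conservative choice.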
The python plotter (name this plot.py):
As can be seen from the plotter, I decided to record disk usage of the psi4 scratch about halfway through the experiments. To enable this, I ran this in the background:
Not pretty since I wrote it as a quick one-liner, but works.
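For comparison, a Python version of the same disk check might look like the sketch below. The helper names are made up for illustration. It sums apparent file sizes in bytes, whereas du -s reports allocated blocks, so the numbers will not match the shell loop exactly.

```python
import os
import tempfile
import time


def dir_size_bytes(path):
    """Sum the apparent sizes of all files below `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # scratch files can vanish between listing and stat
    return total


def sample_line(path):
    """Return one 'epoch_seconds size' record, like the shell loop emits."""
    return f"{int(time.time())} {dir_size_bytes(path)}"


# Small self-check on a throwaway directory with one 4096-byte file.
with tempfile.TemporaryDirectory() as scratch:
    with open(os.path.join(scratch, "scratch.bin"), "wb") as fh:
        fh.write(b"\0" * 4096)
    print(sample_line(scratch))
```

Calling sample_line once per second and appending the result to disk.dat would mimic the one-liner, assuming the plotter only needs a timestamp and a size per line.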
I originally wanted to look into the B matrix cache of geomeTRIC, since it is unbounded and for some other experimental jobs I ran into the warning seen below. I used a locally modified geomeTRIC, which prints out the B matrix cache size whenever it is accessed (https://github.com/leeping/geomeTRIC/blob/master/geometric/internal.py#L1761):
However, since I am able to roughly account for this by recording the memory of the entire QCEngine/geomeTRIC process in these experiments, I avoid its use.
Segfault hunting
This script will run many psi4 processes in parallel (here -P 8), and kill them if they run for more than 5 seconds:
Some magic had to be done to kill the psi4 subprocess as it wasn’t working with the older version. Should update the other script to use pkill -P, which kills the children of the given process. It is still finicky at times for whatever reason, so I have the mop command below.
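One way to sidestep the child-killing problem is to start each job in its own session and signal the whole process group on timeout. The sketch below is not the script used here; the function names are hypothetical, it is POSIX-only, and a stand-in command that kills itself with SIGSEGV replaces the real psi4 call.

```python
import os
import signal
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_with_timeout(cmd, timeout):
    """Run `cmd` in its own session; kill the whole group on timeout.

    Returns the exit code, or None if the run was killed by the timeout.
    A negative exit code -N means the process died from signal N.
    """
    proc = subprocess.Popen(cmd, start_new_session=True,
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        try:
            # The child leads its own session, so its pgid equals its pid.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # it exited just before we got here
        proc.wait()
        return None


def count_segfaults(cmd, runs, workers, timeout):
    """Launch `runs` copies of `cmd`, `workers` at a time; count SIGSEGV exits."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(lambda _: run_with_timeout(cmd, timeout),
                              range(runs)))
    return sum(code == -signal.SIGSEGV for code in codes), codes


# Stand-in command: a child that kills itself with SIGSEGV.
crash = [sys.executable, "-c",
         "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
print(count_segfaults(crash, runs=4, workers=2, timeout=5)[0])
```

Checking the exit code for SIGSEGV only catches a crash in the process that was launched directly. If psi4 is started by a wrapper, the wrapper's exit code may hide the signal, so the output still needs checking.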
This is my psi4 mop since it appears some psi4 commands don’t get cleaned up properly:
which basically parses ps and kills any psi4 process that has run longer than a minute (noting that this is CPU time, not wall time). Not sure if there is a better way to query runtime on processes.
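The parsing step can be sketched in Python as follows. This assumes a listing with PID, TIME and COMMAND columns, such as `ps -eo pid,time,args` would produce, with TIME in the [[DD-]HH:]MM:SS form. The function names and the sample listing are made up for illustration.

```python
def cpu_seconds(field):
    """Convert a ps TIME field of the form [[DD-]HH:]MM:SS to seconds."""
    days = 0
    if "-" in field:
        day_part, field = field.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in field.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hours (and minutes)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def stale_pids(ps_output, name="psi4", limit=60):
    """Return PIDs whose command contains `name` and CPU time exceeds `limit`."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # header or malformed line
        pid, cpu, command = fields
        if name in command and cpu_seconds(cpu) > limit:
            pids.append(int(pid))
    return pids


# Made-up listing in the assumed PID / TIME / COMMAND layout.
listing = """  PID     TIME COMMAND
 4021 00:00:42 psi4 input.dat
 4022 00:03:10 psi4 input.dat
 4023 1-02:00:00 python plot.py
"""
print(stale_pids(listing))
```

On the runtime question: on Linux with procps, the etimes column of ps should report elapsed wall-clock time in seconds, which would avoid parsing the TIME format if wall time is what matters. Worth checking against the local ps man page before relying on it.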
Finally, I run
and then count the number of failures via
Experiments
SCF_MEM_SAFETY_FACTOR
The effect of SCF_MEM_SAFETY_FACTOR depends on the number of cores. If I set 0.38 for 2 cores, the memory usage will be higher if I run the same thing with 8 cores.
This is the error if the safety factor is set too low (.20 when mem=2GB):
The formatted error is:
0.25 works, but goes beyond the 2GB limit.
Here is an example using a safety factor of .20, and .75 (the default) with 4GB
DIIS
Now, I know DIIS saves state, so it might be the cause of the ramp. Turning it off, we see no change at all in memory behavior, other than that the iterations are slower and more of them are needed:
DFT vs HF
Now, B3LYP/DZVP should be pretty stable. Is the dispersion affecting this at all? Tried it, no. Just for fun, does using HF change anything? YES!
How about wb97x?
Yes, there is a ramp up (this one failed due to no disk space; currently working with an 18 GB space). However, the memory requirements seem to be somewhat higher than for b3lyp.
How about blyp? Yes
How about M06? Yes
How about TPSS? Yes
How about MP2? No (but the memory jumps way up after the HF finishes; ran out of disk space causing it to crash)
DF_INTS_NUM_THREADS
What about DF_INTS_NUM_THREADS, since the documentation says it could affect memory issues? Setting to 1 does seem to affect memory (note that all calculations are done using 8 cores):
There seems to only be a constant offset; the difference in the internal iterations is about the same as previous experiments.
Direct vs DF
Tried setting SCF_TYPE to DIRECT, which turns off density fitting, as recommended by Ben Pritchard. Documentation is https://psicode.org/psi4manual/master/scf.html#eri-algorithms
It is slow, and I am not patient enough to let this finish on my desktop as I have other things in the queue:
We see a slight overall ramp which is consistent with the CORE algorithm shown below. Will try again later when things are idle.
DIRECT run finish
Full Runs
CORE vs DISK algorithm
At this point, I am not sure what other options could affect memory. There seems to be a consistent ramp in memory for DFT. Not sure if this is normal or not.
Now, let’s try to take the optimization out many steps using typical settings of 8 cores, 30GB. This was using the CORE algorithm, meaning it could be done completely in memory. We have:
and it does fail after 76 iterations with:
No memory errors. My machine has 31GB of memory and additional swap space, so if all is well there should have been enough memory to keep going even if virtual memory grabbed more than 31GB (I note that there is ~2GB more virtual than resident on average, so nothing crazy going on here). Let's try using the DISK algorithm by setting the safety factor lower to .5 and memory to 16 GB. Realistically, this should have zero memory issues; even virtual memory shouldn’t hit the 16 GB limit. Here is what I have so far:
Died at 94, RIP.
Also, oddly, the memory spikes at iteration 75; this spike may have been the cause for the other failure where the memory fluctuations are much higher. Also notice that the memory tends to decrease in this run, whereas it tended to increase in the CORE algorithm. Quite a difference in behavior.
Checking the output, it looks like everything is perfect, but the energy of the last psi4 run is None and there is no output from it. Everything seems to point at a psi4 calculation failing to start. Will try plugging in the last schema manually and seeing if psi4 accepts it. Just looking at the last structure shows nothing remarkable.
It does finish successfully using HF:
Trying to load the last qcschema molecule into the task and running it, with some print statements in the QCE Psi4 harness. Note that execute is full of context managers and has no exception handling. First two iterations went swimmingly, will see how long it lasts.
Segfaults
After spending enough time on this, I noticed that there was a random crash upon starting a job. To investigate, I took the problem input from above (at iteration 94), and ran it for 5 seconds, 10000 times and recorded how many times it segfaulted. It segfaulted 62/10000 times so far (0.6%, so roughly 1%) but was near 3% at certain times. Something tells me it is thread related since there was a period of time when too many psi4 jobs were running and causing them all to run at 50% CPU. Anywho, I ran this after modifying QCEngine to dump any and all contents from the Psi4Harness https://github.com/MolSSI/QCEngine/blob/master/qcengine/programs/psi4.py#L217
It should be pretty clear which uncultured barbarian lines I added. The point is to dump the output that psi4 produces directly, then, just because I have no idea what is going on, check it again after QCEngine has had a chance to get its hands on it. In this case, I was wise to check both versions:
The segfault from above is
Notice how this error is nowhere in the parsed data (success=True oddly enough, so deserialize seems to be the culprit for swallowing this segfault). Looking 1 nm closer, it looks like the part of the output that has this segfault is completely dropped in all cases, since deserialize is only taking the output sections. Likely this should be changed, since otherwise this error will never, ever see the light of a computer screen.
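Until that is changed, one stopgap would be to scan the raw streams for crash markers before believing the parsed result. The following is only a sketch: the marker strings are guesses at what a crash might leave behind, and the function names and the shape of the parsed dictionary are assumptions, not QCEngine's API.

```python
# Assumed marker strings; adjust to whatever the raw dump actually contains.
CRASH_MARKERS = ("Segmentation fault", "SIGSEGV", "core dumped")


def looks_crashed(raw_stdout, raw_stderr=""):
    """True if either raw stream contains one of the crash markers."""
    combined = f"{raw_stdout}\n{raw_stderr}"
    return any(marker in combined for marker in CRASH_MARKERS)


def trust_result(parsed, raw_stdout, raw_stderr=""):
    """Believe success=True only when the raw streams show no crash marker."""
    return bool(parsed.get("success")) and not looks_crashed(raw_stdout, raw_stderr)


# A parsed result claiming success while the raw stderr says otherwise.
parsed = {"success": True}
print(trust_result(parsed, raw_stdout="", raw_stderr="Segmentation fault (core dumped)"))
```

This deliberately errs on the side of distrust: a run that mentions one of the markers for an unrelated reason would also be flagged.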
In any case, more tests are needed to see how prevalent this issue is. For now, a 1% failure rate would tell me that I have a good chance of not finishing an optimization if psi4 is called 100 times (i.e. 100 minimization steps in geomeTRIC). Since the error rate is about 1%, I run 1000 iterations using the original input, which should be no issue. I wasn’t getting many issues when each process had about 1 CPU; when I reduce the number to 4 concurrent tests, I see segfaults again, at this point 2/64.
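To put a number on "a good chance": if each psi4 call crashes independently with probability p, the chance that at least one of n calls crashes is 1 - (1 - p)^n. For p = 1% and n = 100 that comes to roughly 63%. Independence between calls is an assumption here, and the function name is made up.

```python
def p_any_failure(p_crash, n_calls):
    """Probability that at least one of n independent calls crashes."""
    return 1.0 - (1.0 - p_crash) ** n_calls


# Crash rates seen on the desktop (about 1%, up to 3% at times).
for p in (0.01, 0.03):
    chance = p_any_failure(p, 100)
    print(f"p={p:.2f}: {chance:.2f} chance of a crash within 100 steps")
```

The same function can be applied to any other observed crash rate and step count.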
Testing with 1-core jobs appears to be smooth sailing… so if the last paragraph read like spaghetti, I can summarize by saying that using multiple threads in psi4 causes a random, instant (< 5 seconds) crash. At this point I’d like to compile my own versions of psi4 and test, but this is getting into areas more for my bemusement than solving QCA issues, which will always use precooked binaries. It does look like an OpenMP parallel section though, based on the stack trace. I wonder how kosher the conda-supplied libraries are?
I have finally worked out a similar test on the UCI cluster, where the original errors manifested. I found that running the same test produced 9/10000 crashes (0.09%). Of course, I think this number depends on a lot of things, since it seems to be thread related, and therefore at the hands of the OS scheduler and other related things. Regardless, the segfault has been reproduced on the cluster.
Next is to test the memory across the UCI cluster. These jobs will run for 8+ hours, and are therefore much more expensive. In the 10k segfault test case, I kill the jobs after 60s, so throughput was high. I will run a few (100), and see what the memory profiling comes out to. Note that, because I have shown that the default settings will regularly go above the memory limit, I am going to set the safety factor to 50%. In this hypothesis, I expect all to succeed sans the random crashes we get at the start. There will also be some failures since I am running on a pre-emptable queue, and will take this into account.
All but one job was pre-empted; the one that survived finished after 152 iterations:
So as far as I am concerned, we have this fixed if we reduce the safety factor. Just to confirm, I will submit two jobs with the default safety factor using 16GB and 32GB workers, since these would be the settings used when the crashes occurred.
Running with 16GB, it dies shortly after starting. A nice thing I found, since I am running these interactively, I get this after ending the session:
which is definitely telling. Here is the memory profile:
and the memory profile when I set the Psi4 memory to 32GB: