...
However, since I am able to roughly account for this by recording the memory of the entire QCEngine/geomeTRIC process in these experiments, I avoid its use.
Experiments
SCF_MEM_SAFETY_FACTOR
Setting SCF_MEM_SAFETY_FACTOR
is affected by number of cores. If I set 0.38 for 2 cores, the memory usage will be higher if I run the same thing with 8 cores.
This is the error if the safety factor is set too low (.20 when mem=2GB):
...
Segfault hunting
This script will run many psi4 processes in parallel (here -P 8
), and kill them if they run for more than 5 seconds:
Code Block |
---|
#!/bin/bash
id=$1
d=$PWD
finish() {
x=$(jobs -p)
[ "$x" ] && { echo $x | xargs -n 1 pkill -P ; }
}
run() {
id=$1
i=$2
rm -rf ${id}.${i}
mkdir -p ${id}.${i}
cd ${id}.${i}
trap finish SIGINT
python3 -u ../run-local.py $id &> output.log &
sleep 5
x=$(jobs -p)
[ "$x" ] && { echo $x | xargs -n 1 pkill -P ; kill $x ; }
rm -rf tmp*
echo "$(date) ${id}.${i} done; failed? $(if grep -q Failed output.log ; then echo -n yes ; else echo -n no ; fi )"
cd ../
}
export -f finish
export -f run
seq 10000 | xargs -P 8 -n 1 -I {} sh -c "run $1 {}"
|
Some magic had to be done to kill the psi4 subprocess as it wasn’t working with the older version. Should update the other script to use pkill -P
which says to kill the children of the process. It is still finicky for whatever reason at some times, so I have the mop command below.
This is my psi4 mop since it appears some psi4 commands don’t get cleaned up properly:
Code Block | ||
---|---|---|
| ||
while sleep 5 ; do
ps aux | grep psi4 | grep msgpack | sed "s/:/ /g" | awk '{if ($11 > 0) print $2 }' | xargs -r kill -9
done |
which basically parses ps
and kills any psi4 process that has run longer than a minute (noting that this is CPU time, not wall time). Not sure if there is a better way to query runtime on processes.
Finally, I run
Code Block |
---|
bash run-local.sh 34754174 |& tee segfaults.log & |
and then count the number of failures via
Code Block |
---|
grep -c yes segfaults.log # for the ones that fail
wc -l < segfaults.log # for the total tries |
Experiments
SCF_MEM_SAFETY_FACTOR
Setting SCF_MEM_SAFETY_FACTOR
is affected by number of cores. If I set 0.38 for 2 cores, the memory usage will be higher if I run the same thing with 8 cores.
This is the error if the safety factor is set too low (.20 when mem=2GB):
Code Block |
---|
FailedOperation(error=ComputeError(error_type='unknown', error_message='geomeTRIC run_json error:\nTraceback (most recent call last):\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/driver/procrouting/scf_proc/scf_iteratorgeometric/run_json.py", line 193214, in scfgeometric_run_initializejson\n self.initialize_jk(self.memory_jk_, jk=jkgeometric.optimize.Optimize(coords, M, IC, engine, None, params)\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/driver/procrouting/scf_proc/scf_iteratorgeometric/optimize.py", line 1281331, in initialize_jkOptimize\n return jkoptimizer.initializeoptimizeGeometry()\nRuntimeError:n \nFatal Error: SCF::DF: Disk based algorithm requires 2 (A|B) fitting metrics and an (A|mn) chunk on core.\n This is 2Q^2 + QNP doubles, where Q is the auxiliary basis size, N is theFile "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/optimize.py", line 1293, in optimizeGeometry\n self.calcEnergyForce()\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/optimize.py", line 1002, in calcEnergyForce\n spcalc = self.engine.calc(self.X, self.dirname)\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/engine.py", line 874, in calc\n return self.calc_new(coords, dirname)\n primary basis size, and P is the maximum number of functions in a primary shell.\n For this problem, that is 374074280 bytes before taxes,534391828 bytes after taxes. \n\nError occurred in file: /scratch/psilocaluser/conda-builds/psi4-multiout_1609773118680/work/psi4/src/psi4/libfock/DiskDFJK.cc on line: 727\nThe most recent 5 function calls were:\n\npsi::DiskDFJK::preiterations()\n\n\n')) |
The formatted error is:
Code Block |
---|
geomeTRIC run_json error: Traceback (most recent call last): File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/engine.py", line 866, in calc_new\n raise QCEngineAPIEngineError("QCEngineAPI computation did not execute correctly. Message: " + ret["error"]["error_message"])\ngeometric.errors.QCEngineAPIEngineError: QCEngineAPI computation did not execute correctly. Message: QCEngine Unknown Error: Traceback (most recent call last):\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/driver/schema_wrapper.py", line 411, in run_qcschema\n ret_data = run_json_qcschema(input_model.dict(), clean, False, keep_wfn=keep_wfn)\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/geometricdriver/runschema_jsonwrapper.py", line 214558, in geometric_run_json_qcschema\n geometric.optimize.Optimize(coordsval, M,wfn IC, engine, None, params) = methods_dict_[json_data["driver"]](method, **kwargs)\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/geometricdriver/optimizedriver.py", line 1331717, in Optimizegradient\n wfn return optimizer.optimizeGeometry() = procedures[\'gradient\'][lowername](lowername, molecule=molecule, **kwargs)\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/driver/geometricprocrouting/optimizeproc.py", line 12932383, in optimizeGeometry run_scf_gradient\n ref_wfn self.calcEnergyForce() = run_scf(name, **kwargs)\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/geometric/optimize.psi4/driver/procrouting/proc.py", line 10022288, in calcEnergyForcerun_scf\n spcalcscf_wfn = self.engine.calc(self.X, self.dirname) scf_helper(name, post_scf=False, **kwargs)\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/driver/geometricprocrouting/engineproc.py", line 8741568, in calcscf_helper\n e_scf return= selfscf_wfn.calccompute_newenergy(coords, dirname) )\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/geometric/enginepsi4/driver/procrouting/scf_proc/scf_iterator.py", line 86683, in calc_newscf_compute_energy\n raise QCEngineAPIEngineError("QCEngineAPI computation did not execute correctly. Message: " + ret["error"]["error_message"]) geometric.errors.QCEngineAPIEngineError: QCEngineAPI computation did not execute correctly. Message: QCEngine Unknown Error: Traceback (most recent call last): File "self.initialize()\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/driver/schema_wrapper/procrouting/scf_proc/scf_iterator.py", line 411193, in runscf_qcschemainitialize\n ret_data = run_json_qcschema(input_model.dict(), clean, False, keep_wfn=keep_wfn) self.initialize_jk(self.memory_jk_, jk=jk)\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/driver/schema_wrapper/procrouting/scf_proc/scf_iterator.py", line 558128, in run_json_qcschema val, wfn = methods_dict_[json_data["driver"]](method, **kwargs) File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/driver/driver.py", line 717, in gradientinitialize_jk\n jk.initialize()\nRuntimeError: \nFatal Error: SCF::DF: Disk based algorithm requires 2 (A|B) fitting metrics and an (A|mn) chunk on core.\n wfn = procedures['gradient'][lowername](lowername, molecule=molecule, **kwargs) File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/driver/procrouting/proc.py", line 2383, in run_scf_gradient ref_wfn = run_scf(name, **kwargs) File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/driver/procrouting/proc.py", line 2288, in run_scf scf_wfn = scf_helper(name, post_scf=False, **kwargs) File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/driver/procrouting/proc.py", line 1568, in scf_helper e_scf = scf_wfn.compute_energy() File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/driver/procrouting/scf_proc/scf_iterator.py", line 83, in scf_compute_energy self.initialize() This is 2Q^2 + QNP doubles, where Q is the auxiliary basis size, N is the\n primary basis size, and P is the maximum number of functions in a primary shell.\n For this problem, that is 374074280 bytes before taxes,534391828 bytes after taxes. \n\nError occurred in file: /scratch/psilocaluser/conda-builds/psi4-multiout_1609773118680/work/psi4/src/psi4/libfock/DiskDFJK.cc on line: 727\nThe most recent 5 function calls were:\n\npsi::DiskDFJK::preiterations()\n\n\n')) |
The formatted error is:
Code Block |
---|
geomeTRIC run_json error: Traceback (most recent call last): File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/driver/procrouting/scf_proc/scf_iteratorgeometric/run_json.py", line 193214, in scfgeometric_run_initializejson self.initialize_jk(self.memory_jk_, jk=jkgeometric.optimize.Optimize(coords, M, IC, engine, None, params) File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/driver/procrouting/scf_proc/scf_iteratorgeometric/optimize.py", line 1281331, in initialize_jkOptimize jkreturn optimizer.initializeoptimizeGeometry() RuntimeError: Fatal Error: SCF::DF: Disk based algorithm requires 2 (A|B) fitting metrics and an (A|mn) chunk on core. File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/optimize.py", line 1293, in optimizeGeometry self.calcEnergyForce() File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/optimize.py", line 1002, in calcEnergyForce spcalc This is 2Q^2 + QNP doubles, where Q is the auxiliary basis size, N is the primary basis size, and P is the maximum number of functions in a primary shell. For this problem, that is 374074280 bytes before taxes,534391828 bytes after taxes. Error occurred in file: /scratch/psilocaluser/conda-builds/psi4-multiout_1609773118680/work/psi4/src/psi4/libfock/DiskDFJK.cc on line: 727 The most recent 5 function calls were: psi::DiskDFJK::preiterations() |
0.25 works, but goes beyond the 2GB limit.
Here is an example using a safety factor of .20, and .75 (the default) with 4GB
...
DIIS
Now, I know DIIS saves state, so it might be the cause of the ramp. Turning off, we see exactly no change in memory behavior, other than the iterations are slower and more are needed:
...
DFT vs HF
Now, B3LYP/DZVP should be pretty stable; is the dispersion affecting this at all? Tried it, no. Just for fun; does using HF change anything? YES!
...
How about wb97x?
...
Yes, there is a ramp up (this one failed due to no disk space; currently working with a 18 GB space). However the memory requirements seem to be somewhat higher than b3lyp.
How about blyp? Yes
How about M06? Yes
How about TPSS? Yes
How about MP2? No (but the memory jumps way up after the HF finishes; ran out of disk space causing it to crash)
DF_INTS_NUM_THREADS
What about DF_INTS_NUM_THREADS
since the documentation says it could affect memory issues? Setting to 1 does seem to affect memory (note that all calculations are done using 8 cores):
...
There seems to only be a constant offset; the difference in the internal iterations is about the same as previous experiments.
Direct vs DF
Tried using SCF_TYPE
as DIRECT
, which turns off density fitting, recommended by Ben Pritchard. Documentation is https://psicode.org/psi4manual/master/scf.html#eri-algorithms
It is slow, and I am not patient enough to let this finish on my desktop as I have other things in the queue:
...
We see a slight overall ramp which is consistent with the CORE
algorithm shown below. Will try again later when things are idle.
- Let a full
DIRECT
run finish
Full Runs
CORE vs DISK algorithm
At this point; I am not sure what other options could affect memory. There seems to be a consistent ramp in memory for DFT; not sure if this is normal or not.
Now, let’s try to take the optimization out many steps using typical settings of 8 cores, 30GB. This was using the CORE
algorithm, meaning it could be done completely in memory. We have:
...
and it does fail after 76 iterations with:
Code Block |
---|
FailedOperation(error=ComputeError(error_type='unknown', error_message='geomeTRIC run_json error:\nTraceback (most recent call last):\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/run_json.py", line 214, in geometric_run_json\n geometric.optimize.Optimize(coords, M, IC, engine, None, params)\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/optimize.py", line 1331, in Optimize\n return optimizer.optimizeGeometry()\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/optimize.py", line 1298, in optimizeGeometry\n self.calcEnergyForce()\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/optimize.py", line 1002, in calcEnergyForce\n spcalc = self.engine.calc(self.X, self.dirname)\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/engine.py", line 874, in calc\n return self.calc_new(coords, dirname)\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/engine.py", line 866, in calc_new\n raise QCEngineAPIEngineError("QCEngineAPI computation did not execute correctly. Message: " + ret["error"]["error_message"])\ngeometric.errors.QCEngineAPIEngineError: QCEngineAPI computation did not execute correctly. Message: QCEngine Unknown Error: Unknown error, error message is not found\n')) |
Code Block |
---|
geomeTRIC run_json error:
Traceback (most recent call last):
File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/run_json.py", line 214, in geometric_run_json
geometric.optimize.Optimize(coords, M, IC, engine, None, params)
File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/optimize.py", line 1331, in Optimize
return optimizer.optimizeGeometry()
File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/optimize.py", line 1298, in optimizeGeometry
self.calcEnergyForce()
File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/optimize.py", line 1002, in calcEnergyForce
spcalc = self.engine.calc(self.X, self.dirname)
File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/engine.py", line 874, in calc
return self.calc_new(coords, dirname)
File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/engine.py", line 866, in calc_new
raise QCEngineAPIEngineError("QCEngineAPI computation did not execute correctly. Message: " + ret["error"]["error_message"])
geometric.errors.QCEngineAPIEngineError: QCEngineAPI computation did not execute correctly. Message: QCEngine Unknown Error: Unknown error, error message is not found
|
No memory errors. My machine has 31GB of memory and additional swap space, so if all is well there should have been enough memory to keep going even if virtual memory grabbed more than 31GB (I note that there is ~2GB more virtual than resident on average, so nothing crazy going on here). Let's try using the DISK
algorithm by setting the safety factor lower to .5 and memory to 16 GB. Realistically, this should have zero memory issues; even virtual memory shouldn’t hit the 16 GB limit. Here is what I have so far:
...
Died at 94, RIP.
Also, oddly, the memory spikes at iteration 75; this spike may have been the cause for the other failure where the memory fluctuations are much higher. Also notice that the memory tends to decrease in this run, whereas it tended to increase in the CORE
algorithm. Quite a difference in behavior.
Checking the output, it looks like everything is perfect, but the energy of the last psi4 run is None
and there is no output from it. Everything seems to point at a psi4 calculation failing to start. Will try plugging in the last schema manually and seeing if psi4 accepts it. Just looking at the last structure shows nothing remarkable.
It does finish successfully using HF:
...
Trying to load the last qcschema molecule into the task and running it, with some print statements in the QCE Psi4 harness. Noting that execute
is full of context managers and no exception handling. First two iterations went swimmingly, will see how long it lasts.
- Plot the results of the complete optimization starting from the failed one above (iteration 94)
= self.engine.calc(self.X, self.dirname)
File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/engine.py", line 874, in calc
return self.calc_new(coords, dirname)
File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/engine.py", line 866, in calc_new
raise QCEngineAPIEngineError("QCEngineAPI computation did not execute correctly. Message: " + ret["error"]["error_message"])
geometric.errors.QCEngineAPIEngineError: QCEngineAPI computation did not execute correctly. Message: QCEngine Unknown Error: Traceback (most recent call last):
File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/driver/schema_wrapper.py", line 411, in run_qcschema
ret_data = run_json_qcschema(input_model.dict(), clean, False, keep_wfn=keep_wfn)
File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/driver/schema_wrapper.py", line 558, in run_json_qcschema
val, wfn = methods_dict_[json_data["driver"]](method, **kwargs)
File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/driver/driver.py", line 717, in gradient
wfn = procedures['gradient'][lowername](lowername, molecule=molecule, **kwargs)
File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/driver/procrouting/proc.py", line 2383, in run_scf_gradient
ref_wfn = run_scf(name, **kwargs)
File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/driver/procrouting/proc.py", line 2288, in run_scf
scf_wfn = scf_helper(name, post_scf=False, **kwargs)
File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/driver/procrouting/proc.py", line 1568, in scf_helper
e_scf = scf_wfn.compute_energy()
File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/driver/procrouting/scf_proc/scf_iterator.py", line 83, in scf_compute_energy
self.initialize()
File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/driver/procrouting/scf_proc/scf_iterator.py", line 193, in scf_initialize
self.initialize_jk(self.memory_jk_, jk=jk)
File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib//python3.7/site-packages/psi4/driver/procrouting/scf_proc/scf_iterator.py", line 128, in initialize_jk
jk.initialize()
RuntimeError:
Fatal Error: SCF::DF: Disk based algorithm requires 2 (A|B) fitting metrics and an (A|mn) chunk on core.
This is 2Q^2 + QNP doubles, where Q is the auxiliary basis size, N is the
primary basis size, and P is the maximum number of functions in a primary shell.
For this problem, that is 374074280 bytes before taxes,534391828 bytes after taxes.
Error occurred in file: /scratch/psilocaluser/conda-builds/psi4-multiout_1609773118680/work/psi4/src/psi4/libfock/DiskDFJK.cc on line: 727
The most recent 5 function calls were:
psi::DiskDFJK::preiterations()
|
0.25 works, but goes beyond the 2GB limit.
Here is an example using a safety factor of .20, and .75 (the default) with 4GB
...
DIIS
Now, I know DIIS saves state, so it might be the cause of the ramp. Turning off, we see exactly no change in memory behavior, other than the iterations are slower and more are needed:
...
DFT vs HF
Now, B3LYP/DZVP should be pretty stable; is the dispersion affecting this at all? Tried it, no. Just for fun; does using HF change anything? YES!
...
How about wb97x?
...
Yes, there is a ramp up (this one failed due to no disk space; currently working with a 18 GB space). However the memory requirements seem to be somewhat higher than b3lyp.
How about blyp? Yes
How about M06? Yes
How about TPSS? Yes
How about MP2? No (but the memory jumps way up after the HF finishes; ran out of disk space causing it to crash)
DF_INTS_NUM_THREADS
What about DF_INTS_NUM_THREADS
since the documentation says it could affect memory issues? Setting to 1 does seem to affect memory (note that all calculations are done using 8 cores):
...
There seems to only be a constant offset; the difference in the internal iterations is about the same as previous experiments.
Direct vs DF
Tried using SCF_TYPE
as DIRECT
, which turns off density fitting, recommended by Ben Pritchard. Documentation is https://psicode.org/psi4manual/master/scf.html#eri-algorithms
It is slow, and I am not patient enough to let this finish on my desktop as I have other things in the queue:
...
We see a slight overall ramp which is consistent with the CORE
algorithm shown below. Will try again later when things are idle.
- Let a full
DIRECT
run finish
Full Runs
CORE vs DISK algorithm
At this point; I am not sure what other options could affect memory. There seems to be a consistent ramp in memory for DFT; not sure if this is normal or not.
Now, let’s try to take the optimization out many steps using typical settings of 8 cores, 30GB. This was using the CORE
algorithm, meaning it could be done completely in memory. We have:
...
and it does fail after 76 iterations with:
Code Block |
---|
FailedOperation(error=ComputeError(error_type='unknown', error_message='geomeTRIC run_json error:\nTraceback (most recent call last):\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/run_json.py", line 214, in geometric_run_json\n geometric.optimize.Optimize(coords, M, IC, engine, None, params)\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/optimize.py", line 1331, in Optimize\n return optimizer.optimizeGeometry()\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/optimize.py", line 1298, in optimizeGeometry\n self.calcEnergyForce()\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/optimize.py", line 1002, in calcEnergyForce\n spcalc = self.engine.calc(self.X, self.dirname)\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/engine.py", line 874, in calc\n return self.calc_new(coords, dirname)\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/engine.py", line 866, in calc_new\n raise QCEngineAPIEngineError("QCEngineAPI computation did not execute correctly. Message: " + ret["error"]["error_message"])\ngeometric.errors.QCEngineAPIEngineError: QCEngineAPI computation did not execute correctly. Message: QCEngine Unknown Error: Unknown error, error message is not found\n')) |
Code Block |
---|
geomeTRIC run_json error:
Traceback (most recent call last):
File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/run_json.py", line 214, in geometric_run_json
geometric.optimize.Optimize(coords, M, IC, engine, None, params)
File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/optimize.py", line 1331, in Optimize
return optimizer.optimizeGeometry()
File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/optimize.py", line 1298, in optimizeGeometry
self.calcEnergyForce()
File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/optimize.py", line 1002, in calcEnergyForce
spcalc = self.engine.calc(self.X, self.dirname)
File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/engine.py", line 874, in calc
return self.calc_new(coords, dirname)
File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/engine.py", line 866, in calc_new
raise QCEngineAPIEngineError("QCEngineAPI computation did not execute correctly. Message: " + ret["error"]["error_message"])
geometric.errors.QCEngineAPIEngineError: QCEngineAPI computation did not execute correctly. Message: QCEngine Unknown Error: Unknown error, error message is not found
|
No memory errors. My machine has 31GB of memory and additional swap space, so if all is well there should have been enough memory to keep going even if virtual memory grabbed more than 31GB (I note that there is ~2GB more virtual than resident on average, so nothing crazy going on here). Let's try using the DISK
algorithm by setting the safety factor lower to .5 and memory to 16 GB. Realistically, this should have zero memory issues; even virtual memory shouldn’t hit the 16 GB limit. Here is what I have so far:
...
Died at 94, RIP.
Also, oddly, the memory spikes at iteration 75; this spike may have been the cause for the other failure where the memory fluctuations are much higher. Also notice that the memory tends to decrease in this run, whereas it tended to increase in the CORE
algorithm. Quite a difference in behavior.
Checking the output, it looks like everything is perfect, but the energy of the last psi4 run is None
and there is no output from it. Everything seems to point at a psi4 calculation failing to start. Will try plugging in the last schema manually and seeing if psi4 accepts it. Just looking at the last structure shows nothing remarkable.
It does finish successfully using HF:
...
Trying to load the last qcschema molecule into the task and running it, with some print statements in the QCE Psi4 harness. Noting that execute
is full of context managers and no exception handling. First two iterations went swimmingly, will see how long it lasts.
- Plot the results of the complete optimization starting from the failed one above (iteration 94)
Segfaults
After spending enough time on this, I noticed that there was a random crash upon starting a job. To investigate, I took the problem input from above (at iteration 94), and ran it for 5 seconds, 10000 times and recorded how many times it segfaulted. It segfaulted 62/10000 times so far (1%) but was near 3% at certain times. Something tells me it is thread related since there was a period of time when too many psi4 jobs were running and casing them all to run at 50% CPU. Anywho, I ran this after modifying QCEngine to dump any and all contents from the Psi4Harness https://github.com/MolSSI/QCEngine/blob/master/qcengine/programs/psi4.py#L217
Code Block |
---|
run_cmd = [
which("psi4"),
"--scratch",
str(tmpdir),
"--nthread",
str(config.ncores),
"--memory",
f"{config.memory}GB",
"--qcschema",
"data.msgpack",
]
input_files = {"data.msgpack": input_model.serialize("msgpack-ext")}
try:
success, output = execute(
run_cmd, input_files, ["data.msgpack"], as_binary=["data.msgpack"], scratch_directory=tmpdir
)
print("SUCCESS IS", success)
print("OUTPUT IS\n", output)
except Exception as e:
print("BORK")
breakpoint()
if success:
output_data = deserialize(output["outfiles"]["data.msgpack"], "msgpack-ext")
else:
output_data = input_model.dict()
print("\n\nOUTPUT DATA IS\n", output_data)
|
It should be pretty clear which uncultured barbarian lines I added The point is the dump the output that psi4 outputs directly, then, just because I have no idea what is going on, check it after QCEngine has had a chance to get its hands on it. In this case, I was wise to check both versions:
Code Block |
---|
SUCCESS IS True
OUTPUT IS
{'proc': <subprocess.Popen object at 0x7f0a78fe36d0>, 'stdout': '', 'stderr': 'forrtl: severe (174): SIGSEGV, segmentation fault occurred\nImage PC Routine Line Source \nlibgdma.so 00007F60FDA7D0DC for__signal_handl Unknown Unknown\nlibpthread-2.32.s 00007F61046D7830 Unknown Unknown Unknown\nlibstdc++.so.6.0. 00007F609C4B68C9 Unkno
wn Unknown Unknown\nlibstdc++.so.6.0. 00007F609C4B6AF8 _ZSt29_Rb_tree_in Unknown Unknown\ncore.cpython-37m- 00007F61007D16AE Unknown Unknown Unknown\ncore.cpython-37m- 00007F61007E2542 Unknown Unknown Unknown\nlibomp.so 00007F609C60E153 __kmp_invoke_micr Unknown Unknown\nlibomp.so 00007F609C5B0CED Unknown Unknown Unknown\nlibom
p.so 00007F609C5B4CF6 __kmp_fork_call Unknown Unknown\nlibomp.so 00007F609C5A39E2 __kmpc_fork_call Unknown Unknown\ncore.cpython-37m- 00007F61007E10F5 Unknown Unknown Unknown\ncore.cpython-37m- 00007F61007F1BA0 Unknown Unknown Unknown\ncore.cpython-37m- 00007F61007F0587 Unknown Unknown Unknown\ncore.cpython-37m- 00007F61009C9969 _ZN3psi5
VBase10in Unknown Unknown\ncore.cpython-37m- 00007F61009C86A9 Unknown Unknown Unknown\ncore.cpython-37m- 00007F6102540A4E Unknown Unknown Unknown\ncore.cpython-37m- 00007F610252878D Unknown Unknown Unknown\ncore.cpython-37m- 00007F610254DC43 Unknown Unknown Unknown\ncore.cpython-37m- 00007F60FEA88E8A Unknown Unknown Unknown\ncore.cpy
thon-37m- 00007F60FE6CF66E Unknown Unknown Unknown\npython3.7 0000559523213504 _PyMethodDef_RawF Unknown Unknown\npython3.7 0000559523215B9A Unknown Unknown Unknown\npython3.7 00005595231F3191 PyObject_Call Unknown Unknown\npython3.7 00005595232525A4 Unknown Unknown Unknown\npython3.7 00005595231F2A7A Unknown
Unknown Unknown\ncore.cpython-37m- 00007F60FE6C95BB Unknown Unknown Unknown\npython3.7 000055952322DF3B _PyObject_FastCal Unknown Unknown\npython3.7 000055952322EA79 Unknown Unknown Unknown\npython3.7 00005595232A174A _PyEval_EvalFrame Unknown Unknown\npython3.7 00005595231E0932 _PyEval_EvalCodeW Unknown Unknown\npython3.7
00005595231E21F0 _PyFunction_FastC Unknown Unknown\npython3.7 000055952329E869 _PyEval_EvalFrame Unknown Unknown\npython3.7 00005595231E0932 _PyEval_EvalCodeW Unknown Unknown\npython3.7 00005595231E21F0 _PyFunction_FastC Unknown Unknown\npython3.7 000055952329E869 _PyEval_EvalFrame Unknown Unknown\npython3.7 00005595231E0932 _PyEval_EvalCo
deW Unknown Unknown\npython3.7 00005595231E21F0 _PyFunction_FastC Unknown Unknown\npython3.7 000055952329E869 _PyEval_EvalFrame Unknown Unknown\npython3.7 00005595231E0932 _PyEval_EvalCodeW Unknown Unknown\npython3.7 00005595231E21F0 _PyFunction_FastC Unknown Unknown\npython3.7 000055952329E869 _PyEval_EvalFrame Unknown Unknown\npython3.7
00005595231E0932 _PyEval_EvalCodeW Unknown Unknown\npython3.7 00005595231E21F0 _PyFunction_FastC Unknown Unknown\npython3.7 000055952329E869 _PyEval_EvalFrame Unknown Unknown\npython3.7 00005595231E0932 _PyEval_EvalCodeW Unknown Unknown\npython3.7 000055952322C700 _PyFunction_FastC Unknown Unknown\npython3.7 000055952322E8E8 Unknown
Unknown Unknown\npython3.7 000055952329DF95 _PyEval_EvalFrame Unknown Unknown\npython3.7 00005595231E0932 _PyEval_EvalCodeW Unknown Unknown\npython3.7 000055952322C700 _PyFunction_FastC Unknown Unknown\npython3.7 000055952322E8E8 Unknown Unknown Unknown\npython3.7 00005595232A174A _PyEval_EvalFrame Unknown Unknown\npython3.7
00005595231E0932 _PyEval_EvalCodeW Unknown Unknown\npython3.7 00005595231E1B49 PyEval_EvalCodeEx Unknown Unknown\npython3.7 00005595232C1C2B PyEval_EvalCode Unknown Unknown\npython3.7 000055952332E3FF Unknown Unknown Unknown\npython3.7 00005595233389C7 PyRun_FileExFlags Unknown Unknown\npython3.7 0000559523338B9C PyRun_SimpleFileE
Unknown Unknown\npython3.7 0000559523339129 Unknown Unknown Unknown\npython3.7 000055952333927C _Py_UnixMain Unknown Unknown\nlibc-2.32.so 00007F61043E2EBB __libc_start_main Unknown Unknown\npython3.7 00005595232ADC99 Unknown Unknown Unknown\ndouble free or corruption (fasttop)\n', 'outfiles': {{{msgpack binary stuff}}}
OUTPUT DATA IS
{'id': None, 'schema_name': 'qcschema_input', 'schema_version': 1, 'molecule': {'schema_name': 'qcschema_molecule', 'schema_version': 2, 'validated': True, 'symbols': array(['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'S', 'S', 'C',
'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'N', 'N', 'N', 'N',
'N', 'N', 'N', 'N', 'N', 'N', 'C', 'C', 'C', 'C', 'C', 'C', 'C',
'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'H',
'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H',
'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H',
'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H',
'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H'], dtype='<U1'), 'geometry': array([[-3.81401710e+00, 4.52186150e+00, -8.71215711e+00],
[ 7.04721078e+00, 6.06583650e+00, -7.69161444e+00],
[ 5.00674514e+00, -9.58739849e+00, 9.73687713e+00],
[-1.17332843e+01, -8.80603461e-01, -6.73137831e+00],
[-5.08855645e+00, -6.46906296e-01, -5.05759736e+00],
[ 6.45242066e+00, -5.57460393e+00, 3.03384361e+00],
[-3.53073787e+00, -7.09759523e+00, 5.59306174e+00],
[ 2.35267227e+00, 4.72746885e+00, -3.32531886e-01],
[ 9.27238920e+00, 1.22310123e-01, 3.52753032e+00],
[-7.44822757e+00, -7.75141307e-01, 1.96138265e+00],
[ 9.48349600e-01, -2.50327718e+00, 3.08054417e+00],
[-1.22971885e+00, 3.85946463e-01, 4.67143210e+00],
[-5.52365198e+00, 3.21701062e+00, -9.60957029e+00],
[ 4.90541071e+00, 6.60231010e+00, -6.95805669e+00],
[ 3.37044220e+00, -8.01013092e+00, 9.18668615e+00],
[-1.11319445e+01, -2.64910622e-03, -4.65737160e+00],
[-5.73672449e+00, 1.41073905e+00, -4.20746705e+00],
[ 7.44521225e+00, -5.85670023e+00, 5.11676902e+00],
[-5.28466667e+00, -5.77330568e+00, 4.80161712e+00],
[ 4.67503852e+00, 4.53824359e+00, -5.95477824e-01],
[ 7.03480063e+00, -1.26831145e-01, 4.09066519e+00],
[-5.49523382e+00, -1.48561054e+00, 8.29893944e-01],
[ 1.13892612e+00, -8.66150511e+00, 8.11887045e+00],
[-1.12125337e+01, -1.38697592e+00, -2.48373683e+00],
[ 8.69909870e+00, -8.01912576e+00, 5.79528224e+00],
[-6.53599366e+00, -4.07232171e+00, 6.25652493e+00],
[-7.66170895e+00, 2.65771474e+00, -8.25203930e+00],
[ 6.14069163e+00, -1.51894960e+00, 6.07669704e+00],
[-4.86369099e+00, -3.93512037e+00, 5.79946814e-01],
[ 4.35179483e+00, 6.94950411e+00, -4.44114009e+00],
[ 6.06072956e+00, 2.85676980e+00, 7.47042595e-01],
[-4.97671458e+00, 2.24813217e+00, -1.88820420e+00],
[-5.40085965e+00, 2.13309667e+00, -1.22529921e+01],
[ 2.71575097e+00, 7.02759709e+00, -8.76251564e+00],
[ 5.44882690e+00, 1.01649521e+01, 2.53403305e-01],
[ 8.97680891e+00, 9.98957580e+00, -2.97481694e+00],
[ 3.72247808e+00, -5.21471679e+00, 9.81058205e+00],
[-3.06443575e+00, -5.03427507e+00, -1.27055828e+00],
[-2.82072628e+00, -7.78052615e+00, -4.15610359e-01],
[-1.02714219e+01, 2.73829321e+00, -4.38302867e+00],
[ 6.48539085e+00, -4.33287427e+00, 9.71359571e+00],
[-5.41702428e+00, -8.38668326e+00, 7.35914348e-01],
[ 7.37391098e+00, 8.47164268e+00, -1.09801883e+00],
[ 3.29746975e+00, -8.22527958e-01, 1.17721286e+00],
[-2.59444281e+00, 2.10382969e+00, 1.97899208e+00],
[-7.63937320e+00, 3.14760427e+00, -5.57031731e+00],
[ 7.52253808e+00, -3.67743867e+00, 7.07103830e+00],
[-6.20588931e+00, -5.92068720e+00, 2.03189511e+00],
[ 6.11245334e+00, 6.19909696e+00, -2.46847124e+00],
[ 4.92429049e+00, 1.16362259e+00, 2.57665370e+00],
[-3.67126969e+00, 5.22410899e-01, -1.93852878e-01],
[-1.51885474e-01, -7.39079802e+00, 7.48667327e+00],
[ 8.74834297e-01, -1.04971125e+01, 7.64026395e+00],
[-1.02764581e+01, -8.48493292e-01, -8.92864240e-01],
[-1.15604601e+01, -3.25797807e+00, -2.70270989e+00],
[ 8.57260530e+00, -8.57808659e+00, 7.63260029e+00],
[ 8.40937798e+00, -9.46152669e+00, 4.56135136e+00],
[-5.81315441e+00, -3.68527578e+00, 7.98831429e+00],
[-7.56687146e+00, -2.70935385e+00, 5.36913893e+00],
[-8.81188813e+00, 1.23295403e+00, -8.83807111e+00],
[ 4.24488746e+00, -1.82286889e+00, 5.98407615e+00],
[ 2.51308284e+00, 6.99695324e+00, -3.90198881e+00],
[ 7.96747054e+00, 2.72566735e+00, 5.50748473e-01],
[-5.76493411e+00, 3.82721811e+00, -1.15170240e+00],
[-4.58674687e+00, 3.53967157e+00, -1.35246559e+01],
[-4.12394696e+00, 5.04550710e-01, -1.22108910e+01],
[-7.23788131e+00, 1.51484562e+00, -1.29664367e+01],
[ 3.04952916e+00, 5.92839312e+00, -1.04752020e+01],
[ 2.65835927e+00, 9.03010493e+00, -9.28890947e+00],
[ 8.77840501e-01, 6.50696951e+00, -7.97648878e+00],
[ 6.42536354e+00, 1.16909874e+01, 1.25383428e+00],
[ 4.31548410e+00, 9.11928599e+00, 1.63125118e+00],
[ 4.15501070e+00, 1.10309500e+01, -1.11108499e+00],
[ 7.78606562e+00, 1.08145790e+01, -4.45068743e+00],
[ 9.95095234e+00, 1.15334236e+01, -1.99836287e+00],
[ 1.03895428e+01, 8.80084184e+00, -3.90459990e+00],
[ 2.46413554e+00, -4.03313170e+00, 8.67043740e+00],
[ 3.05523295e+00, -5.01376466e+00, 1.17653383e+01],
[-3.86454709e+00, -4.83642390e+00, -3.16839101e+00],
[-1.25735359e+00, -4.05454348e+00, -1.22324250e+00],
[-1.36417214e+00, -7.93065822e+00, 1.03910042e+00],
[-2.33098845e+00, -9.02885169e+00, -1.98539676e+00],
[-1.02299907e+01, 3.28812111e+00, -2.39210628e+00],
[-1.16346139e+01, 3.94309860e+00, -5.36134766e+00],
[ 6.67514200e+00, -2.61563352e+00, 1.08484072e+01],
[ 7.66078197e+00, -5.77510517e+00, 1.06115900e+01],
[-5.34446280e+00, -9.95004967e+00, 2.07730985e+00],
[-6.78048110e+00, -8.85147801e+00, -7.49124418e-01],
[ 8.64352290e+00, 7.65637724e+00, 3.28849938e-01],
[ 2.31802253e+00, 1.15036562e-01, -3.72750008e-01],
[ 4.48428834e+00, -2.33720155e+00, 4.21622376e-01],
[-4.12560530e+00, 3.17708600e+00, 2.86828333e+00],
[-1.16343006e+00, 3.39426255e+00, 1.24252314e+00],
[-7.06611840e+00, 5.11363628e+00, -5.29245283e+00],
[ 9.50408280e+00, -3.12506991e+00, 7.26868772e+00],
[-8.24033641e+00, -5.57543196e+00, 1.94927554e+00],
[ 7.60990420e+00, 5.10059075e+00, -3.37732271e+00],
[ 3.71030901e+00, 2.25913305e+00, 3.84582191e+00],
[-2.14850446e+00, -3.46989331e-01, -1.26413910e+00]]), 'name': 'C29H48N10O10S2', 'identifiers': {'molecule_hash': 'bd5c30549dcd8cae9552be466581a0bc2ae897a9', 'molecular_formula': 'C29H48N10O10S2'}, 'molecular_charge': 0.0, 'molecular_multiplicity': 1, 'connectivity': [[0, 12, 2.0], [1, 13, 2.0], [2, 14, 2.0], [3, 15, 2.0], [4, 16, 2.0], [5, 17, 2.0], [6, 18, 2.0], [7, 19, 2.0], [8, 20, 2.0], [9, 21, 2.0], [10, 11, 1.0], [10, 43, 1.0], [11, 44, 1.0], [12, 26, 1.0], [12, 32, 1.0], [13, 29, 1.0], [13, 33, 1.0], [14, 22, 1.0], [14, 36, 1.0], [15, 23, 1.0], [15, 39, 1.0], [16, 31, 1.0], [16, 45, 1.0], [17, 24, 1.0], [17, 46, 1.0], [18, 25, 1.0], [18, 47, 1.0], [19, 30, 1.0], [19, 48, 1.0], [20, 27, 1.0], [20, 49, 1.0], [21, 28, 1.0], [21, 50, 1.0], [22, 51, 1.0], [22, 52, 1.0], [23, 53, 1.0], [23, 54, 1.0], [24, 55, 1.0], [24, 56, 1.0], [25, 57, 1.0], [25, 58, 1.0], [26, 45, 1.0], [26, 59, 1.0], [27, 46, 1.0], [27, 60, 1.0], [28, 37, 1.0], [28, 47, 1.0], [29, 48, 1.0], [29, 61, 1.0], [30, 49, 1.0], [30, 62, 1.0], [31, 50, 1.0], [31, 63, 1.0], [32, 64, 1.0], [32, 65, 1.0], [32, 66, 1.0], [33, 67, 1.0], [33, 68, 1.0], [33, 69, 1.0], [34, 42, 1.0], [34, 70, 1.0], [34, 71, 1.0], [34, 72, 1.0], [35, 42, 1.0], [35, 73, 1.0], [35, 74, 1.0], [35, 75, 1.0], [36, 40, 1.0], [36, 76, 1.0], [36, 77, 1.0], [37, 38, 1.0], [37, 78, 1.0], [37, 79, 1.0], [38, 41, 1.0], [38, 80, 1.0], [38, 81, 1.0], [39, 45, 1.0], [39, 82, 1.0], [39, 83, 1.0], [40, 46, 1.0], [40, 84, 1.0], [40, 85, 1.0], [41, 47, 1.0], [41, 86, 1.0], [41, 87, 1.0], [42, 48, 1.0], [42, 88, 1.0], [43, 49, 1.0], [43, 89, 1.0], [43, 90, 1.0], [44, 50, 1.0], [44, 91, 1.0], [44, 92, 1.0], [45, 93, 1.0], [46, 94, 1.0], [47, 95, 1.0], [48, 96, 1.0], [49, 97, 1.0], [50, 98, 1.0]], 'fix_com': True, 'fix_orientation': True, 'fix_symmetry': 'c1', 'provenance': {'creator': 'QCElemental', 'version': 'v0.17.0', 'routine': 'qcelemental.molparse.from_schema'}, 'id': '24775476', 'extras': {'canonical_isomeric_explicit_hydrogen_mapped_smiles': '[O:1]=[C:13]([N:27]([C@:46]([C:17](=[O:5])[N:32]([C@:51]([C:22](=[O:10])[N:29]1[C:38]([H:79])([H:80])[C:39]([H:81])([H:82])[C:42]([H:87])([H:88])[C@:48]1([C:19](=[O:7])[N:26]([H:58])[H:59])[H:96])([C:45]([S:12][S:11][C:44]([C@@:50]([C:21](=[O:9])[N:28]([C@:47]([C:18](=[O:6])[N:25]([H:56])[H:57])([C:41]([C:37]([C:15](=[O:3])[N:23]([H:52])[H:53])([H:77])[H:78])([H:85])[H:86])[H:95])[H:61])([N:31]([C:20](=[O:8])[C@@:49]([N:30]([C:14](=[O:2])[C:34]([H:68])([H:69])[H:70])[H:62])([C:43]([C:35]([H:71])([H:72])[H:73])([C:36]([H:74])([H:75])[H:76])[H:89])[H:97])[H:63])[H:98])([H:90])[H:91])([H:92])[H:93])[H:99])[H:64])([C:40]([C:16](=[O:4])[N:24]([H:54])[H:55])([H:83])[H:84])[H:94])[H:60])[C:33]([H:65])([H:66])[H:67]'}}, 'driver': 'gradient', 'model': {'method': 'b3lyp-d3bj', 'basis': 'dzvp'}, 'keywords': {'maxiter': 300, 'scf_properties': ['dipole', 'quadrupole', 'wiberg_lowdin_indices', 'mayer_indices'], 'scf_mem_safety_factor': 0.5}, 'protocols': {}, 'extras': {}, 'provenance': {'creator': 'QCElemental', 'version': 'v0.17.0', 'routine': 'qcelemental.models.results'}}
Computation done.
FailedOperation(error=ComputeError(error_type='unknown', error_message='geomeTRIC run_json error:\nTraceback (most recent call last):\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/run_json.py", line 214, in geometric_run_json\n geometric.optimize.Optimize(coords, M, IC, engine, None, params)\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/optimize.py", line 1331, in Optimize\n return optimizer.optimizeGeometry()\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/optimize.py", line 1293, in optimizeGeometry\n self.calcEnergyForce()\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/optimize.py", line 1002, in calcEnergyForce\n spcalc = self.engine.calc(self.X, self.dirname)\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/engine.py", line 874, in calc\n return self.calc_new(coords, dirname)\n File "/home/tgokey/.local/miniconda3/envs/psi4-test/lib/python3.7/site-packages/geometric/engine.py", line 866, in calc_new\n raise QCEngineAPIEngineError("QCEngineAPI computation did not execute correctly. Message: " + ret["error"]["error_message"])\ngeometric.errors.QCEngineAPIEngineError: QCEngineAPI computation did not execute correctly. Message: QCEngine Unknown Error: Unknown error, error message is not found\n'))
|
The segfault is from above is
Code Block |
---|
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libgdma.so 00007F60FDA7D0DC for__signal_handl Unknown Unknown
libpthread-2.32.s 00007F61046D7830 Unknown Unknown Unknown
libstdc++.so.6.0. 00007F609C4B68C9 Unknown Unknown Unknown
libstdc++.so.6.0. 00007F609C4B6AF8 _ZSt29_Rb_tree_in Unknown Unknown
core.cpython-37m- 00007F61007D16AE Unknown Unknown Unknown
core.cpython-37m- 00007F61007E2542 Unknown Unknown Unknown
libomp.so 00007F609C60E153 __kmp_invoke_micr Unknown Unknown
libomp.so 00007F609C5B0CED Unknown Unknown Unknown
libomp.so 00007F609C5B4CF6 __kmp_fork_call Unknown Unknown
libomp.so 00007F609C5A39E2 __kmpc_fork_call Unknown Unknown
core.cpython-37m- 00007F61007E10F5 Unknown Unknown Unknown
core.cpython-37m- 00007F61007F1BA0 Unknown Unknown Unknown
core.cpython-37m- 00007F61007F0587 Unknown Unknown Unknown
core.cpython-37m- 00007F61009C9969 _ZN3psi5VBase10in Unknown Unknown
core.cpython-37m- 00007F61009C86A9 Unknown Unknown Unknown
core.cpython-37m- 00007F6102540A4E Unknown Unknown Unknown
core.cpython-37m- 00007F610252878D Unknown Unknown Unknown
core.cpython-37m- 00007F610254DC43 Unknown Unknown Unknown
core.cpython-37m- 00007F60FEA88E8A Unknown Unknown Unknown
core.cpython-37m- 00007F60FE6CF66E Unknown Unknown Unknown
python3.7 0000559523213504 _PyMethodDef_RawF Unknown Unknown
python3.7 0000559523215B9A Unknown Unknown Unknown
python3.7 00005595231F3191 PyObject_Call Unknown Unknown
python3.7 00005595232525A4 Unknown Unknown Unknown
python3.7 00005595231F2A7A Unknown Unknown Unknown
core.cpython-37m- 00007F60FE6C95BB Unknown Unknown Unknown
python3.7 000055952322DF3B _PyObject_FastCal Unknown Unknown
python3.7 000055952322EA79 Unknown Unknown Unknown
python3.7 00005595232A174A _PyEval_EvalFrame Unknown Unknown
python3.7 00005595231E0932 _PyEval_EvalCodeW Unknown Unknown
python3.7 00005595231E21F0 _PyFunction_FastC Unknown Unknown
python3.7 000055952329E869 _PyEval_EvalFrame Unknown Unknown
python3.7 00005595231E0932 _PyEval_EvalCodeW Unknown Unknown
python3.7 00005595231E21F0 _PyFunction_FastC Unknown Unknown
python3.7 000055952329E869 _PyEval_EvalFrame Unknown Unknown
python3.7 00005595231E0932 _PyEval_EvalCodeW Unknown Unknown
python3.7 00005595231E21F0 _PyFunction_FastC Unknown Unknown
python3.7 000055952329E869 _PyEval_EvalFrame Unknown Unknown
python3.7 00005595231E0932 _PyEval_EvalCodeW Unknown Unknown
python3.7 00005595231E21F0 _PyFunction_FastC Unknown Unknown
python3.7 000055952329E869 _PyEval_EvalFrame Unknown Unknown
python3.7 00005595231E0932 _PyEval_EvalCodeW Unknown Unknown
python3.7 00005595231E21F0 _PyFunction_FastC Unknown Unknown
python3.7 000055952329E869 _PyEval_EvalFrame Unknown Unknown
python3.7 00005595231E0932 _PyEval_EvalCodeW Unknown Unknown
python3.7 000055952322C700 _PyFunction_FastC Unknown Unknown
python3.7 000055952322E8E8 Unknown Unknown Unknown
python3.7 000055952329DF95 _PyEval_EvalFrame Unknown Unknown
python3.7 00005595231E0932 _PyEval_EvalCodeW Unknown Unknown
python3.7 000055952322C700 _PyFunction_FastC Unknown Unknown
python3.7 000055952322E8E8 Unknown Unknown Unknown
python3.7 00005595232A174A _PyEval_EvalFrame Unknown Unknown
python3.7 00005595231E0932 _PyEval_EvalCodeW Unknown Unknown
python3.7 00005595231E1B49 PyEval_EvalCodeEx Unknown Unknown
python3.7 00005595232C1C2B PyEval_EvalCode Unknown Unknown
python3.7 000055952332E3FF Unknown Unknown Unknown
python3.7 00005595233389C7 PyRun_FileExFlags Unknown Unknown
python3.7 0000559523338B9C PyRun_SimpleFileE Unknown Unknown
python3.7 0000559523339129 Unknown Unknown Unknown
python3.7 000055952333927C _Py_UnixMain Unknown Unknown
libc-2.32.so 00007F61043E2EBB __libc_start_main Unknown Unknown
python3.7 00005595232ADC99 Unknown Unknown Unknown
double free or corruption (fasttop) |
Notice how this error is nowhere in the parsed data (success=True
oddly enough, so deserialize
seems to be the culprit for swallowing this segfault). Looking 1 nm closer, it looks like the part of the output that has this segfault is completely dropped in all cases, since deserialize
is only taking the output sections. Likely this should be changed since this error will never, ever a see the light of a computer screen.
In any case, more tests are needed to see how prevalent this issue is. For now, a %1 failure rate would tell me that I have a good chance in not finishing an optimization if psi4 is called 100 times (i.e. 100 minimization steps in geomeTRIC). Since the error rate is about %1, I run 1000 iterations using the original input, which should be no issue. I wasn’t getting many issues when each process had about 1 CPU, when I reduce the number to 4 concurrent tests, I see segfaults again, at this point 2/64.
Testing with 1 core jobs, appears to be smooth sailing… so if the last paragraph read like spaghetti, I can summarize by saying using multiple threads in psi4 causes a random, instant ( < 5 seconds) crash. At this point I’d like to compile my own versions of psi4 and test, but this is getting into areas more for my bemusement than solving QCA issues, which will always use precooked binaries. It does look like an OpenMP parallel section though based on the stack trace. I wonder how kosher the conda supplied libraries are?