Troubleshoot COMPLETE > INCOMPLETE cases in the use of QCArchive at Genentech for benchmark geometry optimizations
Identify where in the server codebase these issues can arise
Assess whether upcoming server codebase changes will mitigate or eliminate this issue
Discussion topics

Item: Evidences and observations
Presenter: Bill
Notes:
- Observing "Bus error" on some SLURM jobs
- Observing OOM on some SLURM jobs; these are jobs that launch qcfractal-server and qcfractal-manager together

Item: Assessment and proposed solutions
Presenter: Ben + David
Notes:
- BP: worried that multiple server and postgres processes could be running at once; this could happen if two different nodes see the same postgres persistent files on a network filesystem
- BS: no, only one instance runs at a time
- For the 24-hour problem, the manager+server job could be running out of memory, causing workers to die when the manager gets killed
- BS: will increase the memory allocation for the manager+server job
- BS: will also increase the per-worker memory allocation in the manager config (see the config sketch after this list)
- If worker processes continue to die within a short time window, BS will log into the compute node where the server+manager job is running and check whether either of those processes has died
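
A minimal sketch of the per-worker memory change, assuming a qcfractal-manager YAML config driving a SLURM cluster; the field names follow the qcfractal-manager configuration schema, and every value below is an illustrative placeholder rather than the actual Genentech settings:

# qcfractal-manager config excerpt (illustrative placeholders only)
common:
  adapter: parsl              # assumption: whichever adapter the current config already uses
  cores_per_worker: 8         # cores given to each worker
  memory_per_worker: 32       # memory (GB) per worker; raise this to give each task more headroom
  max_workers: 4              # maximum number of concurrent workers
  scratch_directory: "$TMPDIR"
cluster:
  scheduler: slurm            # workers are submitted through SLURM
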
Item: COMPLETE > INCOMPLETE
Presenter: Bill
Notes:
- Want to interrogate the server logs next for an optimization record that goes from COMPLETE > INCOMPLETE
- BS will run something like the following for a known molid that switches in this way; we can then grep the server logs for that record id
from qcfractal.interface import FractalClient

fc = FractalClient(<uri>)  # address of the running qcfractal-server
optds = fc.get_collection('OptimizationDataset', <dataset_name>)
optds.status()  # refresh record statuses for the dataset
opt = optds.df.loc[<molid>, 'b3lyp-d3bj/dzvp']  # record for this molecule and spec
print(opt.id)  # record id to grep for in the server logs
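
As a possible follow-up before grepping, a small sketch (assuming the same client session as above, and that FractalClient.query_procedures accepts a list of record ids) to confirm which status the server currently reports for that record:

# Re-fetch the record by id to see the server's current view of its status
rec = fc.query_procedures(id=[opt.id])[0]
print(rec.id, rec.status)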