Goals
- Troubleshoot cases where optimization records flip from COMPLETE to INCOMPLETE ("COMPLETE > INCOMPLETE") in the QCArchive deployment used at Genentech for benchmark geometry optimizations
- Identify where in the server codebase these issues can arise
- Assess whether upcoming server codebase changes will mitigate or eliminate the issue
Discussion topics
Item: Evidence and observations (presenter: Bill)
Notes:
- Observing "Bus error" failures for some SLURM jobs.
- Observing OOM kills on some SLURM jobs; these are combined launches of qcfractal-server and qcfractal-manager in the same job.
Item: Assessment and proposed solutions (presenters: Ben + David)
Notes:
- BP: worried about multiple server and postgres processes running at once; this could happen if two different nodes see the same postgres persistent files on a network filesystem.
- BS: no, only one instance runs at a time.
- For the 24-hour problem, the manager+server job may be running out of memory, causing workers to die when the manager gets killed.
- BS will increase the memory allocation for the manager+server job.
- BS will also increase the memory allocation per worker in the manager config (see the config sketch after this list).
- If worker processes continue to die within a short time window, BS will log into the compute node where the server+manager are running and check whether either of those processes has died (see the liveness-check sketch after this list).
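For reference, a minimal sketch of the per-worker memory setting, assuming the legacy (QCFractal 0.x) manager YAML config. It is written here as a Python dict mirroring the YAML layout; the field names (e.g. memory_per_worker, in GiB) are taken from the 0.x common manager settings and all values are placeholders, so both should be checked against the deployed version.

import yaml

# Sketch of a manager config with a raised per-worker memory limit.
# All values are placeholders, not the deployment's real settings.
manager_config = {
    "common": {
        "adapter": "pool",
        "tasks_per_worker": 1,
        "cores_per_worker": 8,    # placeholder core count
        "memory_per_worker": 32,  # GiB; the setting BS plans to raise
        "max_workers": 10,
    },
    "server": {
        "fractal_uri": "localhost:7777",  # placeholder URI
        "verify": False,
    },
}

# Dump to the YAML file that qcfractal-manager reads as its config.
with open("manager.yaml", "w") as fh:
    yaml.safe_dump(manager_config, fh, sort_keys=False)

And a minimal sketch of the liveness check on the compute node; the process names matched below are assumptions about how the two services appear in the process table.

import subprocess

# Hypothetical check that the server and manager processes are still alive.
for name in ("qcfractal-server", "qcfractal-manager"):
    # pgrep -f exits with code 0 only when a matching process exists
    alive = subprocess.run(["pgrep", "-f", name], capture_output=True).returncode == 0
    print(f"{name}: {'running' if alive else 'dead or not started'}")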
Item: COMPLETE > INCOMPLETE (presenter: Bill)
Notes:
- Next, we want to interrogate the server logs for an optimization record that goes from COMPLETE to INCOMPLETE.
- BS will run something like the following for a known molid that switches in this way; we can then search the server logs for that record id (see the log-scanning sketch after the snippet).
from qcfractal.interface import FractalClient

# connect to the QCFractal server
fc = FractalClient(<uri>)

# fetch the optimization dataset and refresh per-record statuses
optds = fc.get_collection('OptimizationDataset', <dataset_name>)
optds.status()

# print the record id for the molecule of interest at the b3lyp-d3bj/dzvp spec
opt = optds.df.loc[<molid>, 'b3lyp-d3bj/dzvp']
print(opt.id)
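Once the id prints, a minimal Python sketch of the log scan (equivalent to grepping the logs on the server host); the record id and log path below are placeholders, not the deployment's actual values.

# Hypothetical log scan: print every server log line mentioning the record.
record_id = "123456"              # placeholder: the id printed above
logfile = "qcfractal_server.log"  # placeholder: actual server log path

with open(logfile) as fh:
    for line in fh:
        if record_id in line:
            print(line.rstrip())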
Action items
Decisions