2021-03-15 QCF Troubleshooting Meeting notes

Date

Mar 15, 2021

Participants

David Dotson
Bill Swope
Ben Pritchard

Goals

Troubleshoot COMPLETE > INCOMPLETE cases in usage of QCArchive at Genentech for benchmark geometry optimizations
Identify where in the server codebase these issues can arise
Assess whether upcoming server codebase changes will mitigate or eliminate this issue

Discussion topics

Item	Presenter	Notes
Evidence and observations	Bill	Observing “Bus error” on some SLURM jobs.
		Observing OOM (out-of-memory) failures on some SLURM jobs; these are jobs that launch `qcfractal-server` and `qcfractal-manager` together.
Assessment and proposed solutions	Ben + David	BP: worried that multiple server and postgres processes are running at once; this could happen if two different nodes see the same postgres persistent files on a network filesystem.
		BS: no, only one instance runs at a time.
		For the 24-hour problem, it could be that manager+server are running out of memory, causing workers to die when the manager gets killed.
		BS: will increase the memory allocation for manager+server, and will also increase the memory allocation per worker in the manager config.
		If worker processes continue to die within some short time window, BS will log into the compute node where server+manager are running and check whether one of those processes has died.
COMPLETE > INCOMPLETE	Bill	Want to interrogate the server logs next for an optimization record that goes from COMPLETE > INCOMPLETE.
		BS will run something like the following for a known molid that switches in this way; we can then grep the server logs for that record id:
		`from qcfractal.interface import FractalClient`
		`fc = FractalClient(<uri>)`
		`optds = fc.get_collection('OptimizationDataset', <dataset_name>)`
		`optds.status()`
		`opt = optds.df.loc[<molid>, 'b3lyp-d3bj/dzvp']`
		`print(opt.id)`
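
Once the record id is known, the grep step could be done from Python as well. The following is only a sketch: the function name `find_record_lines` is hypothetical, the log file path is whatever the server was configured to write to, and it assumes the record id is a string of digits that appears verbatim in the relevant log lines (the actual log format was not checked in the meeting).

```python
import re


def find_record_lines(log_path, record_id):
    """Return (line_number, line) pairs from the log that mention record_id.

    The id is matched only when it is not surrounded by other digits,
    so searching for 123 does not also match 1234 or 5123.
    """
    pattern = re.compile(rf"(?<!\d){re.escape(str(record_id))}(?!\d)")
    matches = []
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(log_path, errors="replace") as fh:
        for lineno, line in enumerate(fh, start=1):
            if pattern.search(line):
                matches.append((lineno, line.rstrip("\n")))
    return matches
```

Calling `find_record_lines(<server_log_path>, opt.id)` would then list every line that mentions the record, in order, which should show when the status changed.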

Action items

Bill Swope will find a COMPLETE > INCOMPLETE molecule case in his own logs, get corresponding OptimizationRecord ID for interrogating server logs; meet with BP and DD on Wednesday for analysis

Bill Swope will increase memory allocation for manager+server submission, as well as memory allocation for manager-spawned workers via manager config.yml; observe if 24-hour failure of workers persists
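
If workers keep dying, the check on the compute node for a dead server or manager process could be scripted. This is a minimal sketch, not something agreed in the meeting: the helper names are hypothetical, and it assumes the command lines of the running processes contain the strings `qcfractal-server` and `qcfractal-manager`.

```python
import subprocess


def running_commands():
    """Return the command line of every process that `ps` can see."""
    # `ps -A -o args=` is POSIX: all processes, command line only, no header.
    result = subprocess.run(
        ["ps", "-A", "-o", "args="],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def missing_processes(expected, commands):
    """Return the names in `expected` that appear in none of `commands`."""
    return [
        name for name in expected
        if not any(name in command for command in commands)
    ]
```

On the compute node, `missing_processes(["qcfractal-server", "qcfractal-manager"], running_commands())` returns the names that are no longer running. The match is a plain substring test, so an unrelated command that mentions one of the names (for example, tailing a log file with that name) would also count as running.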
