2021-03-15 QCF Troubleshooting Meeting notes

Date

Mar 15, 2021

Participants

  • @David Dotson

  • Bill Swope

  • Ben Pritchard

Goals

  • Troubleshoot COMPLETE > INCOMPLETE cases in usage of QCArchive at Genentech for benchmark geometry optimizations

  • Identify where in the server codebase this issue can arise

  • Assess whether upcoming server codebase changes will mitigate or eliminate this issue

Discussion topics

Item

Presenter

Notes

Evidence and observations

Bill

  • Observing “Bus error” for some SLURM jobs

  • Observing out-of-memory (OOM) failures on some SLURM jobs; these are jobs that launch qcfractal-server and qcfractal-manager together

Assessment and proposed solutions

Ben + David

  • BP: worried that multiple server and postgres processes are running at once; this could happen if two different nodes see the same postgres persistent files on a network filesystem

    • BS: nope, only one instance running at a time

  • For 24-hour problem, could be that manager+server are running out of memory, causing workers to die when manager gets killed

    • BS: will increase memory allocation to manager+server

    • will also increase memory allocation per worker in manager config

    • If worker processes continue to die within some short time window, BS will log into the compute node where server+manager are running and see if one of those processes has died
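The check in the last bullet could be scripted roughly as follows. This is a minimal sketch, assuming a Linux compute node with `ps` available; the helper names and the idea of matching on the substrings `qcfractal-server` and `qcfractal-manager` are illustrative assumptions, not something agreed in the meeting.

```python
import subprocess

# Substrings expected in the command lines of the two processes
# (assumed to match the launchers named in the notes above).
EXPECTED = ("qcfractal-server", "qcfractal-manager")


def missing_processes(ps_output, expected=EXPECTED):
    """Return the expected names that appear in no line of `ps_output`."""
    lines = ps_output.splitlines()
    return [name for name in expected
            if not any(name in line for line in lines)]


def current_ps_output():
    """Capture the command line of every process on this node."""
    result = subprocess.run(
        ["ps", "-eo", "args"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On the node, `missing_processes(current_ps_output())` would return the names of any expected process that is no longer running, or an empty list if both are alive.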

COMPLETE > INCOMPLETE

Bill

  • Want to interrogate the server logs next for an optimization record that goes from COMPLETE > INCOMPLETE

  • BS will run something like the following for a known molid that switches in this way; we can then grep the server logs for that record id

    from qcfractal.interface import FractalClient

    fc = FractalClient(<uri>)
    optds = fc.get_collection('OptimizationDataset', <dataset_name>)
    optds.status()
    opt = optds.df.loc[<molid>, 'b3lyp-d3bj/dzvp']
    print(opt.id)
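Once the record id is known, the server log search could look like the sketch below. The function name and the log path passed to it are hypothetical, and the sketch assumes the id appears verbatim as a whole token in the log lines; plain `grep -w` on the command line would do the same job.

```python
import re
from pathlib import Path


def grep_record(log_path, record_id):
    """Return (line_number, line) pairs from the log that mention record_id."""
    # Match the id as a whole token so that id 123 does not match 1234.
    pattern = re.compile(rf"\b{re.escape(str(record_id))}\b")
    hits = []
    with Path(log_path).open(errors="replace") as log:
        for lineno, line in enumerate(log, start=1):
            if pattern.search(line):
                hits.append((lineno, line.rstrip("\n")))
    return hits
```

Printing the returned pairs in order gives a timeline of every log entry that touches the record, which is where a COMPLETE > INCOMPLETE transition should show up if the server logs it.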

Action items

Bill Swope will find a COMPLETE > INCOMPLETE molecule case in his own logs, get corresponding OptimizationRecord ID for interrogating server logs; meet with BP and DD on Wednesday for analysis
Bill Swope will increase memory allocation for manager+server submission, as well as memory allocation for manager-spawned workers via manager config.yml; observe if 24-hour failure of workers persists

Decisions