
Date

Participants

Goals

  • Troubleshoot COMPLETE > INCOMPLETE cases in the use of QCArchive at Genentech for benchmark geometry optimizations

  • Identify where in the server codebase these issues can arise

  • Assess whether upcoming server codebase changes will mitigate or eliminate this issue

Discussion topics

Evidence and observations (Bill)

  • Observing “Bus error” for some SLURM jobs

  • Observing out-of-memory (OOM) failures on some SLURM jobs; these are jobs that launch qcfractal-server and qcfractal-manager together

Assessment and proposed solutions (Ben + David)

  • BP: worried that multiple server and postgres processes could be running at once; this can happen if two different nodes see the same postgres persistent files on a network filesystem

    • BS: no, only one instance is running at a time

  • For the 24-hour problem, the manager+server job may be running out of memory, causing workers to die when the manager gets killed

    • BS will increase the memory allocation for the manager+server job

    • BS will also increase the per-worker memory allocation in the manager config

    • If worker processes continue to die within some short time window, BS will log into the compute node where the server+manager job runs and check whether either of those processes has died (see the sketch after this list)
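
A minimal sketch of the kind of check described in the last bullet. It assumes psutil is available on the compute node and that the processes can be identified by "qcfractal-server" / "qcfractal-manager" in their command lines (both assumptions); it simply reports whether each process is still alive and how much memory it is using.

    # Sketch (assumes psutil is installed and the process names below match how
    # the server and manager were launched): report resident memory for
    # qcfractal-server and qcfractal-manager, and warn if either is not running.
    import psutil

    targets = ("qcfractal-server", "qcfractal-manager")
    seen = {t: False for t in targets}

    for proc in psutil.process_iter(["pid", "cmdline", "memory_info"]):
        cmd = " ".join(proc.info["cmdline"] or [])
        for t in targets:
            if t in cmd:
                seen[t] = True
                mem = proc.info["memory_info"]
                rss_gb = mem.rss / 1024**3 if mem else float("nan")
                print(f"{t}: PID {proc.info['pid']}, {rss_gb:.1f} GiB RSS")

    for t, alive in seen.items():
        if not alive:
            print(f"WARNING: no running process matches '{t}'")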

COMPLETE > INCOMPLETE (Bill)

  • Want to interrogate the server logs next for an optimization record that goes from COMPLETE to INCOMPLETE

  • BS will run something like the following for a known molid that switches in this way; we can then grep the server logs for that record id (a follow-up status check is sketched after the snippet)

    # Legacy QCFractal (0.x) client API; placeholders in <angle brackets> are
    # to be filled in with the server URI, dataset name, and molecule id.
    from qcfractal.interface import FractalClient

    fc = FractalClient(<uri>)
    optds = fc.get_collection('OptimizationDataset', <dataset_name>)

    # status() pulls current record statuses from the server and fills optds.df
    optds.status()

    # Look up the optimization record for this molecule and spec, then print
    # its record id so we can grep the server logs for it
    opt = optds.df.loc[<molid>, 'b3lyp-d3bj/dzvp']
    print(opt.id)
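
If useful, a short follow-up sketch (using the same legacy 0.x client; query_procedures and the record's status / modified_on fields come from that API) to confirm the record's current status by id before grepping the server logs:

    # Sketch: re-query the record by id and report its current status and
    # last-modified time; reuses the `fc` client and `opt` record from the
    # snippet above.
    print("record id:", opt.id, "| status when fetched:", opt.status)

    # Re-fetch later to see whether the status has flipped back to INCOMPLETE.
    for rec in fc.query_procedures(id=[opt.id]):
        print(rec.id, rec.status, rec.modified_on)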

Action items

  •  

Decisions
