2021-02-12 QCArchive - PEPCONF Investigation 2 Meeting notes

Date

Feb 12, 2021

Participants

  • @David Dotson

  • @Pavan Behara

  • @Trevor Gokey

Goals

  • Updates from Pavan, David

  • Next steps

Discussion topics

Item

Presenter

Notes

Updates

Pavan, David

  • PB: running cases 34752766 and 34754174 locally; neither is completing

  • DD: also running these cases; attempting to falsify two hypotheses

  • PB: also explored SCF parameter adjustments in case 34752921:

    • got successful convergence with one change: soscf set to true, though this is more compute-intensive (a keyword sketch follows this list)

      • TG: perhaps it needs a Hessian every step? Need to verify with DGS

      • DD: Pavan, can you try a few more SCF convergence error cases and see if this improves things for those?

        • PB: will run '34754734', '34754755', '34754962'

  • DD: we are pursuing a couple different hypotheses (which could both be true) for the QCEngine Unknown Error cases:

    1. Hypothesis: QCEngine Unknown Errors are largely due to memory exhaustion from geomeTRIC.

      1. DD pursuing

      2. if this is the case, addressing the climbing memory issues that Trevor identified will help.

      3. psi4 issues are perhaps a red herring

      4. psi4 would be getting killed by the operating system in these cases, along with the calling Python process

      5. in these cases, is it surprising that we get an error at all?

        • unclear

        • can we simulate this locally by limiting the memory to the process with cgroups, as SLURM (and I think K8s) would do?

      6. evidence:

        • [not gathered] if we memory-profile the local executions we are currently performing,
          we should see the memory exhaustion manifest

    2. Hypothesis: QCEngine Unknown Errors are largely due to errors in psi4, possibly involving SCF errors

      1. PB pursuing

      2. evidence:

        • We see at least one case, 34752921, that yields a QCEngine Unknown Error and yet shows indications of an SCF convergence error
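
  • A minimal sketch of the kind of keyword change PB described, passed to psi4 through QCEngine; the molecule, method, and basis below are placeholders, not the PEPCONF case inputs, and only the keywords dict is the point:

    #!/usr/bin/env python
    # Hedged sketch: enable second-order SCF (soscf) in psi4 via QCEngine.
    # The molecule, method, and basis are placeholders, not the PEPCONF
    # case inputs; the keywords dict is the part PB changed.
    import qcelemental as qcel
    import qcengine

    mol = qcel.models.Molecule.from_data("""
    O  0.000  0.000  0.000
    H  0.000  0.000  0.970
    H  0.000  0.940 -0.240
    """)

    inp = qcel.models.AtomicInput(
        molecule=mol,
        driver="energy",
        model={"method": "b3lyp", "basis": "dzvp"},
        # more robust convergence for hard cases, at extra cost per iteration
        keywords={"soscf": True},
    )

    result = qcengine.compute(inp, "psi4")
    print(result.return_result if result.success else result.error.error_message)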

Trevor profiling geomeTRIC to pin down the B-matrix issue

Trevor

  • Profiling specifically the B-matrix dictionary to observe how large it grows in terms of memory usage; one possible instrumentation is sketched below
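
  • A sketch of how such instrumentation might look, using the standard library's tracemalloc; b_matrix_cache and take_step are hypothetical stand-ins for the geomeTRIC internals, not its actual names:

    #!/usr/bin/env python
    # Watch a cache dictionary grow across optimization steps.
    # b_matrix_cache and take_step are hypothetical stand-ins for the
    # geomeTRIC internals under investigation.
    import sys
    import tracemalloc

    import numpy as np

    def take_step(cache, step):
        # stand-in: a real optimizer step would compute and cache a B-matrix
        cache[step] = np.random.rand(300, 300)

    def cache_megabytes(cache):
        # rough size: dict overhead plus the numpy buffers it holds
        return (sys.getsizeof(cache) + sum(v.nbytes for v in cache.values())) / 1e6

    tracemalloc.start()
    b_matrix_cache = {}
    for step in range(50):
        take_step(b_matrix_cache, step)
        current, peak = tracemalloc.get_traced_memory()
        print(f"step {step:3d}: cache {cache_megabytes(b_matrix_cache):8.1f} MB, "
              f"traced {current / 1e6:8.1f} MB (peak {peak / 1e6:8.1f} MB)")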

Next steps

  • We’ll reconvene on Tuesday at the end of the QCA user meeting

    • share other observations via Slack

  • TG: will run modified geomeTRIC, with output for B-matrix space usage

    • DD: for your convenience, the script we have been using to pull a task down and re-run it locally:

      #!/usr/bin/env python
      # Fetch a task from the QCFractal server by its base result id,
      # save it, re-run it locally with QCEngine, and save the result.
      import os
      import sys

      from qcfractal.interface import FractalClient
      from qcengine import compute_procedure

      result_id = sys.argv[1]
      os.makedirs(result_id, exist_ok=True)

      cl = FractalClient.from_file()
      task = cl.query_tasks(base_result=result_id)[0]
      print(task)
      with open(os.path.join(result_id, 'task.json'), 'w') as f:
          f.write(task.json())

      result = compute_procedure(*task.spec.args)
      print(result)
      with open(os.path.join(result_id, 'result.json'), 'w') as f:
          f.write(result.json())

  • PB: continue investigating possible SCF parameter adjustments; try them on other SCF-failure cases

  • DD: analyze distribution of error tracebacks across resources (a query sketch follows this list); attempt to falsify the geomeTRIC memory hypothesis

    • check psi4 logs from local runs, see how much memory they actually use

    • see if cases complete locally

    • experiment with cgroups to simulate memory constraints on clusters (see the rlimit sketch below)
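
  • A possible starting point for the traceback-distribution analysis, sketched against the same FractalClient used above; the status filter and the manager field on task records are assumptions about the server API, so adjust as needed:

    #!/usr/bin/env python
    # Group errored tasks by the manager that last held them, to see
    # whether QCEngine Unknown Errors cluster on particular resources.
    from collections import Counter

    from qcfractal.interface import FractalClient

    cl = FractalClient.from_file()
    # assumption: query_tasks accepts a status filter on this server version
    errored = cl.query_tasks(status="ERROR")

    # assumption: task records carry the name of the manager that ran them
    by_manager = Counter(task.manager for task in errored)
    for manager, count in by_manager.most_common():
        print(f"{manager}: {count}")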
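
  • For the local simulation of cluster memory limits, a minimal sketch using the standard library's resource module as a rough stand-in for cgroups; RLIMIT_AS caps virtual memory rather than resident memory, so this only approximates what SLURM or Kubernetes enforce (run_task.py is a hypothetical name for the re-run script above):

    #!/usr/bin/env python
    # Rough local stand-in for a cgroup/SLURM memory cap: apply an
    # address-space rlimit to a child process before it executes.
    # RLIMIT_AS counts virtual memory rather than RSS, so this only
    # approximates a cgroup limit. Unix-only.
    import resource
    import subprocess

    LIMIT_BYTES = 4 * 1024**3  # e.g. 4 GiB; set to the cluster's per-task limit

    def set_memory_limit():
        # runs in the child between fork and exec
        resource.setrlimit(resource.RLIMIT_AS, (LIMIT_BYTES, LIMIT_BYTES))

    # run_task.py is a hypothetical name for the re-run script shown above
    subprocess.run(
        ["python", "run_task.py", "34752766"],
        preexec_fn=set_memory_limit,
        check=False,
    )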

Action items

@Pavan Behara will continue investigating possible SCF parameter adjustments; try them on other SCF-failure cases
@Trevor Gokey will run modified geomeTRIC, with output for B-matrix space usage; possibly try other memory profiling methods
@David Dotson will analyze distribution of error tracebacks across resources; attempt to falsify the geomeTRIC memory hypothesis; see if cases complete locally; experiment with cgroups to simulate memory constraints on clusters

Decisions