2021-02-16 QCFractal Users Meeting notes

Date

Feb 16, 2021

Participants

  • @Jeffrey Wagner

  • Heejune Park

  • Ben Pritchard

  • @David Dotson

  • @Hyesu Jang

  • @Joshua Horton

  • @Trevor Gokey

Discussion topics

Item

Notes

Updates from MolSSI

  • BP – No upcoming releases/outages planned

  • BP – Refactor is going well. Test server will come online soon – We’d like you to use this to see if anything breaks. Test server will be “fresh”/will not contain existing outputs.

  • DD – I could set up a tracking board just for that server, and have some of the QCA dataset submissions go to the test server.

  • BP – I’d like for this server to be something like a continuously-deployed version from master.

Queue status

QCA Dataset Tracking board:

  • DD – Been working on PEPCONF. Ran it for about 1.5 weeks with dlc coordinates and reset=True. We’ve had about 2000 complete out of 7000 jobs. The rest are errored. Focusing on two “unknown” errors.

  • DD – Queue is emptying out. We’ve been chewing on some really large sets like Cerutti’s protein fragments, but they’re nearly done.

  • DD – There are some pending PRs that are close to submission. I’ll prioritize getting those in and following up with submitters. I also need to submit the pharma partner benchmark sets.

  • JW – Will the new submission standards be changed to have dlc?

    • DD + TG – Yes. There’s a variant called default-dlc that can be used (since we’ll still want tric for multi-molecule submissions).

  • SB – I’d be interested in a more targeted set for WBOs.

    • DD – The queue should be totally open

    • SB and PB will sync up to prepare a cleaner WBO set.

    • TG – SB, there’s also a new set intended to study WBOs submitted by JM and Chris Bayly.

User questions

  • HP – Re: PEPCONF. You’re using DLC for PEPCONF now. Have you tried internal coordinate systems other than tric or dlc (or cartesian coordinates)? Does the time per step change with the choice of coordinates? That is, if each iteration takes more time, the savings might not be great overall.

    • DD – We haven’t checked this yet.

    • HP – The advantage of TRIC should be that it can converge faster.

    • (General) – OpenFF hasn't tried coordinate systems other than tric and dlc. In theory, the SCF convergence should be the expensive part, not the coordinate transformations/updates.

    • HP will work with DD on trying different solutions for PEPCONF issues.

  • HP – Regarding QCFractal’s built-in job deduplication. It will store and deduplicate job outputs, and also allow for importing jobs completed elsewhere.

    • BP – QCFractal only deduplicates on input – It looks at (program, method, basis, keywords, molecule), and will identify duplicates if those match. The molecule is stored as a QCSchema molecule (also referred to as a QCElemental molecule). This can store connectivity (with bond order).

  • PB – In PEPCONF, some runs were failing with SCF convergence errors because the density didn’t converge. Trying out second-order methods worked pretty well. Would it be OK to use second-order SCF methods as a backup?

    • BP – I don’t see a problem with that, though QCA will see that as a different calculation.

    • PB – Second-order methods and damping helped a lot of our failures converge.

    • DD – We could have our datasets be something like a frankenstein of jobs run with slightly different settings.

    • JW – How does cost scale with second-order methods?

    • PB – Goes from N^4 to N^6

    • TG – Would be interested in finding a workflow/combination of settings that uses second-order when appropriate.

    • PB – Daniel Smith also mentioned that there’s an issue with getting stuck in local minima when using second-order methods. http://forum.psicode.org/t/orbital-gradient-rms-convergence-issue/342
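To illustrate BP's point that deduplication happens on inputs only, here is a minimal sketch of an input-keyed store. It is not QCFractal's actual implementation: the `dedup_key` function, the `InputStore` class, the lowercasing of names, and the JSON/SHA-256 hashing are all assumptions made for illustration. Only the five fields (program, method, basis, keywords, molecule) come from the notes.

```python
import hashlib
import json


def dedup_key(program, method, basis, keywords, molecule):
    """Build a deterministic key from the five input fields named in the notes.

    Hypothetical helper: lowercasing and JSON hashing are illustrative choices.
    """
    payload = {
        "program": program.lower(),
        "method": method.lower(),
        "basis": basis.lower(),
        "keywords": keywords,
        "molecule": molecule,
    }
    # sort_keys makes the serialized text independent of dict insertion order
    text = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class InputStore:
    """Toy in-memory store that keeps one record per unique input key."""

    def __init__(self):
        self._records = {}

    def add(self, **spec):
        """Return (key, is_new); identical inputs map to the same key."""
        key = dedup_key(**spec)
        is_new = key not in self._records
        if is_new:
            self._records[key] = spec
        return key, is_new
```

Under this scheme, adding `soscf: true` to the keywords produces a different key, which matches BP's remark that QCA would see a second-order SCF run as a different calculation.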


PEPCONF sync-up

  • TG: Presenting results from his detailed analysis of PEPCONF failures: QCE/Psi4 notes

    • primarily focused on memory usage as hypothesis

    • finds at least one case where QCEngine Unknown Error occurs but memory limits well-obeyed

    • appears that psi4 is failing on a step, and it may be a random error, since DD was able to get 34754174 to complete, but TG saw failures at different optimization steps

      • psi4 yields None for that last step, not clear why it fails

  • Next steps:

    • TG: will run the failed optimization step observed as a standalone qcengine.compute call

      • DD: Try running ~100 times, see what fraction fail; would establish whether this is indeed a random error and give a sense of its degree

  • TG: not getting any stdout for psi4 failed optimization step

    • BP+DD: surprised by this, should see something; perhaps if psi4 is crashing hard it’s not giving back stdout

    • DD: Psi4Harness could be improved here; looks like it drops stdout if success is false for our codepath [followup]

  • PB: SCF convergence

    • Ran 3 additional cases of SCF failures;

      • 34754755: converges with soscf: true

      • 34754734: converges with soscf: true

      • 34754962: converges with soscf: true

    • Ran 34754734 locally and on the cluster

      • locally constrained with 4GiB memory; and failed with QCEngine Unknown Error

      • Also noticed difference in energy values in laptop vs. cluster run

        • TG: appears to be because algorithm is Core on cluster, Disk on local; may be due to precision differences when doing writes/reads to disk vs. keeping everything in memory

  • DD: working conclusion:

    • memory constraints can cause QCEngine Unknown Error

    • we also see at least one other mode caused by (so far, apparently) random psi4 errors for challenging molecules

  • HP: would like access to Confluence to see Trevor’s analysis
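DD's suggestion under "Next steps" (run the failing step about 100 times and see what fraction fail) can be sketched as a small harness. This is a sketch under stated assumptions: `failure_fraction` and `make_flaky_step` are hypothetical names, and the flaky step is a seeded random stand-in rather than a real psi4 calculation. In practice `run_once` would wrap the standalone `qcengine.compute` call TG plans to make and report whether it succeeded.

```python
import random


def failure_fraction(run_once, n_trials=100):
    """Call run_once() n_trials times and return (failures, fraction_failed).

    run_once should return a truthy value on success; an exception or a
    falsy return value is counted as a failure.
    """
    failures = 0
    for _ in range(n_trials):
        try:
            ok = bool(run_once())
        except Exception:
            ok = False
        if not ok:
            failures += 1
    return failures, failures / n_trials


def make_flaky_step(p_fail, seed=0):
    """Hypothetical stand-in for a calculation that fails with probability p_fail."""
    rng = random.Random(seed)

    def run_once():
        return rng.random() >= p_fail

    return run_once


if __name__ == "__main__":
    failures, fraction = failure_fraction(make_flaky_step(p_fail=0.3), n_trials=100)
    print(f"{failures} of 100 trials failed ({fraction:.0%})")
```

A fraction near 0 or 1 would argue against a random error, while an intermediate value would support it and give a sense of its degree.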

 

  • PB: for Trevor, how do you change task details in your script?

    • TG: should be able to, it’s a dictionary

    • DD: if you find you need to change a pydantic object like task, can call task.dict() to get out a dictionary form

Action items

@David Dotson will set up test-track Dataset Tracking board for routing selected submissions to test server
@David Dotson will schedule a time to work with Heejune for a crash course on accessing QCArchive data
@Simon Boothroyd and @Pavan Behara will prepare a new, informative, and fairly large WBO set
@Trevor Gokey will repeatedly run psi4 gradient calculation for optimization step(s) that failed from his detailed testing with PEPCONF; collect statistics on frequency of failure vs. success
@Pavan Behara may continue trying different SCF parameter choices with SCF-errored PEPCONF; offer recommendations on paths forward we could try for new datasets; perhaps even experiment with new PEPCONF spec submissions on subsets of molecules
@David Dotson will consider ways we might iteratively apply more expensive compute specs to failing cases to yield complete datasets when merged across compute specs
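One possible shape for the last action item, applying progressively more expensive compute specs to failing cases and merging the results, is sketched below. Everything here is hypothetical: `run_with_fallbacks`, `merge_dataset`, the spec names, and the `run_job` callable are placeholders, and no QCFractal API is used. The only detail taken from the notes is the `soscf: true` keyword that PB found helpful.

```python
def run_with_fallbacks(job_id, specs, run_job):
    """Try each (name, keywords) spec in order; stop at the first success.

    run_job(job_id, keywords) is a caller-supplied callable that returns a
    result on success and None on failure.
    """
    for name, keywords in specs:
        result = run_job(job_id, keywords)
        if result is not None:
            return name, result
    return None, None


def merge_dataset(job_ids, specs, run_job):
    """Build one merged view, recording which spec produced each result."""
    merged = {}
    for job_id in job_ids:
        name, result = run_with_fallbacks(job_id, specs, run_job)
        merged[job_id] = {"spec": name, "result": result}
    return merged


if __name__ == "__main__":
    # Cheapest spec first; the second-order variant is only tried on failure.
    specs = [("default", {}), ("default-soscf", {"soscf": True})]

    def fake_run(job_id, keywords):
        # Stand-in: pretend "job-b" only converges with soscf enabled.
        if job_id == "job-b" and not keywords.get("soscf"):
            return None
        return f"result for {job_id}"

    print(merge_dataset(["job-a", "job-b"], specs, fake_run))
```

Recording the spec name next to each result keeps the merged dataset honest about the fact that its entries were computed with slightly different settings.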

Decisions