2021-02-16 QCFractal Users Meeting notes

Date

16 Feb 2021

Participants

Discussion topics

Item	Notes
Updates from MolSSI	BP – No upcoming releases or outages planned.
	BP – Refactor is going well. A test server will come online soon. We’d like you to use it to see if anything breaks. The test server will be “fresh”: it will not contain existing outputs.
	DD – I could set up a tracking board just for that server and have some of the QCA dataset submissions go to the test server.
	BP – I’d like this server to be something like a continuously deployed version of master.
Queue status	QCA Dataset Tracking board:
	DD – Been working on PEPCONF. Ran it for about 1.5 weeks with dlc coordinates and reset=True. About 2000 of 7000 jobs have completed; the rest errored. We are focusing on two “unknown” errors and working on a few hypotheses:
	DD and TG are looking into memory usage.
	PB is looking into SCF convergence errors.
	DD – Running some of these jobs locally to reproduce (2021-02-12 QCArchive - PEPCONF Investigation 2 Meeting notes).
	DD – Trying various ways to measure memory usage. None of the measurements seem unreasonable.
	TG – I got unknown errors for these two jobs (QCE/Psi4 notes).
	DD – Queue is emptying out. We’ve been chewing on some really large sets like Cerutti’s protein fragments, but they’re nearly done.
	DD – There are some pending PRs that are close to submission. I’ll prioritize getting those in and following up with submitters. I also need to submit the pharma partner benchmark sets.
	JW – Will the new submission standards be changed to use `dlc`?
	DD + TG – Yes. There’s a variant called `default-dlc` that can be used (since we’ll still want `tric` for multi-molecule submissions).
	SB – I’d be interested in a more targeted set for WBOs.
	DD – The queue should be totally open.
	SB and PB will sync up to prepare a cleaner WBO set.
	TG – SB, there’s also a new set intended to study WBOs submitted by JM and Chris Bayly.
User questions	HP – Re: PEPCONF. You’re using DLC now for PEPCONF. Have you tried internal coordinate systems other than `tric` or `dlc` (or Cartesian coordinates)? Does the time per step change with the choice of coordinates? That is, if each iteration takes more time, the savings might not be great overall.
	DD – We haven’t checked this yet.
	HP – The advantage of TRIC should be that it can converge faster.
	(General) – OpenFF hasn’t tried coordinate systems other than `tric` and `dlc`. In theory, the SCF convergence should be the expensive part, not the coordinate transformations/updates.
	HP will work with DD on trying different solutions for PEPCONF issues.
	HP – Regarding QCFractal’s built-in job deduplication. It will store and deduplicate job outputs, and also allow for importing jobs completed elsewhere.
	BP – QCFractal only deduplicates on input. It looks at (program, method, basis, keywords, molecule) and will identify duplicates if those match. The molecule is stored as a QCSchema molecule (also referred to as a QCElemental molecule). This can store connectivity (with bond order).
	PB – In PEPCONF, some runs were failing with SCF convergence errors because the density didn’t converge. Trying out second-order methods worked pretty well. Would it be OK to use second-order SCF methods as a backup?
	BP – I don’t see a problem with that, though QCA will see that as a different calculation.
	PB – Second-order methods and damping helped a lot of our failures converge.
	DD – We could have our datasets be something like a Frankenstein of jobs run with slightly different settings.
	JW – How does cost scale with second-order methods?
	PB – Goes from N^4 to N^6.
	TG – Would be interested in finding a workflow/combination of settings that uses second-order when appropriate.
	PB – Daniel Smith also mentioned that there’s an issue with getting stuck in local minima when using second-order methods. http://forum.psicode.org/t/orbital-gradient-rms-convergence-issue/342
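To make BP’s point about input-only deduplication concrete, here is a minimal sketch of how a key built only from the input fields behaves. It is not QCFractal’s actual implementation: the `spec_key` helper, the SHA-256/JSON hashing scheme, the case normalization, and the method, basis, and water geometry are all assumptions chosen for illustration. Only the list of fields and the `soscf` keyword come from the discussion above.

```python
import hashlib
import json


def spec_key(program, method, basis, keywords, molecule):
    """Return a stable hash of the input fields that define a calculation.

    Hypothetical helper: illustrates deduplication on input only.
    """
    spec = {
        # Case normalization is an assumption of this sketch.
        "program": program.lower(),
        "method": method.lower(),
        "basis": basis.lower(),
        "keywords": keywords,
        "molecule": molecule,
    }
    # sort_keys makes the hash independent of dict insertion order.
    payload = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Placeholder molecule in a QCSchema-like layout (illustrative values only).
water = {
    "symbols": ["O", "H", "H"],
    "geometry": [0.0, 0.0, 0.0, 0.0, 1.43, 1.11, 0.0, -1.43, 1.11],
    "connectivity": [[0, 1, 1.0], [0, 2, 1.0]],
}

default_key = spec_key("psi4", "b3lyp-d3bj", "dzvp", {}, water)
repeat_key = spec_key("psi4", "B3LYP-D3BJ", "DZVP", {}, water)
soscf_key = spec_key("psi4", "b3lyp-d3bj", "dzvp", {"soscf": True}, water)

print(default_key == repeat_key)  # same inputs, so treated as a duplicate
print(default_key == soscf_key)   # keywords differ, so a different calculation
```

Under these assumptions the first comparison is `True` and the second is `False`, which mirrors the remark that a second-order SCF rerun would be seen as a different calculation rather than a retry of the original one.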
PEPCONF sync-up	TG: Presenting results from his detailed analysis of PEPCONF failures (QCE/Psi4 notes):
	primarily focused on memory usage as the hypothesis
	finds at least one case where `QCEngine Unknown Error` occurs but memory limits are well obeyed
	appears that `psi4` is failing on a step, and it may be a random error, since DD was able to get 34754174 to complete, but TG saw failures at different optimization steps
	`psi4` yields `None` for that last step; not clear why it fails
	Next steps:
	TG: will run the failed optimization step observed as a standalone `qcengine.compute` call
	DD: try running ~100 times and see what fraction fail; this would establish whether it is indeed a random error and give a sense of its degree
	TG: not getting any stdout for the failed `psi4` optimization step
	BP+DD: surprised by this, should see something; perhaps if `psi4` is crashing hard it’s not giving back `stdout`
	DD: `Psi4Harness` could be improved here; looks like it drops stdout if success is false for our codepath [followup]
	PB: SCF convergence
	Ran 3 additional cases of SCF failures:
	34754755: converges with soscf: true
	34754734: converges with soscf: true
	34754962: converges with soscf: true
	Ran 34754734 locally and on the cluster
	locally constrained with 4 GiB memory, and failed with `QCEngine Unknown Error`
	Also noticed a difference in energy values between the laptop and cluster runs
	TG: appears to be because the algorithm is `Core` on the cluster and `Disk` locally; may be due to precision differences when doing writes/reads to disk vs. keeping everything in memory
	DD: working conclusion:
	memory constraints can cause `QCEngine Unknown Error`
	we also see at least one other mode caused by (so far, apparently) random `psi4` errors for challenging molecules
	HP: would like access to Confluence to see Trevor’s analysis
	PB: for Trevor, how do you change task details in your script?
	TG: should be able to, it’s a dictionary
	DD: if you find you need to change a pydantic object like `task`, you can call `task.dict()` to get a dictionary form
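DD’s suggestion of rerunning the failed step about 100 times and counting failures can be sketched as follows. This is a hedged illustration, not the script used for the investigation: `failure_fraction` and `make_flaky_step` are hypothetical names, and the stand-in job fails deterministically on every fourth call so the sketch runs without Psi4. In real use, `run_once` could wrap a standalone `qcengine.compute(input_data, "psi4")` call and return the result’s `success` attribute, assuming the failed step has been captured as a QCSchema input.

```python
import math


def failure_fraction(run_once, n_trials=100):
    """Call run_once() n_trials times; return (fraction failed, standard error).

    run_once must return a truthy value on success. Exceptions count as failures.
    """
    failures = 0
    for _ in range(n_trials):
        try:
            ok = bool(run_once())
        except Exception:
            ok = False
        if not ok:
            failures += 1
    p = failures / n_trials
    # Binomial standard error of the estimated failure probability.
    stderr = math.sqrt(p * (1.0 - p) / n_trials)
    return p, stderr


def make_flaky_step(period=4):
    """Stand-in for the real job (assumption): fails on every `period`-th call."""
    state = {"calls": 0}

    def run_once():
        state["calls"] += 1
        return state["calls"] % period != 0

    return run_once


fraction, stderr = failure_fraction(make_flaky_step(), n_trials=100)
print(f"failed {fraction:.0%} of runs (+/- {stderr:.1%})")
```

A fraction close to 0 or 1 would point to a deterministic cause, while an intermediate value with a small standard error would support the random-error hypothesis.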

Action items

Decisions
