BP – Refactor is going well. Test server will come online soon – We’d like you to use this to see if anything breaks. Test server will be “fresh”/will not contain existing outputs.
DD – I could set up a tracking board just for that server, and have some of the QCA dataset submissions go to the test server.
BP – I’d like for this server to be something like a continuously-deployed version from master.
Queue status
QCA Dataset Tracking board:
DD – Been working on PEPCONF. Ran it for about 1.5 weeks with dlc coordinates and reset=True. We’ve had about 2000 complete out of 7000 jobs. The rest are errored. Focusing on two “unknown” errors.
DD – Trying various ways to measure memory usage. None of them seem unreasonable.
TG – I got unknown errors for these two jobs. QCE/Psi4 notes
DD – Queue is emptying out. We’ve been chewing on some really large sets like Cerutti’s protein fragments, but they’re nearly done.
DD – There are some pending PRs that are close to submission. I’ll prioritize getting those in and following up with submitters. I also need to submit the pharma partner benchmark sets.
JW – Will the new submission standards be changed to have dlc?
DD + TG – Yes. There’s a variant called default-dlc that can be used (since we’ll still want tric for multi-molecule submissions).
SB – I’d be interested in a more targeted set for WBOs.
DD – The queue should be totally open
SB and PB will sync up to prepare a cleaner WBO set.
TG – SB, there’s also a new set intended to study WBOs submitted by JM and Chris Bayly.
User questions
HP – Re: PEPCONF. You’re using DLC now for PEPCONF. Have you tried internal coordinate systems other than tric or dlc (or Cartesian coordinates)? Does the time per step change with the choice of coordinates? That is, if each iteration takes more time, the saving might not be great overall.
DD – We haven’t checked this yet.
HP – The advantage of TRIC should be that it can converge faster.
(General) – OpenFF hasn't tried coordinate systems other than tric and dlc. In theory, the SCF convergence should be the expensive part, not the coordinate transformations/updates.
HP will work with DD on trying different solutions for PEPCONF issues.
HP – Regarding QCFractal’s built-in job deduplication. It will store and deduplicate job outputs, and also allow for importing jobs completed elsewhere.
BP – QCFractal only deduplicates on input – it looks at (program, method, basis, keywords, molecule), and will identify duplicates if those match. The molecule is stored as a QCSchema molecule (also referred to as a QCElemental molecule). This can store connectivity (with bond order).
PB – In PEPCONF, some runs were failing with SCF convergence errors because the density didn’t converge. Trying out second-order methods worked pretty well. Would it be OK to use second-order SCF methods as a backup?
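(Note) A minimal sketch of what deduplicating on input could look like, assuming the calculation is identified by the five fields BP lists. This is not QCFractal's actual implementation: `spec_key`, `ResultStore`, and the dictionary layout of the molecule are hypothetical names chosen for illustration.

```python
import hashlib
import json


def spec_key(program, method, basis, keywords, molecule):
    """Return a deterministic hash of the fields that define a calculation input."""
    payload = {
        "program": program.lower(),
        "method": method.lower(),
        "basis": basis.lower(),
        "keywords": keywords,
        "molecule": molecule,
    }
    # sort_keys makes the serialization independent of dict insertion order.
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


class ResultStore:
    """Hypothetical in-memory store that only accepts one job per input key."""

    def __init__(self):
        self._results = {}

    def submit(self, **spec):
        key = spec_key(**spec)
        is_new = key not in self._results
        if is_new:
            self._results[key] = None  # placeholder until the job completes
        return key, is_new


# Illustrative molecule record (field names are assumptions, not the QCSchema spec).
water = {
    "symbols": ["O", "H", "H"],
    "geometry": [0.0, 0.0, 0.0, 0.0, 1.43, 1.11, 0.0, -1.43, 1.11],
    "connectivity": [[0, 1, 1.0], [0, 2, 1.0]],
}

store = ResultStore()
common = dict(program="psi4", method="b3lyp-d3bj", basis="dzvp", molecule=water)

k1, new1 = store.submit(keywords={"maxiter": 200}, **common)
k2, new2 = store.submit(keywords={"maxiter": 200}, **common)  # same input
k3, new3 = store.submit(keywords={"maxiter": 200, "soscf": True}, **common)  # different keywords
```

Under this scheme the second submission maps to the same key as the first, while changing any keyword produces a new key and therefore a separate calculation.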
BP – I don’t see a problem with that, though QCA will see that as a different calculation.
PB – Second-order methods and damping helped a lot of our failures converge.
DD – We could have our datasets be something like a frankenstein of jobs run with slightly different settings.
JW – How does cost scale with second-order methods?
PB – Goes from N^4 to N^6
TG – Would be interested in finding a workflow/combination of settings that uses second-order when appropriate.
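(Note) One way the "second-order as a backup" idea could be sketched, assuming jobs are retried with progressively stronger SCF settings only after the default settings fail. `run_with_fallback`, `FALLBACKS`, and `fake_run_job` are hypothetical; `soscf` and `damping_percentage` are Psi4 option names, but the values here are illustrative and not settings anyone in the meeting agreed on.

```python
def run_with_fallback(run_job, base_keywords, fallbacks):
    """Try the base keywords first, then each fallback override in order.

    run_job(keywords) is a hypothetical callable returning (success, result).
    Returns the keyword set that succeeded together with its result,
    or (None, None) if every attempt failed.
    """
    attempts = [dict(base_keywords)]
    attempts += [{**base_keywords, **extra} for extra in fallbacks]
    for keywords in attempts:
        success, result = run_job(keywords)
        if success:
            return keywords, result
    return None, None


# Ordered from cheapest to most aggressive (illustrative values).
FALLBACKS = [
    {"soscf": True},
    {"soscf": True, "damping_percentage": 20},
]


def fake_run_job(keywords):
    """Stand-in for a real compute call: pretend the job only converges with SOSCF."""
    if keywords.get("soscf"):
        return True, {"energy": None}
    return False, None


used_keywords, result = run_with_fallback(fake_run_job, {"maxiter": 200}, FALLBACKS)
```

Returning the keywords that actually succeeded matters because, as BP points out, QCA treats each keyword set as a different calculation, so the dataset needs to record which settings produced each result.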
TG: Presenting results from his detailed analysis of PEPCONF failures: QCE/Psi4 notes
primarily focused on memory usage as hypothesis
finds at least one case where QCEngine Unknown Error occurs but memory limits well-obeyed
appears that psi4 is failing on a step, and it may be a random error, since DD was able to get 34754174 to complete, but TG saw failures at different optimization steps
psi4 yields None for that last step, not clear why it fails
Next steps:
TG: will run the failed optimization step observed as a standalone qcengine.compute call
DD: Try running ~100 times, see what fraction fail; would establish whether this is indeed a random error and give a sense of its degree
TG: not getting any stdout for psi4 failed optimization step
BP+DD: surprised by this, should see something; perhaps if psi4 is crashing hard it’s not giving back stdout
DD: Psi4Harness could be improved here; looks like it drops stdout if success is false for our codepath [followup]
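(Note) A rough sketch of the repeat-and-count experiment DD suggests, assuming each run can be wrapped in a function that returns True on success. `failure_fraction` and `fake_run_once` are hypothetical; in practice `run_once` would wrap the standalone compute call for the failed optimization step. The 30% failure rate in the stand-in is invented for illustration and is not a measured value.

```python
import math
import random


def failure_fraction(run_once, n_trials=100):
    """Run the same job n_trials times and estimate the failure probability.

    Returns (failures, fraction, (low, high)), where (low, high) is an
    approximate 95% Wilson score interval on the failure fraction.
    """
    failures = sum(0 if run_once() else 1 for _ in range(n_trials))
    p = failures / n_trials
    z = 1.96  # normal quantile for a ~95% interval
    denom = 1 + z**2 / n_trials
    center = (p + z**2 / (2 * n_trials)) / denom
    half = z * math.sqrt(p * (1 - p) / n_trials + z**2 / (4 * n_trials**2)) / denom
    return failures, p, (max(0.0, center - half), min(1.0, center + half))


# Stand-in for the real job: fails at random with an assumed 30% probability.
rng = random.Random(0)


def fake_run_once():
    return rng.random() > 0.3


failures, fraction, (low, high) = failure_fraction(fake_run_once, n_trials=100)
print(f"{failures}/100 failed; 95% interval for failure rate: {low:.2f} to {high:.2f}")
```

A fraction strictly between 0 and 1 for identical inputs would support the random-error hypothesis; a fraction of exactly 0 or 1 would point back toward a deterministic cause such as memory limits.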
PB: SCF convergence
Ran 3 additional cases of SCF failures;
34754755: converges with soscf: true
34754734: converges with soscf: true
34754962: converges with soscf: true
Ran 34754734 locally and on the cluster
locally constrained with 4GiB memory; and failed with QCEngine Unknown Error
Also noticed difference in energy values in laptop vs. cluster run
TG: appears to be because algorithm is Core on cluster, Disk on local; may be due to precision differences when doing writes/reads to disk vs. keeping everything in memory
DD: working conclusion:
memory constraints can cause QCEngine Unknown Error
we also see at least one other mode caused by (so far, apparently) random psi4 errors for challenging molecules
HP: would like access to Confluence to see Trevor’s analysis
PB: for Trevor, how do you change task details in your script
TG: should be able to, it’s a dictionary
DD: if you find you need to change a pydantic object like task, can call task.dict() to get out a dictionary form