2021-07-23 QC meeting notes

Participants

  • @Jeffrey Wagner

  • @Pavan Behara

  • Ben Pritchard

  • @David Dotson

  • @Chapin Cavender

  • @Heejune Park

  • @Joshua Horton

Goals

  • User questions/issues, new submissions

  • Science support needs

  • Infrastructure needs / advances

Discussion topics

Item

Presenter

Notes

Complete submissions

David

  • qca-ds-sub#215 complete

Running submissions

David

  • DD – BCC refit study → Low CPU resource usage. Any ideas as to why this could be?

    • SB – It could be that the PCM solver is single-threaded.

    • DD – Lots of SCF errors on the BCC refit set.

    • SB – This is the first set where we’ve passed kwargs through QCEngine. Maybe we need a finer grid spacing or some other change?

    • DD – I’ll set it to high and we can hopefully get more error messages.

    • (JH, later in meeting) – It looks like the “version 2” dataset is optimization, whereas “version 1” was single points. So we should check with SB about whether this was intentional.

      • DD – Is it possible that it takes more than 200 steps to converge?

      • BP – 200 steps is a lot. In my experience, solvation shouldn’t be that radically different. So I’d think that they shouldn’t be having this issue.

  • PB – You can make the MP2 submission priority:normal

  • SB – We can also make the aniline impropers into priority:normal. We probably need a different general strategy for this.

New submissions

 

  • SB: not sure how expensive 2d protein scans are

    • should we put together a test set that assesses expense?

    • good exercise for Chapin?

    • CC: that is along the lines of what I’m thinking

      • would be happy to drive that

    • JW: agree, could see this as a way to do bounds-checking

      • e.g. pick tripeptides or dipeptides, submitting 2d torsion-scan with these

    • SB: could be done outside of main QCArchive as well, might be useful as a local HPC execution with own server

    • CC – Could run in Triton cluster at UCSD

    • SB – Could we run these on a separate QCA queue? Like openff-test?

      • DD – Yes. This would be a good idea.

    • JW: this would be a way to probe how much pre-emptible resources are costing us in terms of wasted error cycling, since dedicated resources wouldn’t be pre-emptible

  • Next step: CC should work on opening a PR to qca-dataset-submission, and can work with DD on it. SB had previously submitted a 2D scan, but it was proper+improper. Some recent changes to QCSubmit should make it easier to do 2D torsiondrives.

  • CC and DD will schedule a working session to make this initial submission.

  • CC – How should we check out the “torsionscan using fast method, followed by high-accuracy QM from that starting point” idea?

    • DD – Let’s do the default settings for the first submission. After that, we can look into strategies for doing the two-tier approach. It should be fairly straightforward, but it’ll be good to keep it simple for now.
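As a rough illustration of why 2D scan expense is worth bounds-checking, the sketch below counts grid points for a full 360° scan. The spacings shown are assumptions for illustration, not settings agreed in the meeting, and the function name is hypothetical. Each grid point needs at least one constrained optimization, so the count is only a lower bound on the number of QM jobs.

```python
def grid_points(spacing_deg: int, dims: int = 2) -> int:
    """Number of grid points in a full 360-degree scan over `dims` torsions."""
    if spacing_deg <= 0 or 360 % spacing_deg != 0:
        raise ValueError("spacing must be a positive divisor of 360")
    return (360 // spacing_deg) ** dims


# Compare 1D and 2D scans at two example spacings (assumed values).
for spacing in (15, 30):
    n1 = grid_points(spacing, dims=1)
    n2 = grid_points(spacing, dims=2)
    print(f"{spacing} deg spacing: 1D = {n1} points, 2D = {n2} points")
```

At 15° spacing this gives 24 points for a 1D scan and 576 for a 2D scan, which is why a small test set of dipeptides or tripeptides is a sensible first probe.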

Industry Benchmarks COMPLETE → ERROR?

David

  • DD – Sometimes the COMPLETE numbers drop on subsequent days. Bill Swope at Genentech had seen something like this as well. Is it possible that we have duplicate molecules, where a first job succeeds, but another instance of it fails?

    • BP – Could be something in the status code?

    • DD – could this be a case of more than one task in the task queue pointing to the same result record?

    • BP – It’s impossible in our database for two tasks to point to the same result.

    • DD – This rules out that explanation. I’ll investigate whether status reporting differs between lifecycle and ds.status.

  • BP – When did the drop in the number of completes happen?

    • DD – Between July 5 and July 6 (between 5:46 AM Pacific on the 5th and 5:49 AM Pacific July 6). It looks like they turned into errors, though those could have been recycled to INCOMPLETE

    • BP – I may have recycled tasks assigned to inactive managers, so that could be the root cause.
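One way to pin down which records dropped out of COMPLETE is to save a per-record status snapshot each day and diff consecutive snapshots. The sketch below assumes the snapshots are plain dicts mapping a record id to a status string; the ids and statuses are made-up placeholders, and the function is not part of QCPortal.

```python
from collections import Counter


def status_regressions(before: dict, after: dict) -> dict:
    """Return {record_id: (old, new)} for records that were COMPLETE and no longer are."""
    return {
        rid: (old, after.get(rid, "MISSING"))
        for rid, old in before.items()
        if old == "COMPLETE" and after.get(rid, "MISSING") != "COMPLETE"
    }


# Placeholder snapshots standing in for the July 5 and July 6 reports.
july5 = {"rec-a": "COMPLETE", "rec-b": "COMPLETE", "rec-c": "ERROR"}
july6 = {"rec-a": "COMPLETE", "rec-b": "ERROR", "rec-c": "INCOMPLETE"}

changed = status_regressions(july5, july6)
summary = Counter(new for _, new in changed.values())
print(changed)
print(summary)
```

The summary counter shows where the lost completes ended up (for example ERROR versus INCOMPLETE), which would help tell a task recycle apart from a reporting difference.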

Infrastructure support needs



  • SB – Next steps for torsiondrives in QCEngine?

    • SB – We could vendor this code if it doesn’t seem like QCEngine#305 will be approved. But I could use some feedback/insight on how it’s looking to the maintainers.

    • BP – I can take a look at this, but even after this gets into QCEngine, it’s unlikely it’d be available very soon in QCF. The big thing there is that the results would be a different shape than what QCFractal usually receives.

    • DD – We don’t need this implementation in QCFractal; can continue with existing service-based implementation

    • DD – Will ping Lori for a quick review, or at least to confirm there are no objections.

  • HP: I recently enabled the NEB method in geomeTRIC, and I want to run it in QCFractal. So I updated run_json.py. What’s the pathway to enable this in QCPortal/QCArchive?

    • Basically, in geomeTRIC, there’s a new file neb.py. In the nudged elastic band method, we start with the molecule stretched out, and we gradually minimize it (basically, running many constrained optimizations). So the input has to look something like a trajectory.

    • DD – Is this geomeTRIC #132 (titled NEB update)?

    • HP – Yes. I was able to run this using QCEngine.

    • HP – My goal is for the user to submit a trajectory through QCPortal, and then for the calculations to be automated.

    • DD – It seems like there would be many steps to getting this to work.

    • HP – In my local testing, I’ve tried to get it running without making any changes to QCEngine. But it seems like I’ll need to make changes to QCElemental to accept a “chain” of structures (like a trajectory).
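To make the “chain of structures” idea concrete, here is a minimal sketch of what such an input container could look like. It is a hypothetical stand-in, not the QCElemental schema or anything from the geomeTRIC PR: the class name, field names, and example coordinates are all assumptions. The only point is that every frame shares one atom ordering.

```python
from dataclasses import dataclass
from typing import List, Tuple

Coords = List[Tuple[float, float, float]]


@dataclass
class Chain:
    """Hypothetical container: one set of symbols, several frames of coordinates."""

    symbols: List[str]
    frames: List[Coords]

    def __post_init__(self) -> None:
        if len(self.frames) < 2:
            raise ValueError("a chain needs at least two frames")
        for i, frame in enumerate(self.frames):
            if len(frame) != len(self.symbols):
                raise ValueError(
                    f"frame {i} has {len(frame)} atoms, expected {len(self.symbols)}"
                )

    def to_dict(self) -> dict:
        """Plain-dict form that could be serialized to JSON."""
        return {
            "symbols": list(self.symbols),
            "frames": [[list(xyz) for xyz in frame] for frame in self.frames],
        }


# Placeholder example: a diatomic with three frames at increasing separation.
chain = Chain(
    symbols=["H", "H"],
    frames=[
        [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)],
        [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0)],
        [(0.0, 0.0, 0.0), (0.0, 0.0, 2.6)],
    ],
)
print(len(chain.frames), "frames,", len(chain.symbols), "atoms each")
```

Validating atom counts up front is a design choice here, so that a malformed trajectory would fail at submission time rather than partway through a calculation.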

Issues from outage the other week?

Ben Pritchard

  • BP – Did any issues arise from the outage last weekend?

    • DD – We have a submission (the industry benchmark MM set, QCA-dataset-submission #208). Initially (in June) we got connection refused, then “too large” issues, then BP increased the submission size limit. Now getting:

      • requests.exceptions.ChunkedEncodingError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

    • DD – I tried to reproduce this locally on my machines and went way down into the stack. It looks like an HTTP problem at some level.

    • DD – My workaround is that, since all the datasets are on there, but we never called “ds.compute”, I’m looping through and trying to run compute on each one. So it eventually works, but there’s a lot of cross-communication (like, one request for each of the 75k jobs?)

    • BP – From my side, I’m seeing timeouts and “connection reset by peer” with nginx

    • BP and DD will continue debugging this
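The per-dataset workaround DD describes can be sketched as a loop that retries each submission on transient connection errors. This is not the actual script: `submit_one` is a hypothetical callable standing in for whatever issues the compute request, and the retry counts are assumptions. The default catches OSError because the exceptions raised by requests, including ChunkedEncodingError, derive from it.

```python
import time
from typing import Callable, Iterable, List, Tuple, Type


def compute_each(
    entries: Iterable[str],
    submit_one: Callable[[str], None],
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    max_attempts: int = 3,
    delay_s: float = 0.0,
) -> List[str]:
    """Call submit_one per entry, retrying on errors; return entries that never succeeded."""
    failed = []
    for entry in entries:
        for attempt in range(1, max_attempts + 1):
            try:
                submit_one(entry)
                break
            except retry_on:
                if attempt == max_attempts:
                    failed.append(entry)
                else:
                    time.sleep(delay_s * attempt)  # simple linear backoff
    return failed


# Fake submitter that drops the first request for each entry (simulation only).
calls = {}


def flaky_submit(name: str) -> None:
    calls[name] = calls.get(name, 0) + 1
    if calls[name] == 1:
        raise ConnectionResetError("simulated dropped connection")


failed = compute_each(["ds-1", "ds-2"], flaky_submit)
print("failed:", failed, "calls:", calls)
```

Returning the list of entries that never succeeded makes it possible to rerun only those, instead of issuing one request per job across the whole set again.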

Action items

@David Dotson and @Chapin Cavender will meet for a working session on assembling protein torsiondrive dataset
@David Dotson will debug submission failure of additional compute specs on large industry benchmark set with Ben Pritchard

Decisions