2021-12-13 Core Developers meeting notes

Participants

  • @Chapin Cavender

  • @David Dotson

  • @Matt Thompson

  • @Pavan Behara

  • @Jeffrey Wagner

Discussion topics

General updates

  • JW – Working just Monday this week and the first half of next week; mostly offline other than that.

  • DD – Mostly offline the 23rd through the 28th.


Individual updates

  • CC

    • Had a family emergency last week, worked ~half time.

    • Mostly worked on resolving TSCC manager issues, which turned out to be due to exhausting scratch space. Will use shorter runtimes in the future to see if that also helps.

      • DD – More details?

      • CC – It’s showing up as an IOError from Psi4 that says it’s unable to write files.

      • DD – If a manager is killed “unceremoniously” then it won’t do the cleanup, but normally QCEngine will do its own cleanup.

      • CC – I’m seeing ~200GB of scratch space used from the way that I’m running it

      • DD – We’ve seen the same thing on Lilac. If a manager is running a long-running calculation, it won’t clean up until the last task in the process pool has finished. And if SIGTERM or SIGKILL is received, the cleanup doesn’t actually happen within the ~30 seconds that are allotted. (See the sketch at the end of this thread.)

      • CC – Most of my calculations have run to their full walltime.

      • DD – I’d like to chat more about this. Working on QCFractal (specifically the PR below) to make this cleaner.

      • DD – Do you have a workaround for now?

        • CC – Yes, restarting seems to help

      • PB – I had a similar issue with running lots of managers on the same node, where each one uses ~40-50GB of scratch, and they exceed the available space. It would be good to ask the sysadmins if there’s another, larger scratch available

      • CC – I can ask that. I’ll also reduce the number of tasks per worker (currently 8), since that much concurrent I/O may be slowing everything down.

        • DD – Agree.
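        A minimal sketch of the cleanup race described above, assuming a worker that registers a SIGTERM handler to remove its scratch directory; the paths and handler here are illustrative, not QCEngine’s actual code:

          import shutil
          import signal
          import sys
          import tempfile

          # Illustrative worker-side scratch cleanup (not QCEngine's implementation)
          scratch = tempfile.mkdtemp(prefix="qcengine-scratch-")

          def cleanup(signum, frame):
              # Only runs if the process gets a chance to handle SIGTERM before
              # the grace period expires; SIGKILL (kill -9) can never be caught,
              # so the scratch directory is leaked in that case.
              shutil.rmtree(scratch, ignore_errors=True)
              sys.exit(0)

          signal.signal(signal.SIGTERM, cleanup)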

    • CC – Currently I’m seeing ~6 optimizations per grid point, so I suspect that the torsiondrives are close to finished. Just getting a few SCF convergence errors, which may be the “correct” output (not due to technical issues)

      • DD – Does this seem close to finished/is it on track?

      • CC – Yes.

      • JW – Do we decide when a grid point is complete, or does TorsionDrive handle this?

        • CC – TD handles this. But there are cases where a grid point won’t converge (systematically fails), and I’m not sure what happens in that case

        • DD – TD will get hung up on a bad grid point and won’t be able to generate the scan points on the far side of it. Then we’d have to construct results from the optimizations that DID complete (see the sketch below).

        • PB – I’m looking at the error logging for the dipeptide 2D torsiondrive dataset, and it seems to be all compute issues, not scientific issues
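        A rough sketch of salvaging the completed grid points, written against the qcportal client API of this era (~0.15); the record id is hypothetical and method names may differ in newer releases:

          import qcportal as ptl

          client = ptl.FractalClient()  # public QCArchive instance
          # Hypothetical torsiondrive record id, for illustration only
          td = client.query_procedures(id=["12345678"])[0]

          # Only grid points whose constrained optimizations finished appear
          # here, so stalled/failed points simply drop out of the result.
          energies = td.get_final_energies()
          print(f"{len(energies)} grid points completed")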

    • CC – Doing a research cycle on which molecules to include in the implicit solvation QM dataset. So I’m looking at taking a subset of the Sage set and will ping you when I have a candidate set prepared.

  • MT

    • I’ve slept poorly - Neighbor has a new dog.

    • Was out last Fri

    • Reported one of the two big OpenMM/YANK issues. The one I reported was tracked back to a bug in OpenMM 7.6. It was clearly a bug once I had a reproducing case, but getting there was tough: the OpenMM nightlies are distributed on omnia, and the issue was CUDA-specific, so I had to jerry-rig a CUDA setup on my local NVIDIA/Linux box, which was a lot of work. This should be resolved in 7.7; the fix was a really deep single-line change in the OpenMM CUDA code.

    • Worked with the developer of the mendeleev package to get an equality operator for elements. Done quickly; the PR is merged, but there’s no release yet.

    • Looked at Molecule.from_openeye to see how it could be sped up. Still investigating; will provide details later. Switching from openmm.unit to openff.units may be causing a performance regression: for example, roughly half of the time is spent in setters, ensuring that the quantity being set is in units compatible with what’s expected. (Sketch below.)
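      For context, a minimal sketch (not the toolkit’s actual code) of the per-set unit-compatibility check pattern that can dominate such profiles; the Atom class here is hypothetical:

        from openff.units import unit

        class Atom:
            """Hypothetical attribute holder illustrating per-set unit checks."""

            @property
            def mass(self):
                return self._mass

            @mass.setter
            def mass(self, value):
                # This compatibility check is cheap once, but adds up when it
                # runs for every attribute of every atom in a large molecule.
                if not value.is_compatible_with(unit.dalton):
                    raise ValueError(f"Expected a mass, got {value.units}")
                self._mass = value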

    • Worked with Meghan Osato from Mobley lab to test out biopolymer infrastructure and interchange. Got great user feedback, will continue that interaction into this week.

    • Worked on the endianness issue that LWang had identified; reported and fixed on both Interchange and the toolkit.

      • JW – The issue is only triggered if someone serialized an OFFMol containing an array on a non-PowerPC machine and then tried to deserialize it on PowerPC, right?

      • MT – Yes
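      A generic illustration of the failure mode and the usual fix of pinning an explicit byte order in the dtype instead of relying on native order (this is not the toolkit’s serialization code):

        import numpy as np

        coords = np.arange(6, dtype="<f8")  # '<f8': explicitly little-endian float64
        blob = coords.tobytes()

        # Reading back with the same explicit dtype round-trips on any host;
        # using the native-order "float64" would misread these bytes on a
        # big-endian (e.g. some PowerPC) machine.
        restored = np.frombuffer(blob, dtype="<f8")
        assert np.array_equal(coords, restored)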

    • MT – Discussions/planning with JW on SMIRNOFF spec/Topology jurisdiction; the split makes it hard to know how to implement things like charge_from_molecules in a way that’s consistent and follows a spec.

    • MT – Worked with VU last week and got a mixed system with OpenFF solvent and an OPLS silica nanoparticle. Still makes funny noises, but it mostly grompps.

  • DD

    • QCArchive

      • we are still hitting submission issues, even locally. Killing a long-running submission from the local host showed it hung up on deserialization in qcsubmit, which is very odd. Characterizing and addressing this today.

        • worked with Ben last week to understand gateway timeouts; we found a workaround, though given the above it’s not yet sufficient

        • PB – Thanks for pushing on these datasets!

      • last week reviewed and merged all OpenMM sets from Pavan

      • locally computed Chapin’s dipeptide set to understand errors; Chapin also ran it locally

        • scaled up on PRP and UCSD resources with Chapin

      • continue to work with Ben as a feedback loop on QCFractal refactor

        • also working with him to plan next-gen hardware/network deployment at VT. It’s clear that we’ll hit a storage bottleneck soon, and possibly a general compute/throughput bottleneck as well.

        • Helping BP contact appropriate people at VT to look into provisioning machines/contacting folks with existing vendor relations.

        • JW – If we had a pie chart of how much of QCArchive’s storage is OpenFF, then I think this would make it easy to justify us helping with a storage expansion.

        • DD – Could also provide more cores/better throughput, reduce flakiness of access.

    • PLBenchmarks

      • was aiming to revisit diagram and draft narrative doc for architecture, but ran out of time

      • performed some research cycles on specific component options for REST APIs, backend storage, etc. Trying to define the “zoo” of entities that gets passed around; hoping to talk with AWS folks to get their feedback on our specific architecture/plans/needs.

    • Partner Benchmark

      • no updates at this time

  • PB

    • Resubmitting OpenMM datasets with the wcombine keyword, and also the multi-component ones

    • Some WBO work looking at biphenyls in ChEMBL 29, comparing WBOs from AM1 and XTB at the QM conformer and at toolkit-generated conformers.

      • JW – I really liked the data you showed in last week’s ff-release call, where including more protonation states led to a much larger range of WBOs.

      • PB – I’m going to be looking at the QM datasets we have for the existing FFs, and determine whether they have reasonable protonation states.

      • JW – If we were really deficient in charged species in our FF training sets, then I’d expect benchmarking to have shown worse performance for charged molecules. So I’ll ask about this in the benchmarking channel on Slack.

      • PB – I’ve looked at the gen2 training sets so far, will also look into industry benchmark sets.

    • PB – Did anyone optimize improper torsions before? Looking for advice on how to help JMaat start a dataset for improper fits.

      • CC – SB did improper torsiondrives for amides.

      • PB – Also saw them in the aniline dataset.

      • PB – Also not sure whether qcsubmit can do everything we want for improper grid optimization. Will bespokefit optimize impropers?

        • DD – Not sure. Could open an issue on bespokefit or ask on Slack.

  • JW –

    • Getting great feedback from Meghan Osato (Mobley lab) on biopolymer infrastructure and interchange

    • Fixed OE stereochem bug and error with default partial_bondorder_model (OFFTK #1150-1153)

    • Interviewed project manager candidates; the hire will hopefully come online in January/early 2022.

    • Worked on a slightly different way to do Topology.to_smiles; not sure whether we’ll also need Topology.from_mapped_smiles (see the sketch below)
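      For reference, the existing Molecule-level mapped-SMILES round trip that a hypothetical Topology.from_mapped_smiles would presumably build on (API as of openff-toolkit ~0.10):

        from openff.toolkit.topology import Molecule

        # Every atom, including hydrogens, carries an explicit map index.
        ethanol = Molecule.from_mapped_smiles(
            "[C:1]([H:4])([H:5])([H:6])[C:2]([H:7])([H:8])[O:3][H:9]"
        )
        print(ethanol.to_smiles(mapped=True))  # map indices survive the round trip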


Action items

Decisions