Individual updates | CC
Had a family emergency last week, worked ~half time.
Mostly worked on resolving TSCC manager issues. The failures were due to using up scratch space. Will use shorter runtimes in the future to see if that also helps.
DD – More details?
CC – It’s showing up as an IOError from Psi4 that says it’s unable to write files.
DD – If a manager is killed “unceremoniously” then it won’t do the cleanup, but normally QCEngine will do its own cleanup.
CC – I’m seeing ~200GB of scratch space used from the way that I’m running it.
DD – We’ve seen the same thing on lilac. If a manager is running a long-running calculation then it won’t clean up unless the last thing in the process pool has finished. But if SIGTERM or SIGKILL is received, then it doesn’t actually do the cleanup in the ~30 seconds that are allotted.
CC – Most of my calculations have run to their full walltime.
DD – I’d like to chat more about this - Working on QCF (specifically the below PR) to make this cleaner
https://github.com/MolSSI/QCFractal/pull/700
DD – Do you have a workaround for now?
PB – I had a similar issue when running lots of managers on the same node: each one uses ~40-50GB of scratch, and together they exceed the available space. It would be good to ask the sysadmins if there’s another, larger scratch space available.
CC – I can ask that. I’ll also reduce the number of tasks per worker - Currently doing 8 tasks per worker - That may lead to too much I/O and have the tasks slow each other down.
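The cleanup behavior discussed above (scratch left behind when a manager is terminated) can be illustrated with a minimal Python sketch. This is not how QCFractal or QCEngine implement cleanup. The `ScratchDir` class and its `scratch_` prefix are hypothetical names invented for this illustration. It assumes a single process running in the main thread, and it only covers SIGTERM: SIGKILL cannot be caught, so nothing in the killed process can clean up after it.

```python
import os
import shutil
import signal
import tempfile


class ScratchDir:
    """Hypothetical helper: a scratch directory that is removed on normal
    exit, on exceptions, and when the process receives SIGTERM."""

    def __init__(self, parent=None):
        # parent=None lets tempfile pick the default temp location.
        self.path = tempfile.mkdtemp(prefix="scratch_", dir=parent)
        self._previous_handler = None

    def _on_sigterm(self, signum, frame):
        # Turn SIGTERM into SystemExit so that __exit__ runs and the
        # directory is removed before the process goes away.
        raise SystemExit(128 + signum)

    def __enter__(self):
        # signal.signal may only be called from the main thread.
        self._previous_handler = signal.signal(signal.SIGTERM, self._on_sigterm)
        return self.path

    def __exit__(self, exc_type, exc, tb):
        shutil.rmtree(self.path, ignore_errors=True)
        # Restore whatever handler was installed before we took over.
        if self._previous_handler is not None:
            signal.signal(signal.SIGTERM, self._previous_handler)
        return False  # do not swallow the exception
```

Wrapping a calculation in `with ScratchDir() as path:` removes `path` when the block ends for any of these reasons. Removing a very large scratch tree can still take longer than the grace period a scheduler allows after SIGTERM, and this sketch does nothing about child processes that are still writing files, which is the harder case raised in the discussion.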
CC – Currently I’m seeing ~6 optimizations per grid point, so I suspect that the torsiondrives are close to finished. Just getting a few SCF convergence errors, which may be the “correct” output (not due to technical issues).
DD – Does this seem close to finished/is it on track?
CC – Yes.
JW – Do we decide when a grid point is complete, or does TorsionDrive handle this?
CC – TD handles this. But there are cases where a grid point won’t converge (systematically fails), and I’m not sure what happens in that case.
DD – TD will get hung up on bad data points and not be able to generate the scan points on the far side of them. Then we’d have to construct results from the optimizations that DID complete.
PB – I’m looking at the error logging for the dipeptide 2D torsiondrive dataset, and it seems to be all compute issues, not scientific issues.
CC – Doing a research cycle on which molecules to include in the implicit solvation QM dataset. So I’m looking at taking a subset of the Sage set and will ping you when I have a candidate set prepared.
MT
I’ve slept poorly - Neighbor has a new dog. Was out last Fri.
Reported one of the two big OpenMM/YANK issues. The one I reported was tracked back to a bug in OpenMM 7.6. It was clearly a bug once I was able to get a reproducing case, but that was tough: the OpenMM nightlies are distributed on omnia and the issue was due to CUDA, so I had to jerry-rig a setup with CUDA on my local NVIDIA/Linux box, which was a lot of work. This should be resolved in 7.7 - The solution was a really deep single-line change in the OpenMM CUDA code.
Worked with the developer of the mendeleev package to get an equality operator for elements. Was done quickly - PR is merged, but no release yet.
Looked at Molecule.from_openeye to see how it could be sped up. Still looking, will provide details in the future.
Switching from openmm.unit to openff.unit may be causing a performance regression. For example, roughly half of the time is spent in setters, ensuring that the quantity being set is in units compatible with what’s expected.
Worked with Meghan Osato from Mobley lab to test out biopolymer infrastructure and interchange. Got great user feedback, will continue that interaction into this week.
Worked on endianness issue that LWang had identified - Both reported/fixed on interchange+toolkit.
MT – Discussions/planning with JW on SMIRNOFF spec/Topology jurisdiction. The unclear split makes it hard to know how to implement stuff like charge_from_molecules in a way that’s consistent/follows a spec.
MT – Worked with VU last week, and got a mixed system with OpenFF solvent and an OPLS silica nanoparticle. Still makes funny noises, but it mostly grompps.
DD
QCArchive
we are still hitting submission issues, even locally. Noticed that, on killing a long-running submission from local host, it appeared to be hung up on deserialization in qcsubmit, which is very odd. Characterizing and addressing this today.
worked with Ben last week to understand gateway timeouts; we found a workaround, though given the above it is not yet sufficient
PB – Thanks for pushing on these datasets!
last week reviewed and merged all OpenMM sets from Pavan
locally computed Chapin's dipeptide set to understand errors; Chapin also ran it locally
continue to work with Ben as a feedback loop on QCFractal refactor
also working with him to work out next-gen hardware/network deployment at VT - It’s clear that we’ll hit a storage bottleneck soon, also possibly just a general compute/throughput issue. Helping BP contact appropriate people at VT to look into provisioning machines/contacting folks with existing vendor relations.
JW – If we had a pie chart of how much of QCArchive’s storage is OpenFF, then I think this would make it easy to justify us helping with a storage expansion.
DD – Could also provide more cores/better throughput, reduce flakiness of access.
PLBenchmarks
was aiming to revisit diagram and draft narrative doc for architecture, but ran out of time
performed some research cycles on specific component options for REST APIs, backend storage, etc. Trying to define the “zoo” of entities that get passed around, hoping to talk with AWS folks to get their feedback on our specific architecture/plans/needs.
Partner Benchmark
PB
Resubmitting OpenMM datasets with the wcombine keyword, and also the multi-component ones.
Some WBO work looking at biphenyls in ChEMBL-29 and WBOs from AM1, XTB at QM conformers and TK-generated conformers.
JW – I really liked the data you showed in last week’s ff-release call, where including more protonation states led to a much larger range of WBOs.
PB – I’m going to be looking at the QM datasets we have for the existing FFs, and determine whether they have reasonable protonation states.
JW – If we were really deficient in charged species in our FF training sets, then I’d expect that benchmarking would have shown worse performance for charged molecules. So I’ll ask about this in the benchmarking channel on Slack.
PB – I’ve looked at the gen2 training sets so far, will also look into industry benchmark sets.
PB – Did anyone optimize improper torsions before?
JW – Getting great feedback from Meghan Osato (Mobley lab) on biopolymer infrastructure and interchange.
Fixed OE stereochem bug and error with default partial_bondorder_model (OFFTK #1150-1153).
Interviewed project manager candidates - Hire will hopefully come online in Jan/early 2022.
Worked on a slightly different way to do Topology.to_smiles, not sure if we also will need Topology.from_mapped_smiles.
|