QCArchive status | BP – Freed up a ton of space: removed lots of wavefunctions and moved the remaining wavefunctions to spinning disk. The SSD is now down to 52% disk utilization. The datasets themselves (as in collections) are still present and visible from the client, but the associated records are gone.
BP – It would be good to adopt a policy of asking for wavefunctions sparingly.
BP – We’d been looking at funding sources for a new server, but due to confusion about grant timing we didn’t submit in time.
BP – Still looking into a new server, since as a policy we want to keep plenty of free space and not hit >90% utilization.
|
Refactor status | BP – How do you guys want the refactor to be put in front of you?
JW – Is there room to have both the new and old DB in place at the same time?
BP – Not on the prod server, no.
BP – Could put it on the “molssi10” host.
DD – If we want to use the new host for the deployment, maybe we could move production QCA to “molssi10”, then http://qcarchive.molssi.org could redirect to that, and the migrated DB could live on the “real” server and be accessed through a different address.
BP – That could work. The current DB has /v1 in all the API calls, whereas the new one will have /v2, so I could use that to determine redirects.
DD – We’ll want to coordinate a little bit on when we’d want to switch over writes from one to the other.
DD – In the meantime, is there a test host that maybe doesn’t have all the data, but that folks on this call could hammer on for an hour or two? Do the folks on this call have time for that? We’d like to smooth out any pain points before it becomes the only choice.
SB – I haven’t closely followed the refactor, but could you give a high-level summary of it? So I understand what to test out.
BP – Basically, “everything has changed”. So that’s many of the function names, and many aspects of dataset navigation were made more intuitive. But the concepts are the same.
SB – So, for example, metadata was getting put into a big JSON blob for whole datasets. Is that getting broken up more? There were also some pain points in downloading large datasets. Happy to test out how those are changed in the refactor.
BP – Absolutely. The DB no longer depends on large JSON blobs. It’s now broken out into tables and structured like a much more traditional relational database. As you interact with a dataset in qcportal, it talks to the DB more often but in smaller pieces; no passing around fat JSON blobs.
DD – So calling compute now no longer requires shunting the entire dataset back and forth between client and server. Now it all happens on the server.
SB – So could we have access to a test server with some of the datasets (especially the chunkier ones that have been hard to pull down), perhaps via a partial migration? Then we could try submitting some fat sets.
BP – The migration script is really slow; we could make a few new datasets and that might be better.
SB – Either migrating or making a new dataset is OK by me. I just want to be able to test large access operations.
BP – You can also invalidate records and cancel records, so the control surface for failures or bad calculations is richer.
SB – I know there have been chats about internalizing error cycling; is that included?
BP – I do have the place in the code that would allow for automating restarts of failed calculations. Also implemented compute history, where we hold on to previous errors; that’s necessary for restart logic that needs to see failure counts and types of failures.
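(Sketch for reference: a minimal, hedged example of what this richer record-level control surface might look like from the client side. The method names below are assumptions based on this discussion, not confirmed API; the server address is a placeholder.)

```python
# Hypothetical sketch of record-level management in the refactored qcportal.
# Method names/signatures are assumptions based on the meeting discussion.
from qcportal import PortalClient

client = PortalClient("https://qcarchive.example.org")  # placeholder address

# Cancel records that are still waiting/running, or invalidate completed ones
# that turned out to be bad calculations.
client.cancel_records([101, 102])
client.invalidate_records([205])

# Restart errored records; the stored compute history (previous errors,
# failure counts) is what automated error cycling would need to inspect.
client.reset_records([307, 308])

rec = client.get_records([307])[0]
print(rec.status)
```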
TG – Will we be able to query what kinds of tasks feature which compute tags?
BP – Not yet; that will come when we implement more permissions. E.g., we don’t want someone from outside to be able to assign their dataset the openff tag and take over your compute.
TG – For me, I’m interested in separating out resource requirements based on tag, and implementing some sort of logic for managers to know which tag they should be on.
BP – Tasks are now pretty hidden; everything goes through records. Restarts are done on records, cancels are done on records. If you pull down a record you can inspect the task.
TG – And the tasks still have a baseresult/baserecord ID?
DD – WRT “result” vs “record” and other name changes, BP and I discussed which terms we use currently, and how to standardize them in a meaningful way.
BP – ... No more “collections”, only “datasets” now. All have “specifications”, “entries”, and “records”.
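(Sketch for reference: a minimal example of how the dataset/specification/entry/record nomenclature might look when navigating a refactored dataset in qcportal. Attribute and method names are assumptions based on the discussion; the dataset name and server address are placeholders.)

```python
# Sketch of navigating a refactored dataset: a dataset holds specifications,
# entries, and records. Names here are assumptions, not confirmed API.
from qcportal import PortalClient

client = PortalClient("https://qcarchive.example.org")  # placeholder address

ds = client.get_dataset("optimization", "Example Optimization Dataset")

print(ds.specification_names)   # the "specifications" (method/basis/program)
print(ds.entry_names[:5])       # the "entries" (input molecules)

# A record is the calculation attached to one (entry, specification) pair.
rec = ds.get_record(ds.entry_names[0], ds.specification_names[0])
print(rec.status)

# Submission happens server-side; no shuttling the whole dataset to the client.
ds.submit()
```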
SB – A use case that we might like is the ability to “cherry pick” the records in one or more datasets into a new dataset, to use it as a record of the data that was actually used. So we could, at the end, take a bunch of records from different datasets and have a final “here’s what we actually used for this process”.
TG – I kind of do this already; formal support would be great.
DD – BP, we could look into the type of backend that would support this user story. SB and TG, could you provide more details on specifically what you’d like here?
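(Sketch for reference: this is not an existing feature; the following only illustrates the requested workflow under assumed method names and placeholder IDs, roughly how one might hand-roll the cherry-picking today.)

```python
# Hypothetical sketch of "cherry picking" records into a new provenance dataset.
# NOT current functionality; method names and IDs are illustrative assumptions.
from qcportal import PortalClient

client = PortalClient("https://qcarchive.example.org")  # placeholder address

# IDs of records actually used downstream, gathered from several datasets.
used_record_ids = [111, 222, 333]  # illustrative values

records = client.get_records(used_record_ids)  # assumed to be singlepoint records

# Create a new dataset documenting exactly what was used.
final_ds = client.add_dataset("singlepoint", "Project X - records actually used")

for rec in records:
    # Assumed add_entry signature; real signatures vary by dataset type.
    final_ds.add_entry(name=f"record-{rec.id}", molecule=rec.molecule)
```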
|
Infrastructure needs/advances | DD – New optimization schema with SCF properties evaluated at the optimized geometry: one extra step of doing a single point at the final geometry, at acceptable cost. Any thoughts?
SB – This is something I’ve been interested in doing before. E.g., with my recent dataset where I’m computing wavefunctions, I only want wavefunctions and Wiberg bond orders for the last step. So I’d like to be able to do something like “optimize, then do a special step at the end”.
BP – This seems like more of a QCEngine thing.
DD – I could take this on: basically having QCEngine perform an optimization, then dropping the information from the intermediate steps. Would this require any change in QCF?
BP – That’s pretty much how the protocols work now. The OptimizationResult pydantic model takes info from the whole trajectory, then knows whether it should keep the info from all the steps, or just the final step/just a few steps. Could …, or add a mode where it only gets stored at the last step, or have QCF/QCEngine know which per-step data can be dropped.
DD – Are wavefunctions made anyway at each step by psi4?
DD – Are there other things that should be kept?
SB – Wiberg bond orders, Löwdin indices, MBIS charges.
TG – There are some datasets where MBIS charges fail, and I have to tiptoe around the successes there.
BP – I did lots of expensive operations in grad school, so I feel this.
DD + SB – Would be good to have native support for “use a cheap method for an optimization, then once that’s converged, use an expensive method”.
BP – That seems like it would fit most neatly in QCEngine. So like a process where it does an expensive single point at the final step. It would fit in QCEngine better than QCFractal.
DD – I think we have enough information to proceed with a PR to QCEngine. I’ll be the primary driver but will loop in SB and JH for feedback. Importantly, I’ll look into whether it can be handled entirely in QCEngine, or if it will need special support in QCF. (See the sketch after this item.)
SB – Let me know if I can help out. This would be a huge win for bespokefit. And if it can fit into existing data models, that would be great.
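(Sketch for reference: one way the “cheap optimization, then an expensive single point with extra properties at the final geometry” idea can be wired up today as two separate QCEngine calls, before any native support exists. The molecule, methods, and bases are placeholders, not what any dataset actually uses.)

```python
# Sketch: run a geometry optimization, then a single point at the final
# geometry with a more expensive model. Methods/bases are placeholders.
import qcengine
from qcelemental.models import AtomicInput, Molecule, OptimizationInput
from qcelemental.models.procedures import QCInputSpecification

# Placeholder water geometry (angstrom).
mol = Molecule.from_data(
    """
    O  0.000000  0.000000  0.117790
    H  0.000000  0.755453 -0.471161
    H  0.000000 -0.755453 -0.471161
    """
)

opt_input = OptimizationInput(
    initial_molecule=mol,
    input_specification=QCInputSpecification(
        driver="gradient",
        model={"method": "b3lyp", "basis": "def2-svp"},  # "cheap" level
    ),
    keywords={"program": "psi4"},
)

opt_result = qcengine.compute_procedure(opt_input, "geometric", raise_error=True)

# Expensive single point, with the extra properties we actually want,
# evaluated only at the optimized geometry.
sp_input = AtomicInput(
    molecule=opt_result.final_molecule,
    driver="energy",
    model={"method": "wb97m-v", "basis": "def2-tzvp"},  # "expensive" level
    keywords={"scf_properties": ["wiberg_lowdin_indices", "mbis_charges"]},
)

sp_result = qcengine.compute(sp_input, "psi4", raise_error=True)
print(sp_result.return_result)
```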
New default policy for optimization trajectory storage?
DD – How often do we really use intermediate geometries/properties in an optimization trajectory? Or should we switch our default over to just storing the first and the last?
TG – THuefner and someone from the Chodera lab use intermediate geometries/properties from optimization trajectories.
SB – Yes, JC and Yuanqing, and some folks from the Cole lab experimented with using intermediate geometries.
JH – Yeah, plenty of experimentation going on there.
TG – I found that the last ~50% of optimization trajectories aren’t very informative.
DD – Maybe we should change the default in QCSubmit from “full” to “first and last”, and let people specify “full” if they really want the whole trajectory.
TG – My personal opinion is not to call it “default”, since the spec is used BOTH during submission and during retrieval, and this would break the meaning of “default” in retrieval operations.
DD – I’ll need to check where in the process this “default” term is used, to determine whether TG’s issue would affect existing calls.
TG – It wouldn’t be a huge deal, but it’d be good to be deliberate.
DD – I’ll make a PR on QCSubmit to propose this change and we can continue the discussion there. (Can also handle other suggested changes there.)
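(Sketch for reference: the “full” vs “first and last” storage options discussed here appear to map onto QCElemental’s optimization trajectory protocol. A minimal sketch, assuming the standard enum values; double-check against the installed QCElemental version.)

```python
# Sketch of the trajectory-storage protocol a QCSubmit default would map onto.
# Enum values assumed from QCElemental's OptimizationProtocols.
from qcelemental.models.procedures import OptimizationProtocols

keep_everything = OptimizationProtocols(trajectory="all")
keep_endpoints = OptimizationProtocols(trajectory="initial_and_final")
keep_final_only = OptimizationProtocols(trajectory="final")

print(keep_everything.trajectory, keep_endpoints.trajectory, keep_final_only.trajectory)
```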
New QCEngine release with QCEngine#351
DD – I’ll push for a release with this change, and will coordinate with LBurns in case there are other things that need to get in.
SB – That’s great. If the release is out by the end of the week then we can incorporate it into bespokefit; if not, it’s not a huge deal.
SB – h5py bug on QCFractal.
JW – I’ll try to take this on, but no promises about how far I can progress it, since it’s in someone else’s feedstock.
SB – There’s a workaround where we can pin h5py, but I’d rather have it become an optional dep.
BP – The upcoming version doesn’t have a hard dep on hdf5.
|
Throughput status | |
New submissions | |
User questions | |
Science support needs | |