
Participants

Goals

  • User questions/issues, new submissions

    • Hanging INCOMPLETEs on OpenFF Sandbox CHO PhAlkEthOH v1.0 OptimizationDataset

    • Large datasets choke REST API: https://github.com/openforcefield/qca-dataset-submission/pull/208#issuecomment-867293853

    • Torsion drive of propane hangs with Snowflake

    • “Experimental” / “study” submissions (Trevor, Simon)

    • Update to a dataset -

  • Science support needs

  • Infrastructure needs / advances

Discussion topics


Hanging INCOMPLETEs on OpenFF Sandbox CHO PhAlkEthOH v1.0 OptimizationDataset

  • BP – Could you send some IDs from the failures in this set?

  • DD –

    [OptimizationRecord(id='32693660', status='INCOMPLETE'),
     OptimizationRecord(id='32693661', status='INCOMPLETE'),
     OptimizationRecord(id='32693662', status='INCOMPLETE'),
     OptimizationRecord(id='32693878', status='INCOMPLETE'),
     OptimizationRecord(id='32693879', status='INCOMPLETE'),
     OptimizationRecord(id='32693880', status='INCOMPLETE'),
     OptimizationRecord(id='32693891', status='INCOMPLETE'),
     OptimizationRecord(id='32693892', status='INCOMPLETE'),
     OptimizationRecord(id='32693896', status='INCOMPLETE'),
     OptimizationRecord(id='32693897', status='INCOMPLETE'),
     OptimizationRecord(id='32693898', status='INCOMPLETE'),
     OptimizationRecord(id='32693906', status='INCOMPLETE'),
     OptimizationRecord(id='32694805', status='INCOMPLETE'),
     OptimizationRecord(id='32694806', status='INCOMPLETE'),
     OptimizationRecord(id='32694807', status='INCOMPLETE'),
     OptimizationRecord(id='32694932', status='INCOMPLETE'),
     OptimizationRecord(id='32694933', status='INCOMPLETE'),
     OptimizationRecord(id='32694934', status='INCOMPLETE'),
     OptimizationRecord(id='32701335', status='INCOMPLETE'),
     OptimizationRecord(id='32701336', status='INCOMPLETE'),
     OptimizationRecord(id='32703639', status='INCOMPLETE'),
     OptimizationRecord(id='32703640', status='INCOMPLETE'),
     OptimizationRecord(id='32703641', status='INCOMPLETE')]
  • BP – These look complete from my end. The root cause could be a known bug that I thought we fixed. These were created/submitted AFTER we fixed the bug, though.

    • (General) – The status here shows INCOMPLETE, but these records have final molecules associated with them (see the query sketch after this list).

      • BP – The three I’ve grabbed here are all from the same manager. They’re in the task queue as “running”.

    • DD – Could we manually flip these to complete for now?

    • DD + BP will continue discussion offline
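
To spot-check records like the ones above, here is a minimal query sketch using the legacy qcportal FractalClient. This is an illustration, not the exact commands used in the meeting; FractalClient() defaults to the public QCArchive server, and the IDs are the first three from DD's list.

    # Sketch: confirm the status/final-molecule inconsistency discussed above.
    from qcportal import FractalClient

    client = FractalClient()  # public QCArchive server by default

    # Pass status=None so records that are not COMPLETE are not filtered out.
    records = client.query_procedures(
        id=["32693660", "32693661", "32693662"], status=None
    )

    for record in records:
        # An INCOMPLETE status alongside a stored final molecule ID is the
        # inconsistency under discussion.
        print(record.id, record.status, record.final_molecule is not None)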

Large datasets choke REST API:


https://github.com/openforcefield/qca-dataset-submission/pull/208

  • DD – We recently tried to do a large expansion of a dataset. When we tried to add the MM compute spec, the metadata was too large for the upload.

  • BP – Ultimately you’ll want an endpoint that looks like /collections/#/entries. I may be able to increase that limit without increasing the manager upload size limit (which IS necessary). The current upload limit is 100MB for this, and 500MB for manager uploads. So I’ll bump the 100MB limit to 250MB and we can see if that fixes it.

Torsiondrive of propane hangs with Snowflake

  • SB – The torsiondrive of propane just hangs, with no real way to push it forward. Ethane and butane work fine.

    • BP: If you can drop it into Slack, I can try to reproduce and introspect

  • DD – I’m hoping that we can let you run this through geopt in the future. Should I prioritize that?

    • SB – I haven’t had too much trouble using QCEngine directly, so this isn’t urgent. Would it make sense to make a torsiondrive procedure in qcengine instead of putting it at the geopt layer?

    • BP – I’ve thought of this before – It wouldn’t be too much of a problem in QCEngine, but it could get really complex if we try to mirror that in QCFractal

    • DD – I’d thought about this too – It would simplify things by putting more of the process parallelism under one roof. So the thought would be that this could be a qcengine compute_procedure, but this looked a bit complex to me.

    • SB – I could see how QCEngine may not want to have too much nested under it. But generally having a lighter-weight endpoint for running torsiondrives, either in geopt or qcengine, would be a big help.

    • BP – Agree. I’ve had trouble telling new users how to just run a single torsiondrive. With respect to implementing this, I don’t think it’d be too hard to just call .procedure and have the torsiondrive down there.

    • SB – Would it be helpful for me to open a PR to QCEngine for this functionality?

      • BP – Yes, I’d love to take a look at either an issue or PR on this.

    • DD + BP – This would be a compute.py → compute_procedure(<input_data with some input specific to torsiondrive>, "geometric") call (see the sketch after this list)

      • BP – For more details on what input_data should look like in this case, see QCElemental#264

    • JW: I can see why we chose to build things out in openff-gopt, since some decisions within QCEngine around how input structures should be specified may take longer to resolve

    • BP: #264 is something I’d like to resolve in the medium term; it’s particularly relevant to QCFractal

    • SB: We’ve hit those same object-design issues of settings layered upon settings, and of keeping that from becoming a mess
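
For context on the compute_procedure idea above, here is a minimal sketch of how QCEngine's existing procedure entry point is called today for a single geomeTRIC optimization (assumes Psi4 and geomeTRIC are installed; the molecule, method, and basis are placeholders). A torsiondrive procedure, if added, would presumably go through the same call, e.g. qcengine.compute_procedure(td_input, "torsiondrive"), but that procedure name and its input model (QCElemental#264) are still hypothetical here.

    # Sketch of the current QCEngine procedure entry point.
    import qcengine
    from qcelemental.models import Molecule, OptimizationInput

    molecule = Molecule(symbols=["H", "H"], geometry=[[0.0, 0.0, 0.0], [0.0, 0.0, 1.4]])

    opt_input = OptimizationInput(
        initial_molecule=molecule,
        input_specification={
            "driver": "gradient",
            "model": {"method": "hf", "basis": "sto-3g"},
        },
        # geomeTRIC reads the underlying QC program from the keywords.
        keywords={"program": "psi4"},
    )

    result = qcengine.compute_procedure(opt_input, "geometric", raise_error=True)
    print(result.final_molecule.geometry)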

“Experimental” / “study” submissions (Trevor, Simon)

  • TG: There’s a dataset that is used in the refit; there are issues around how we name things and what the naming is intended to convey

    • trying to make a distinction between an “experiment” and a “study” dataset

      • an experiment is permissive; you can do whatever you want or need to do; exploratory

      • a study is intended for consumption by others and needs to conform to certain expectations of its consumers

  • SB – The issue with naming came from my uncertainty in trying to follow this. It seems like this almost needs two version numbers, but I do appreciate that you’re driving standards at all, and I appreciate how complex this is.

    • TG: Changing the grid is subtle and makes it difficult to compare across specs

  • SB: question for David and Ben: are there constraints on length and characters for spec names?

    • BP: I don’t believe so; it all just gets dumped into a JSON blob in the DB

    • SB: should we generally do spec names as method/basis-<other-keyword-choices>?

    • TG – Combinatorially, “what are all the settings that could make a compute spec unique?” is a really hard question. I think the root issue is that dataset names and spec names aren’t validated. But I think we’d benefit from defining a convention within OFF

    • SB – I think capturing method and basis is probably a good starting point. Then things like grid settings and PCM can also be included (see the naming sketch after this list).

    • TG – Agree, it does get complicated. Though this is a good starting point

    • SB – We may be able to standardize on some grid naming scheme like what Gaussian uses (and I think psi4 also copies some of it).

  • SB – I’m also interested to know the status of standards V3

    • DD – We’ve adopted it, but I’ve been a bottleneck on implementing support for it in our automation. One of our needs is that our system for dataset support+naming has support in QCFractal. I’m hoping to do a retrospective on how our automated submission has been going over its first year, and include support for standards V3 in the refactor that comes out of that.

    • SB – I’ve been copying and pasting from previous submissions whenever I make a new one, so having a clean template for submissions would help me avoid copy+paste errors

    • TG – We currently have an index of dataset names. Do we want to have a similar index for compute specs?

      • SB – That’s a great idea.

  • DD: One switch we can use for experimental datasets is setting their visibility to False on submission; that would make it unlikely for consumers to stumble upon them and start using them seriously

  • JW: my understanding is that we can always create a new collection based on something we found worked well, then point to that in a publication?

    • TG: I would call that a “study” at that point

  • TG – If we have an existing collection and plug in additional molecules and a different spec, will the old molecules be recomputed with the new spec?

    • BP – Yes.

    • (something complicated that made both TG and BP concerned)

    • (General) – Oh, we should never reuse the name of an existing specification or dataset to mean something different.

  • PB – DD, I’ve tagged you on a dataset validation error. Could you take a look?

    • DD – Yes
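
To make the proposed spec-naming convention a bit more concrete, here is a purely illustrative sketch. None of these names or settings are adopted standards; the method/basis values are just examples, and the keyword names are Psi4 grid options except where noted.

    # Hypothetical spec names following method/basis-<other-keyword-choices>:
    # whatever makes the spec unique (grid, implicit solvent, ...) is appended
    # to the name so it is visible without digging into the keywords.
    proposed_spec_names = {
        "b3lyp-d3bj/dzvp": {
            "method": "b3lyp-d3bj",
            "basis": "dzvp",
        },
        "b3lyp-d3bj/dzvp-finegrid": {
            "method": "b3lyp-d3bj",
            "basis": "dzvp",
            "keywords": {"dft_spherical_points": 590, "dft_radial_points": 99},
        },
        "b3lyp-d3bj/dzvp-pcm-water": {
            "method": "b3lyp-d3bj",
            "basis": "dzvp",
            "keywords": {"pcm": "water"},  # placeholder only, not a real program option
        },
    }

A table like this could live alongside the existing index of dataset names in qca-dataset-submission, per TG's suggestion above.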

Submission execution status

  • DD – We’d noticed that some of SB’s datasets (aniline 2D and additional QM specs) weren’t making progress. This was because they were all high priority, but the industry benchmark set was also high priority and was submitted first, so it was taking all available compute.

Science support needs

Infrastructure needs / advances

Action items

  • David Dotson will follow-up with Ben on INCOMPLETE Optimizations that appear finished, but have tasks in RUNNING state
  • Ben Pritchard will increase the limit for submissions from 100MB to 250MB
  • David Dotson will prioritize openff-gopt optimization and torsiondrive executors, given Simon Boothroyd's needs
  • David Dotson will review Pavan Behara's latest submission #215, assist with validation issues
  • David Dotson will create a tabular index on qca-dataset-submission of known specs and the names we would like to use for them consistently

Decisions
