
Participants

Goals

  • User questions/issues, new submissions

    • Hanging INCOMPLETEs on OpenFF Sandbox CHO PhAlkEthOH v1.0 OptimizationDataset

    • Large datasets choke REST API: https://github.com/openforcefield/qca-dataset-submission/pull/208#issuecomment-867293853

    • Torsiondrive of propane hangs with Snowflake

    • “Experimental” / “study” submissions (Trevor, Simon)

    • Update to a dataset -

  • Science support needs

  • Infrastructure needs / advances

Discussion topics

Item

Notes

Hanging INCOMPLETEs on OpenFF Sandbox CHO PhAlkEthOH v1.0 OptimizationDataset

  • BP – Could you send some IDs from the failures in this set?

  • DD –

    [OptimizationRecord(id='32693660', status='INCOMPLETE'),
     OptimizationRecord(id='32693661', status='INCOMPLETE'),
     OptimizationRecord(id='32693662', status='INCOMPLETE'),
     OptimizationRecord(id='32693878', status='INCOMPLETE'),
     OptimizationRecord(id='32693879', status='INCOMPLETE'),
     OptimizationRecord(id='32693880', status='INCOMPLETE'),
     OptimizationRecord(id='32693891', status='INCOMPLETE'),
     OptimizationRecord(id='32693892', status='INCOMPLETE'),
     OptimizationRecord(id='32693896', status='INCOMPLETE'),
     OptimizationRecord(id='32693897', status='INCOMPLETE'),
     OptimizationRecord(id='32693898', status='INCOMPLETE'),
     OptimizationRecord(id='32693906', status='INCOMPLETE'),
     OptimizationRecord(id='32694805', status='INCOMPLETE'),
     OptimizationRecord(id='32694806', status='INCOMPLETE'),
     OptimizationRecord(id='32694807', status='INCOMPLETE'),
     OptimizationRecord(id='32694932', status='INCOMPLETE'),
     OptimizationRecord(id='32694933', status='INCOMPLETE'),
     OptimizationRecord(id='32694934', status='INCOMPLETE'),
     OptimizationRecord(id='32701335', status='INCOMPLETE'),
     OptimizationRecord(id='32701336', status='INCOMPLETE'),
     OptimizationRecord(id='32703639', status='INCOMPLETE'),
     OptimizationRecord(id='32703640', status='INCOMPLETE'),
     OptimizationRecord(id='32703641', status='INCOMPLETE')]
  • BP – These look complete from my end. The root cause could be a known bug that I thought we fixed. These were created/submitted AFTER we fixed the bug, though.

    • (General) – The status here is showing incomplete, but these have final molecules associated with them.

      • BP – The three I’ve grabbed here are all from the same manager. They’re in the task queue as “running”.

    • DD – Could we manually flip these to complete for now?

    • DD + BP will continue discussion offline
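The symptom described above (status INCOMPLETE, yet a final molecule is attached) can be checked mechanically. The following is a minimal sketch only: `RecordStub` is a hypothetical stand-in, not the QCPortal record class, and the IDs are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RecordStub:
    """Hypothetical stand-in for an optimization record (not the QCPortal class)."""
    id: str
    status: str
    final_molecule: Optional[str] = None  # ID of the final molecule, if one exists


def suspicious_incompletes(records: List[RecordStub]) -> List[str]:
    """Return IDs of records marked INCOMPLETE that already have a final molecule."""
    return [
        r.id
        for r in records
        if r.status == "INCOMPLETE" and r.final_molecule is not None
    ]


# Made-up example records, for illustration only.
records = [
    RecordStub("a1", "INCOMPLETE", final_molecule="m1"),  # looks finished
    RecordStub("a2", "INCOMPLETE"),                       # genuinely unfinished
    RecordStub("a3", "COMPLETE", final_molecule="m3"),
]
flagged = suspicious_incompletes(records)
```

Records flagged this way would be the candidates for the manual status flip DD asked about.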

Large datasets choke REST API: https://github.com/openforcefield/qca-dataset-submission/pull/208

  • DD – We recently tried to do a large expansion of a dataset. When we tried to add the MM compute spec, the metadata was too large for the upload.

  • BP – You’ll want to be able to have an endpoint that looks like /collections/#/entries. I may be able to increase that limit without increasing the manager upload size limit (which IS necessary). The current upload limit is 100MB for this, and 500MB for manager uploads. So I’ll bump the 100MB limit to 250 and we can see if that gets fixed.
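As a client-side complement to raising the limit, a submitter could measure the serialized size of a payload before uploading and split the entries into batches that each fit. This is a rough sketch under assumptions: the limit is taken to be 100 MiB, the payload is assumed to be JSON, and `payload_size` and `split_entries` are hypothetical helpers, not part of QCFractal or QCPortal.

```python
import json

LIMIT_BYTES = 100 * 1024 * 1024  # assumed interpretation of the "100MB" limit


def payload_size(payload) -> int:
    """Size in bytes of the payload when serialized as UTF-8 JSON."""
    return len(json.dumps(payload).encode("utf-8"))


def split_entries(entries, limit_bytes=LIMIT_BYTES):
    """Greedily group entries into batches whose JSON size stays under the limit.

    An entry that is larger than the limit on its own still ends up in a
    batch by itself; the caller has to handle that case separately.
    """
    batches, current = [], []
    for entry in entries:
        # Close the current batch if adding this entry would exceed the limit.
        if current and payload_size(current + [entry]) > limit_bytes:
            batches.append(current)
            current = []
        current.append(entry)
    if current:
        batches.append(current)
    return batches


# Tiny illustration with an artificially small limit of 40 bytes.
entries = [{"name": f"mol-{i}"} for i in range(10)]
batches = split_entries(entries, limit_bytes=40)
```

Re-serializing the batch for every entry is quadratic, which is acceptable for a sketch but would need a running size estimate for very large datasets.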

Torsiondrive of propane hangs with Snowflake

  • SB – Torsiondrive of propane just hangs, with no real way to push it forward. Ethane and butane work fine.

    • BP: if you can drop it into Slack, I can try to reproduce and introspect

  • DD – I’m hoping that we can let you run this through geopt in the future. Should I prioritize that?

    • SB – I haven’t had too much trouble using QCEngine directly, so this isn’t urgent. Would it make sense to make a torsiondrive procedure in qcengine instead of putting it at the geopt layer?

    • BP – I’ve thought of this before – It wouldn’t be too much of a problem in QCEngine, but it could get really complex if we try to mirror that in QCFractal

    • DD – I’d thought about this too – It would simplify things by putting more of the process parallelism under one roof. So the thought would be that this could be a qcengine compute_procedure, but this looked a bit complex to me.

    • SB – I could see how QCEngine may not want to have too much nested under it. But generally having a lighter-weight endpoint for running torsiondrives, either in geopt or qcengine, would be a big help.

    • BP – Agree. I’ve had trouble answering new users who ask how to just run a single torsiondrive. With respect to implementing this, I don’t think it’d be too hard to just call .procedure and have the torsiondrive down there.

    • SB – Would it be helpful for me to open a PR to QCEngine for this functionality?

      • BP – Yes, I’d love to take a look at either an issue or PR on this.

    • DD + BP – This would be a compute.py → compute_procedure(<input_data with some input specific to torsiondrive>, "geometric") call

      • BP – For more details on what input_data should look like in this case, see QCElemental#264

    • JW: Can see why we chose to build things out in openff-gopt, if some decisions within QCEngine may take longer to resolve around how input structures should be specified

    • BP: #264 is something I’d like to resolve in the medium term, it is relevant to QCFractal in particular

    • SB: we’ve hit those same object issues of settings upon settings applied, keeping it from becoming a mess

“Experimental” / “study” submissions (Trevor, Simon)

  • TG: there’s a dataset that is used in the refit; there are issues about how we name things and the intention in the naming

    • trying to make distinction between “experiment” and a “study” dataset

      • experiment is permissive, can do whatever you want or need to do; exploratory

      • study is intended for consumption by others, needs to conform to certain expectations of consumers

  • SB – Issue with naming came from my uncertainty in trying to follow this. It seems like this almost needs two version numbers, but I do appreciate that you’re driving standards at all, and I appreciate how complex this is.

    • TG: changing the grid is subtle, makes it difficult to compare across specs

  • SB: question for David and Ben: are there constraints on length and characters for spec names?

    • BP: I don’t believe so; all just gets dumped into a JSON blob in the DB

    • SB: should we generally do spec names as method/basis-<other-keyword-choices>?

    • TG – Combinatorially, “what are all the settings that could make a compute spec unique?” is a really hard question. I think the root issue is that dataset names and spec names aren’t validated. But I think we’d benefit from defining a convention within OFF.

    • SB – I think capturing method and basis are probably a good starting point. Then things like grid settings and PCM can also be included.

    • TG – Agree, it does get complicated. Though this is a good starting point

    • SB – We may be able to standardize on some grid naming scheme like what Gaussian uses (and I think psi4 also copies some of it).

  • SB – Also interested to know about status of standards V3

    • DD – We’ve adopted it, but I’ve been a bottleneck on implementing support for it in our automation. One of our needs is that our system for dataset support+naming has support in QCFractal. I’m hoping to do a retrospective on how our automated submission has been going over the first year of its existence, and include support for standards V3 in the refactor that the retrospective kicks off.

    • SB – I had been copying+pasting from previous submissions whenever I make a new one. So having a clean template for submissions would help me avoid making copy+paste errors

    • TG – We currently have an index of dataset names. Do we want to have a similar index for compute specs?

      • SB – That’s a great idea.

  • DD: one switch we can use for experimental datasets is setting their visibility to False on submission; would make it unlikely for consumers to stumble upon and start using seriously

  • JW: my understanding is that we can always create a new collection based on something we found worked well, then point to that in a publication?

    • TG: would call that a “study” at that point

  • TG – If we have an existing collection and plug in additional molecules and a different spec, will the old molecules be recomputed with the new spec?

    • BP – Yes.

    • (something complicated that made both TG and BP concerned)

    • (General) – Oh, we should never reuse the name of an existing specification or dataset to mean something different.

  • PB – DD, I’ve tagged you on a dataset validation error. Could you take a look?

    • DD – Yes
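The spec-naming idea discussed above (method/basis, followed by other keyword choices such as grid settings or PCM) could be sketched as a small helper. This is an illustration only: `spec_name`, the `key=value` suffix format, and the example keyword values are assumptions, not an agreed OpenFF convention.

```python
from typing import Mapping, Optional


def spec_name(method: str, basis: str, extras: Optional[Mapping[str, str]] = None) -> str:
    """Build a spec name of the form method/basis-<other-keyword-choices>.

    Extra keywords are sorted by key so that the same settings always
    produce the same name, regardless of the order they were given in.
    """
    base = f"{method}/{basis}"
    if not extras:
        return base
    suffix = "-".join(f"{key}={value}" for key, value in sorted(extras.items()))
    return f"{base}-{suffix}"


# Hypothetical keyword values, for illustration only.
plain = spec_name("B3LYP-D3BJ", "DZVP")
detailed = spec_name("B3LYP-D3BJ", "DZVP", {"pcm": "water", "grid": "fine"})
```

Method names can themselves contain hyphens, so a name built this way cannot be parsed back unambiguously; a tabular index of known specs would still be needed for lookup.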

Submission execution status

  • DD – We’d noticed that some of SB’s datasets (aniline 2D and additional QM specs) weren’t making progress. This was because they were all high priority, but the industry benchmark set was also high priority and was submitted first, so it was taking all available compute.
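The starvation described above follows from ordering tasks by priority first and submission time second. The sketch below models that ordering with a heap. It assumes a strict priority-then-FIFO queue, which matches the outcome reported here but is not a confirmed description of the QCFractal scheduler; the task names and numbers are hypothetical.

```python
import heapq
from typing import List, Tuple


def run_order(tasks: List[Tuple[str, int, int]]) -> List[str]:
    """Return task names in the order a priority-then-FIFO queue would serve them.

    Each task is (name, priority, submitted_at). Higher priority runs first;
    among equal priorities, the earlier submission runs first.
    """
    # Negate priority because heapq pops the smallest tuple first.
    heap = [(-priority, submitted_at, name) for name, priority, submitted_at in tasks]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


# Hypothetical tasks: two high-priority benchmark tasks submitted early,
# one high-priority task submitted later, and one low-priority task.
tasks = [
    ("aniline-2d", 2, 10),
    ("benchmark-1", 2, 1),
    ("benchmark-2", 2, 2),
    ("low-priority", 1, 0),
]
order = run_order(tasks)
```

With equal priority, the later submission only starts once every earlier task at that priority has been served, which is the behavior observed for the aniline 2D datasets.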

Science support needs

Infrastructure needs / advances

Action items

  • David Dotson will follow up with Ben on INCOMPLETE Optimizations that appear finished but have tasks in RUNNING state
  • Ben Pritchard will increase the limit for submissions from 100MB to 250MB
  • David Dotson will prioritize openff-gopt optimization and torsiondrive executors, given Simon Boothroyd's needs
  • David Dotson will review Pavan Behara's latest submission #215 and assist with validation issues
  • David Dotson will create a tabular index on qca-dataset-submission for known specs and the names we would like to use for them consistently

Decisions
