2024-10-22 QCA dataset submission meeting

Participants

  • @Alexandra McIsaac

  • @Lily Wang

  • @Brent Westbrook

  • @Jeffrey Wagner

Discussion topics

Item

Notes

Compute management

  • JW – I think I’ve left responsibilities vague/undefined while going on/offline a lot over the past 2 weeks. I’d be happy with a range of outcomes, including me managing everything or Lily managing everything. The status quo seems to be me running old workers and Lily handling pyddx, which I’m also fine with, though it may waste two people’s attention.

    • LM – Was running my own workers since JW was offline.

    • LW – I think it could make sense for dataset owners to be ultimately responsible for the computation of their dataset, either by managing their own workers or by communicating with a central person.

      • JW – This sounds like a good approach. For future datasets, I’ll expect the submitter/owner to contact me if they want me to manage workers; otherwise I’ll assume they’ll run their own.

    • JW will take over Lipid MAPS worker management and continue running workers for the phosphate torsiondrive dataset

What happened to qc_record.extras?

  • JW/BW – https://github.com/openforcefield/openff-bespokefit/issues/369

    • BW: Seems like the “extras” check was added to handle local data, and “extras” doesn’t contain anything important for QCA datasets. We should probably keep the check so that it can handle local data, but I’m not worried that so many “extras” are missing. The odd thing is that QCA used to set “extras” to an empty dict, but now returns None.

    • JW: Bespokefit mostly uses local data, so we definitely don’t want to get rid of this

    • BW: Don’t want to get rid of extras check, but should put it behind a try/except or some other workaround in case it doesn’t exist

  • Severity:

    • General bespokefit usage - Unsure

    • General QCArchive/QCSubmit usage - Might encounter this data if loading any data from QCArchive (maybe data created before/after a certain time point)

    • Main-line force field fitting -

    • BW – Note that the line that looked at extras['id'] instead of the id attribute directly was just a hacky fix for local data access. It’s NOT scary that extras is empty/None for QC datasets, since it was always unpopulated.

    • LM – I found that some of my datasets had things in the extras field, but it was stuff like compute tags. I only encountered this in new datasets like torsion multiplicity (TorsionDrive, failed), but not sulfur dataset (Optimization, succeeded)

    • JW – So,

      • All old datasets are “good” with current bespokefit logic

      • New Opt datasets are “good” with current bespokefit logic

      • New TD datasets are “bad” with current bespokefit logic (?)

    • LW – Did QCArchive just update server version?

      • BW – I think so. Until recently I was getting server/client version mismatches with the latest client, but now I don’t get that with the latest client, and I queried the server yesterday and saw it was running 0.56

      • LM: But my dataset, which I downloaded yesterday, had empty dictionaries for most entries, but None for the torsion multiplicity datasets

      • BW: And this same dataset used to work

    • JW – Ok, so it seems like this is something that changed on QCA (but I will confirm next week with BP). Our remedy is to code around it, and BW is already working on a PR for this.

      • (General) – Agree

    • Implications for broader FF fitting?

      • LM – I think there are implications - In order to do FF fitting, we need to change bespokefit. This will affect our pipeline.

      • JW – Any chance this gives a silent numerical change instead of error?

      • BW + LM – No, this is a loud error, so the issue is a complete crash and not irreproducible/numerically changed results.

    • LW – Could you make a new release of bespokefit once this is patched?

      • JW – Yes, absolutely.

    • JW – Is this blocking FF fitting work?

      • BW + LM – Can just proceed with modified local code for now, but not best practice.

      • JW – I’ll review the PR ASAP and will make a release soon after.
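Sketch of the workaround discussed above (coding around “extras” being None, an empty dict, or populated). This is a minimal illustration under stated assumptions, not the actual bespokefit PR: `FakeRecord` and `get_record_id` are hypothetical names, and the record class is a stand-in rather than the real QCArchive record type. It assumes local data may carry an id in `extras["id"]`, while QCA records expose it on the `id` attribute.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FakeRecord:
    """Hypothetical stand-in for a QC record; NOT the real QCArchive class."""

    id: Optional[str] = None
    extras: Optional[dict] = None


def get_record_id(record: Any) -> Optional[str]:
    """Return a record id whether extras is None, empty, populated, or absent."""
    # getattr covers records without an extras attribute; `or {}` covers None.
    extras = getattr(record, "extras", None) or {}
    # Assumption: local data stores the id in extras, so prefer it when present.
    if "id" in extras:
        return extras["id"]
    # Otherwise fall back to the id attribute (None if that is missing too).
    return getattr(record, "id", None)


if __name__ == "__main__":
    # The three cases from the discussion: None, empty dict, populated extras.
    for rec in (
        FakeRecord(id="123", extras=None),
        FakeRecord(id="123", extras={}),
        FakeRecord(id="123", extras={"id": "local-7"}),
    ):
        print(rec.extras, "->", get_record_id(rec))
```

Normalizing `extras` once, rather than wrapping each access in try/except, keeps the local-data path intact and avoids the crash when QCA returns None.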

Discussion of OpenFF QCA dataset standards

  • LW: This is a public-facing document, so we should decide what we do want to follow and remove anything we don’t want to follow. Luckily, we have been following one part, which is accurately marking our datasets as not following the guidelines

  • JW: This basically got dropped after Simon left. I have no particular attachment to these standards, but we should have some standards

  • Specific discussion points

    • LW: Not sure what the document means by “meaning of molecule names”. We don’t name the molecules

      • JW: We used to have [SMILES]-[conf number] for the benchmark project

      • LW: Is that a default for QCSubmit?

    • JW: Changelog doesn’t really make sense, we don’t change our datasets

    • JW: Blacklist doesn’t make sense to me, need more info. If there were problems with our dataset, we’d submit a new one with a new version

      • LM: Maybe bad QCA IDs, to filter out after the fact?

      • LW: Maybe, that would have to be added after the fact

      • Let’s ask Trevor

    • Dataset status:

      • LW/JW: These are a pain and no one does it, let’s get rid of it

      • LW: We could use the same tags as the project board

      • JW: There could be a way to automate it with the project board, but I don’t know how; the updates to projects are confusing

    • Will continue discussion offline and ping Trevor

 

 

Action items

Decisions