2024-10-22 QCA dataset submission meeting

Participants

@Alexandra McIsaac
@Lily Wang
@Brent Westbrook
@Jeffrey Wagner

Discussion topics

Item	Notes
Compute management	JW – I think I’ve left responsibilities vague/undefined while going on/offline a lot over the past 2 weeks. I’d be happy with a range of outcomes, including me managing everything or Lily managing everything. Status quo seems to be me running old workers and Lily handling pyddx, which I’m also fine with, though this may be wasteful of two people’s attention.
	LM – Was running my own workers since JW was offline.
	LW – I think it could make sense for dataset owners to be ultimately responsible for the computation of their dataset, either by managing their own workers or by communicating with a central person.
	JW – This sounds like a good approach. For future datasets, I’ll expect the submitter/owner to contact me to manage workers, otherwise I will assume they’ll run their own.
	JW will take over Lipid MAPS worker management and continue running workers for the phosphate torsiondrive dataset.
What happened to `qc_record.extras`?	JW/BW – https://github.com/openforcefield/openff-bespokefit/issues/369
	BW – Seems like the “extras” check was added to handle local data, and “extras” doesn’t contain important stuff for QCA datasets. Should probably leave the check so that it can handle local data, but not worried about the fact that so many “extras” are missing. Weird thing is that QCA used to set “extras” to an empty dict, but now returns None.
	JW – Bespokefit mostly uses local data, so definitely don’t want to get rid of this.
	BW – Don’t want to get rid of the extras check, but should put it behind a try/except or some other workaround in case it doesn’t exist.
	Severity:
	  General bespokefit usage - Unsure
	  General QCArchive/QCSubmit usage - Might encounter this data if loading any data from QCArchive (maybe data created before/after a certain time point)
	  Main-line force field fitting -
	BW – Note that the line that looked at `extras['id']` instead of the `id` attribute directly was just a hacky fix for local data access. It’s NOT scary that `extras` is empty/None for QC datasets, since it was always unpopulated.
	LM – I found that some of my datasets had things in the `extras` field, but it was stuff like compute tags. I only encountered this in new datasets like torsion multiplicity (TorsionDrive, failed), but not in the sulfur dataset (Optimization, succeeded).
	JW – So,
	  All old datasets are “good” with current bespokefit logic
	  New Opt datasets are “good” with current bespokefit logic
	  New TD datasets are “bad” with current bespokefit logic (?)
	LW – Did QCArchive just update the server version?
	BW – I think so. Until recently I was getting server/client version mismatches with the latest client, but now I don’t get that with the latest client, and I queried the server yesterday and saw it was running 0.56.
	LM – But my dataset, which I downloaded yesterday, had empty dictionaries for most entries, but `None` for the torsion multiplicity datasets.
	BW – And this same dataset used to work.
	JW – Ok, so it seems like this is something that changed on QCA (but I will confirm next week with BP). Our remedy is to code around it, and BW is already working on a PR for this.
	(General) – Agree
	Implications for broader FF fitting?
	  LM – I think there are implications. In order to do FF fitting, we need to change bespokefit. This will affect our pipeline.
	  JW – Any chance this gives a silent numerical change instead of an error?
	  BW + LM – No, this is a loud error, so the issue is a complete crash and not irreproducible/numerically changed results.
	  LW – Could you make a new release of bespokefit once this is patched?
	  JW – Yes, absolutely.
	  JW – Is this blocking FF fitting work?
	  BW + LM – Can just proceed with modified local code for now, but not best practice.
	  JW – I’ll review the PR ASAP and will make a release soon after.
Discussion of OpenFF QCA dataset standards	LW – This is a public-facing document, so we should decide what we do want to follow, and remove anything we don’t want to follow. Luckily, we have been following one part, which is accurately marking our datasets as not following the guidelines.
	JW – This basically got dropped after Simon left. I have no particular attachment to these standards, but we should have some standards.
	Specific discussion points:
	  LW – Not sure what the document means by the “meaning of molecule names”? We don’t name the molecules.
	  JW – We used to have [SMILES]-[conf number] for the benchmark project.
	  LW – Is that a default for QCSubmit?
	  JW – Changelog doesn’t really make sense, we don’t change our datasets.
	  JW – Blacklist doesn’t make sense to me, need more info. If there were problems with our dataset, we’d submit a new one with a new version.
	  LM – Maybe bad QCA IDs, to filter out after the fact?
	  LW – Maybe, that would have to be added after the fact.
	  Let’s ask Trevor.
	Dataset status:
	  LW/JW – These are a pain and no one does it, let’s get rid of it.
	  LW – We could use the same tags as the project board.
	  JW – There could be a way to automate it with the project board, but I don’t know it, the updates to projects are confusing.
	Will continue discussion offline and ping Trevor.
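
Sketch for the `qc_record.extras` item above: one possible shape for the "code around it" workaround, written as plain Python so it runs without QCPortal or bespokefit installed. This is not the actual bespokefit patch. `FakeRecord`, `get_extras`, and `get_record_id` are hypothetical names for illustration, and the assumption that local data stores its identifier under `extras['id']` is taken only from BW's note.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FakeRecord:
    """Hypothetical stand-in for a QC record; NOT a real QCPortal class."""

    id: Any
    extras: Optional[dict] = None


def get_extras(record: Any) -> dict:
    """Return the record's extras, treating None or a missing attribute as {}."""
    extras = getattr(record, "extras", None)
    return extras if extras is not None else {}


def get_record_id(record: Any) -> Any:
    """Prefer extras['id'] (assumed local-data convention), else use record.id."""
    extras = get_extras(record)
    if "id" in extras:
        return extras["id"]
    return record.id


# The three cases discussed in the meeting:
old_style = FakeRecord(id=101, extras={})                     # older QCA records: empty dict
new_style = FakeRecord(id=102, extras=None)                   # newer TD records: None
local_style = FakeRecord(id=None, extras={"id": "local-7"})   # local data: id kept in extras

for rec in (old_style, new_style, local_style):
    print(get_record_id(rec))
```

Normalizing `None` to an empty dict in one helper keeps the local-data path intact while avoiding the crash on records whose `extras` is `None`. A try/except around the lookup, as BW suggested, would be an equally reasonable alternative.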
