2024-11-05 QCA dataset submission meeting

Participants

  • @Alexandra McIsaac

  • @Lily Wang

  • @Jeffrey Wagner

  • @Brent Westbrook

Discussion topics


Compute management

  • Discuss splitting datasets by molecule size for NRP worker management purposes

    • When submitting datasets with a large range in molecule size, it would help to submit them as two (or more) separate datasets to make NRP deployments easier to manage.

    • Can’t request large-memory NRP deployments if half the jobs only use a small amount of memory – NRP will get mad about the wasted allocation.

    • So, could submit big datasets as multiple sets binned by size, and then once they’re all done, submit them as one big dataset.

    • LM – Should make this a standard?

      • LW – Should write this down somewhere.

    • LM – Could bin at widths of 300 Da – molecules below that ran on 8 GB.

    • JW – I usually think in terms of heavy atoms - like everything under 15 HA could be “small”, then 15-25 could be “medium”, and then bins of 10 above that.

    • LW – I like the dalton idea

    • JW – Ok, that would work too - 20 carbons is 300ish daltons.

    • LM – I’ll add this to the repo README (see the binning sketch below).

      • LW – Please do.
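
    A minimal sketch of the proposed binning, assuming RDKit for the molecular-weight calculation (the 300 Da bin width comes from the discussion above; the helper name is hypothetical):

        from collections import defaultdict

        from rdkit import Chem
        from rdkit.Chem import Descriptors

        def bin_by_weight(smiles_list, bin_width=300.0):
            """Group SMILES strings into molecular-weight bins `bin_width` Da wide."""
            bins = defaultdict(list)
            for smiles in smiles_list:
                mol = Chem.MolFromSmiles(smiles)
                if mol is None:
                    continue  # skip unparseable entries
                bins[int(Descriptors.MolWt(mol) // bin_width)].append(smiles)
            return dict(bins)

    Each bin would be submitted as its own dataset (e.g. “…0-300 Da”, “…300-600 Da”) and the results recombined into one dataset once all bins finish. JW’s heavy-atom variant would swap Descriptors.MolWt(mol) for mol.GetNumHeavyAtoms(), with bin edges at 15, 25, 35, ….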

Update dataset tracking

https://github.com/orgs/openforcefield/projects/2/views/1

  • Update on in-flight sets by their compute owners

    • BW – LipidMAPS

      • Running slowly; it would help to split it up, but it’s difficult to find the right worker size due to the range of molecule sizes.

      • JW – Problem where errored molecules get added to the top of the stack to be re-submitted, but those errors were all OOM errors, so no small jobs could even start.

      • LW – Could we chunk this out now?

      • JW – Seems to be working OK for now, since these large jobs all got added to the front of the queue, so utilization is more uniform now (maybe temporarily).

      • BW – I’m ok keeping this one the way it is. It has stratified already so we’re handling the biggest jobs now. But for future sets I’m a fan of splitting.

      • JW – Blocker now is some NRP-specific stuff, not having to do with dataset size binning

    • MLPepper

      • BW – We don’t have workers going on this yet. I’ve been talking with CAdams; there’s some confusion about what changed in the environment when installing the OpenFF Toolkit that made the QC jobs work. CAdams said that qcfractalcompute and qcportal DOWNgraded (from 0.56 to 0.54.1) when he installed the toolkit, and this made jobs work. I’ll bring this up in the MolSSI meeting.

      • LW – Did other packages change (like numpy or pint)?

        • pint went from 0.23 (works) to 0.24 (broken)

        • numpy went from 1.26 to 2.1.3 (numpy 1 → 2 was a big technical change in some ways – can’t recall if we would expect it to affect defaults for e.g. allclose)

        • dftd3 1.1 → 1.2?


      • BW – I haven’t done a complete diff of the envs that CAdams posted, so it could be something else (see the version-diff sketch below).

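      For reference, a minimal sketch (hypothetical helper, standard library only) for diffing two exports of the form produced by conda list --export, to surface version changes like those above:

          def diff_env_exports(path_a, path_b):
              """Report packages whose versions differ between two
              `conda list --export` files (name=version=build lines)."""

              def load(path):
                  versions = {}
                  with open(path) as handle:
                      for line in handle:
                          line = line.strip()
                          if not line or line.startswith("#"):
                              continue  # skip comments and blank lines
                          parts = line.split("=")
                          if len(parts) >= 2:
                              versions[parts[0]] = parts[1]
                  return versions

              a, b = load(path_a), load(path_b)
              for name in sorted(set(a) | set(b)):
                  if a.get(name) != b.get(name):
                      print(f"{name}: {a.get(name, 'absent')} -> {b.get(name, 'absent')}")

          # e.g. diff_env_exports("env_broken.txt", "env_working.txt")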

  • Remove tracking label from datasets that are done being computed

    • Done

  • Add unmerged dataset PRs to backlog/queued for submission.

    • Done

  • JW will remove all dataset cards from the project board except for active/backlogged datasets, and will remove the archived/complete column

Discussion of OpenFF QCA dataset standards

  • PR with draft of new Standards:

  • Slack discussion/summary: https://openforcefieldgroup.slack.com/archives/CFA4NL63E/p1730445385693869?thread_ts=1729699202.251809&cid=CFA4NL63E

  • Finalize discussion of:

    • Filtering out/blacklisting problematic QCAIDs

      • LW (chat) – I think people do use our datasets without talking to us. Arguable if we want to encourage that but e.g. XFF comes to mind :-)

      • JW – I’m slightly in favor of multiple versions of datasets without problematic entries instead of adding a new type of thing to filter as a “blacklist”.

      • LM – I kinda agree; I was thinking it would help outsiders coming to old datasets with known problems know what to leave out.

      • BW – I think it’ll be good to maintain a changelog and direct people to the most recent version

      • LW – All good ideas. I’m not necessarily in favor of a blacklist, but rather of communicating clearly that we don’t use some records and why; many of these approaches would solve that problem. This isn’t just for external users – it’ll help US keep this knowledge over the years.

      • LM – Should we ever delete problematic records from QCA?

      • JW – No, strongly against.


    • Release vs development datasets

      • LM – TG had suggested that we release fitting datasets for each FF (i.e., the final entries/records/mols that we ACTUALLY used, after filtering). My slight concern here is provenance – e.g. once we use NAGL, different mols will pass/fail the charge checks, and we fixed the conformer RMSD filter, so the behavior changed there. So maybe we could have “master” datasets that we maintain filtered versions of.


      • LM – Maybe I’m proposing maintaining TWO datasets for each release: one unfiltered, another filtered. Then maybe we periodically re-filter and make new releases of the latter as our infrastructure changes. E.g. we’d have a sage 2.2 “unfiltered” dataset, and that would be like the sage 2.0 “filtered” dataset.


      • LM – So (and this is TG’s idea): for each FF release, we release one nice clean dataset that doesn’t need filtering and contains the full fitting set. But as we’re doing filtering for the future datasets…

      • LW – Right. We should provide complete+filtered release datasets. Since our fitting datasets are composed of ~10 smaller datasets, we’d combine them ahead of time into a mega-dataset (see the sketch below).

      • JW – I think it’s ok if filtering gives different results on different days.

      • LM – I think this is a discussion of convenience/usability more than reproducibility.

      • LW – I don’t see much downside here, but also not much upside. Basically this avoids needing to re-fetch lots of little datasets, which I’m weakly in favor of.

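      A minimal sketch, assuming openff-qcsubmit result collections, of how an unfiltered/filtered release pair could be produced; the dataset names and the filter stack are placeholders, not the actual sage inputs:

          from qcportal import PortalClient

          from openff.qcsubmit.results import OptimizationResultCollection
          from openff.qcsubmit.results.filters import (
              ConformerRMSDFilter,
              ConnectivityFilter,
          )

          client = PortalClient("https://api.qcarchive.molssi.org:443")

          # Combine the component fitting datasets into one "unfiltered" release.
          unfiltered = OptimizationResultCollection.from_server(
              client=client,
              datasets=[
                  "Example Optimization Set 1",  # placeholder names
                  "Example Optimization Set 2",
              ],
              spec_name="default",
          )

          # The "filtered" release applies the current filter stack; re-running
          # this later may give different results as the filters evolve.
          filtered = unfiltered.filter(
              ConnectivityFilter(tolerance=1.2),
              ConformerRMSDFilter(max_conformers=10),
          )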

    • Metadata

      • LM – TG was in favor of putting pretty much the entire repo into QCA metadata. We didn’t see much downside to doing that – slightly annoying but not a big deal, especially if we just do it for release datasets.

      • LW – This could be automated as part of the QCSubmit dataset factory (see the sketch below). It’d be a bit of work, but then it’d be done for good.

      • LM – Could open an issue to do this, and add it when possible. In the meantime, we can agree that we’ll add this to the next major release (sage 2.3)

      • JW – Weakly in favor since this keeps relevant metadata with the actual data in the event that we can’t use GitHub.

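      A minimal sketch of what that automation might look like; the factory usage follows QCSubmit’s public API, but the commit-pinned URL is an illustrative assumption, not a real submission path:

          from openff.qcsubmit.factories import OptimizationDatasetFactory
          from openff.toolkit import Molecule

          factory = OptimizationDatasetFactory()
          dataset = factory.create_dataset(
              dataset_name="Example Release Dataset v1.0",
              molecules=[Molecule.from_smiles("CCO")],  # placeholder input
              description="Example release dataset.",
              tagline="Example tagline.",
          )

          # Point the QCA metadata at a commit-pinned copy of the submission
          # files so the provenance survives independently of GitHub.
          dataset.metadata.long_description_url = (
              "https://github.com/openforcefield/qca-dataset-submission"
              "/tree/abc1234/submissions/2024-11-05-example-dataset"
          )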

    • Organizing FF fitting repos on GH

      • JW – We kinda have this at https://openforcefield.org/force-fields/force-fields/

      • LM – TG points out that you need to know what the repo is called to find it. He didn’t know that there WAS a 2.2.0 fitting repo, or how to find the dataset for it.

      • LW (chat) – Maybe the answer is to link to the site page from the GitHub landing page?

      • LM – Good idea.

      • LW (chat) – I think we also link to all the repos from the openff-forcefields releases

      • (General) – Would be good to add links to the sage-2.y.z repos and releases somewhere central, like the table in the openff-forcefields README.

      • LM – I’ll take care of this


Action items

Decisions