2024-12-17 QCA dataset submission meeting

Participants

  • @Lily Wang

  • @Jennifer A Clark

  • Alexandra McIsaac

  • Chris Ryan

  • @Brent Westbrook (Unlicensed)

  • @Jeffrey Wagner

  • @Matt Thompson

Discussion topics

Item

Notes

Achira dataset proposals

  • LM – Current proposal is to expand parts of the SPICE2 dataset, and we don’t have compute/infra up yet. Wondered if we could collaborate and use QCSubmit/NRP. First idea is to do an opt dataset with a very limited number of steps on some or all of SPICE2. Would love to collaborate as much as you’re able to, datasets would be open source and hosted on QCA.

    • LW – Some technical Qs

    • SPICE1 hit some storage space limits on QCArchive - For this dataset, would it be all of SPICE2?

      • LM – Currently unsure, wanted to generate more data close to thermally accessible structures, but haven’t decided yet what fraction of structures we’d want to start with.

      • CR – Basically same understanding here.

      • LW – This would be helpful, would bring back to lead team

    • LW – Re: your own runners - are these for certain coming/is there a rollout date planned?

      • CR – We’re working on some things on our end to get compute set up.

    • LW – For dataset management, would the goal be to use Q-D-S, or just directly submit to MolSSI QCA, etc?

      • LM – Open to a few options.

      • CR – Can easily be flexible here.

  • LW – We’re generally open to work with you, but filling in more details would be great. Unfortunately we’re shutting down next week so wouldn’t be able to get things going until Jan if we don’t submit this week. What’s your level of urgency?

    • LM – I think we’d like to get things going before the break, but will get back to you. We’re also off the next two weeks.

    • CR – We’re meeting with folks to talk about compute resources later this week, unsure about exactly how urgent/what target dates are.

  • JW – BW do you recall

    • BW – Under 300 Da is easy for us to run; 300-600 Da gets complicated. If the whole dataset is under 400 Da, it’s probably easy for us at our default level of theory.

    • LW: SPICE2 is ~2M conformations

    • JW: could probably do between 10-100k opts over the holidays

    • JW: may have to sort out storage directly with MolSSI for the entire SPICE2 dataset

    • BW (in chat): For 300-600, 4 CPUs/32 GB RAM has worked well without really any intervention. 4/20 GB for <300 Da

    • LM – Sounds good, we’ll get more info and get back to you
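The sizing figures quoted above can be sketched as a small lookup plus a coverage estimate. This is only an illustration of the numbers in the notes: the function names are hypothetical, treating exactly 300 Da as part of the 300-600 Da tier is an assumption, and nothing above 600 Da was discussed, so the sketch returns None there.

```python
from typing import Optional, Tuple

# "~2M conformations" from the notes; an approximate figure, not an exact count.
SPICE2_CONFORMATIONS = 2_000_000


def resources_for_mass(mass_da: float) -> Optional[Tuple[int, int]]:
    """Return (cpus, ram_gb) per BW's chat message, or None above 600 Da.

    Hypothetical helper: <300 Da -> 4 CPUs / 20 GB, 300-600 Da -> 4 CPUs / 32 GB.
    The boundary at exactly 300 Da is assigned to the larger tier (assumption).
    """
    if mass_da < 0:
        raise ValueError("mass must be non-negative")
    if mass_da < 300:
        return (4, 20)
    if mass_da <= 600:
        return (4, 32)
    return None  # not covered in the discussion


def coverage_fraction(n_opts: int, total: int = SPICE2_CONFORMATIONS) -> float:
    """Fraction of the dataset covered by n_opts optimizations."""
    if total <= 0:
        raise ValueError("total must be positive")
    return n_opts / total


if __name__ == "__main__":
    low = coverage_fraction(10_000)
    high = coverage_fraction(100_000)
    print(f"{low:.1%} to {high:.1%} of SPICE2")
```

With the 10-100k range JW mentioned, this works out to roughly 0.5% to 5% of SPICE2 over the holidays.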

MolSSI info

  • JW: BP can’t (currently) directly measure how much storage space a single record takes up. He can only compare the disk usage before and after a dataset is added and divide by the number of entries.

  • JCl: could people host data anywhere else? Do we have a sense of hosting costs and alternatives?

  • JW: you can self-host a QCArchive. Not sure I have a good sense of costs. We essentially have free storage with MolSSI and free compute with NRP, while we pay $15/month to GitHub for the repo.
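The before/after estimate JW describes is simple arithmetic, sketched here for reference. The function names and the example byte counts are hypothetical placeholders, not measurements from MolSSI, and the result is only an average over the dataset, not the size of any individual record.

```python
def bytes_per_record(usage_before: int, usage_after: int, n_entries: int) -> float:
    """Average storage per entry from disk usage before and after adding a dataset."""
    if n_entries <= 0:
        raise ValueError("n_entries must be positive")
    if usage_after < usage_before:
        raise ValueError("usage_after should not be smaller than usage_before")
    return (usage_after - usage_before) / n_entries


def projected_bytes(per_record: float, n_records: int) -> float:
    """Extrapolate the average to a dataset with n_records entries."""
    return per_record * n_records


if __name__ == "__main__":
    # Placeholder numbers purely for illustration.
    avg = bytes_per_record(usage_before=1_000, usage_after=3_000, n_entries=4)
    print(avg, projected_bytes(avg, 10))
```

An estimate like this could be scaled by the number of conformations being considered to get a rough storage figure before talking to MolSSI, keeping in mind that record sizes may vary a lot between datasets.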

Update dataset tracking

https://github.com/orgs/openforcefield/projects/2/views/1

  • Update on in-flight sets by their compute owners

Follow-up on PR

  •  

Update on clean force field releases

  • JC – Summary of what was done, PR is mergeable now -

    • using BW’s qcaide

 

 

Action items

Decisions
