2020-08-12 All-hands meeting notes

Date

Aug 12, 2020

Participants

  • @Karmen Condic-Jurkic, @Trevor Gokey, @David Dotson, @Jeffrey Wagner, @John Chodera, @Simon Boothroyd, @Jessica Maat, @David Hahn, @Michael Shirts

Meeting recording

Discussion topics

Item

Presenter

Notes

Dataset structuring

@Jeffrey Wagner

  • JW – As an “open science, open data, open source” project, we need to improve our dataset handling. Now is a good time to work on it, after spending some time on infrastructure and on producing good force fields

    • We should aim for consistency and simplicity, without writing a fat document about standards

    • Many people working on data have come up with multiple good solutions – I want to arrive at one agreed way of doing it, plus an operational model so that it actually gets done

    • Goal – a brief checklist to follow when making datasets

    • Slides

    • Converting QC molecules to 3D graph molecules and energies has some complexities – there are two paths for pulling graph molecules down from QCSchema

    • We have multiple types of datasets – SMILES, unlabeled 3D graph molecules (SDF), QCSchema molecules, labeled 3D graphs

    • The first step, converting the concept of a molecule into a 2D graph, is still a bit rough

    • Hyesu keeps a provenance of fitting datasets submitted to QCArchive (graph molecules + ID)

    • What is a dataset? – It is useful to include a brief description in the file, e.g. “this dataset contains X conformers of Y unique molecules”

  • JDC – Jeff, does this approach call for an API and for creating objects that can be read according to certain specifications?

  • JW – The tricky point is the gap between where we need to be tomorrow and where we are today. We can’t put out a dataset that only we can read; that wouldn’t be very open.

  • JDC – Many cheminformatics folks have already solved that problem and have their own data storage object. OE has done it, but they don’t expect it to become a standard (there are no incentives).

  • SB – I wanted to make a similar point: once you have your own API point, whatever that might be, you can start writing scripts around it, and you can always go back and change it if you know what input is expected. This doesn’t have to be wrapped in Python; it can be a simple CLI.

  • JDC – It’s not a perfect solution, but it prevents us from shoehorning everything into this terrible SDF format forever.

  • JW – We could switch to OFFmols by the Sage release, but probably not sooner. We need to check how this will fit into our development plans.

  • JDC – We need to decide what this data needs to do before we do anything. Having a huge tarball full of offmol files might not be useful. What would be useful to other people?

  • JW – Trying to understand the scope of this… – for example, pulling molecule data from QCArchive with hessians…

  • JDC – Do we create a copy of QCSchema or create a lossy copy? Those are the questions we want to think about when we do conversions.

  • JW – We recently realised that metadata attached to QCArchive/QCSchema is a string, and anything else brings some additional complications.

  • SB – What information is missing in QCA? That’s a living database and it’s constantly migrated, so embedding the extra information there would be the best way to go.

  • JW – We lack bond orders and

  • SB – Potentially host a QCA mirror and have a more detailed metadata schema.

  • TG – QCA has so much data, it’s about how you want to organize it.

  • SB – That might be the best way forward, and build a RESTful API, etc.

  • DD – QCA has a QCElemental model – would it make sense to push QCA to support a 2D graph model?

  • JW – I talked to Levi about it and he’s happy for us to own this part; he’d help with it. There’s a lot of complexity in 2D and 3D graphs, and validating those CMILES can be a challenge.

  • JDC – We’ll have to spend some time verifying those cheminformatics tools. We can verify our own datasets, but it’s harder to validate external datasets.

  • JW – The validator validates, we’re not turning it off for our own datasets.

  • JW – A good eventual goal would be connecting graphs to CMILES in some way (aromaticity interpretation can be tricky, for example)

    • Maybe we should use a refactored operation of QCMolecule

  • TG – Since we’re always converting QCMolecule to OFFMolecule, we need to think about what we want to keep. I see this conversion path as a potential problem.

  • JW – Keeping a connection table is probably good.

  • TG – You can submit the same geometry with different connectivity to QCA and QCA deduplicates it because it considers it the same molecule.

  • JW – We can discuss this more later.

  • JW – In the process of making datasets, I started capturing the data used for a particular task, in case things change in the meantime and we can’t exactly recreate these datasets. I then derive other information from it, like CMILES, etc.

  • JW – Operational issues:

    • Unstructured explorations

    • Custom analysis code isn’t well versioned, and well-versioned code can’t be customized – these are issues internal to OpenFF, not so much external-facing ones

    • JDC – Anything that makes it into the main branch of FFs needs to have its bolts tightened in terms of provenance / data production (erring on the conservative side).

    • JW – Our operational model incentivizes us to put our complexities somewhere (for example, openforcefield-forcebalance) and then forget about them.

    • DD – There is no “end-all” solution. I think we’re already doing the right thing – doing what we can, when we can. You did a good job outlining the big picture here, which helps us decide what needs additional hammering. I think you’re already doing a good job here, just to be reassuring.

    • JW – That’s how QCSubmit was made: taking a whole range of manual submissions, figuring out what we might want in a regular submission, and automating it.

    • JW – We might have to come up with a set of rules for interacting with data, a checklist, or a data czar.
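As a minimal sketch of the “brief description” idea raised above (not an agreed OpenFF format), the one-line summary could be generated from the dataset records themselves rather than written by hand. The `ConformerRecord` type and the `describe_dataset` function are hypothetical names introduced for illustration, and using a SMILES string as the per-molecule identifier is an assumption.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class ConformerRecord(NamedTuple):
    """Hypothetical record: one conformer of one molecule."""

    molecule_id: str      # assumed to be unique per molecule, e.g. a canonical SMILES
    conformer_index: int  # index of this conformer within its molecule


def describe_dataset(records: Iterable[ConformerRecord]) -> str:
    """Build the one-line dataset description from the records."""
    # Count how many conformers each molecule identifier has.
    conformers_per_molecule = Counter(r.molecule_id for r in records)
    n_conformers = sum(conformers_per_molecule.values())
    n_molecules = len(conformers_per_molecule)
    return (
        f"This dataset contains {n_conformers} conformers "
        f"of {n_molecules} unique molecules."
    )


# Toy input: two conformers of ethanol and one of benzene.
records = [
    ConformerRecord("CCO", 0),
    ConformerRecord("CCO", 1),
    ConformerRecord("c1ccccc1", 0),
]
print(describe_dataset(records))
```

Note that the count of “unique molecules” is only as good as the identifier: if two records describe the same molecule with different strings, they are counted separately, which is the same class of problem as the CMILES validation and deduplication issues discussed above.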

FE calculations success

@John Chodera

  • ML/MM preprint comes with great results for Tyk2 with OpenFF 1.0.0

  • An additional set of results for Jnk1 from the JACS benchmark set came in yesterday and confirms this finding – very impressive!

     

  • Maybe also saving the world with covid-moonshot!

Show & tell (internal comms)

@Karmen Condic-Jurkic

  • All-hands meetings with a “show & tell” part – quick informal updates from folks on what they are up to – with volunteers and a designated presenting team (by topic?) – thoughts?

  • JW – Our meeting notes are remarkably good – would anyone object to making them public?

  • MS – I think 99% is OK to make public, but we need to be careful not to include sensitive data.

  • KCJ – Internal- vs. external-facing levels

  • JW – Different levels of communication – “show & tell” with some flaws (fun, informal), and then publicly sharing this information (serious, formal)

  • All-hands meeting once a month

Website release

 

  • Scheduled for Aug 18

Misc

 

@Matt Thompson is off the grid (literally) – all his meetings are cancelled until further notice – contact @Jeffrey Wagner if needed

Action items

@Karmen Condic-Jurkic will update the frequency of the all-hands meeting to every 4 weeks.

Decisions