2020-02-25 QM dataset selection

Date

25 Feb 2020

Participants

Goals

Determine new procedure for selecting QM datasets for fitting (potentially for May meeting release, if ready in time)
Divide up work to accomplish that

Notes:

Coverage of current QM dataset (Su)

127 of 163 torsions currently covered; that’s the largest issue of current sets
Bonds & angles pretty good; main gaps for *-I bonds
Looking at the NCI250K dataset, filtered to remove molecules with more than 20 heavy atoms; that leaves 135k SMILES.
Next step: check which torsions it covers, and check its coverage of the other valence parameters.
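
A minimal sketch of the kind of coverage check described above, assuming each molecule has already been labeled with the parameter IDs it uses. The function name `coverage_report`, the toy SMILES, and the parameter IDs are hypothetical, and the labeling step itself is not shown.

```python
from collections import Counter

def coverage_report(all_parameter_ids, labeled_molecules):
    """Count how many molecules use each parameter and list the gaps.

    labeled_molecules maps a molecule identifier (e.g. a SMILES string) to
    the collection of parameter IDs assigned to that molecule.
    """
    counts = Counter()
    for param_ids in labeled_molecules.values():
        counts.update(set(param_ids))  # count each molecule once per parameter
    covered = [p for p in all_parameter_ids if counts[p] > 0]
    missing = [p for p in all_parameter_ids if counts[p] == 0]
    return counts, covered, missing

# Hypothetical toy data: four torsion parameters, three labeled molecules.
all_torsions = ["t1", "t2", "t3", "t4"]
labels = {"CCO": {"t1", "t2"}, "CCN": {"t1"}, "CCCC": {"t1", "t3"}}

counts, covered, missing = coverage_report(all_torsions, labels)
print(f"{len(covered)} of {len(all_torsions)} torsions covered; missing: {missing}")
```

The same report applies to bonds and angles if their parameter IDs are passed in instead of the torsion IDs.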

DLM: Other available datasets – note “discrepancies” set of molecules with geometries which disagree across FFs. It is overenriched in certain chemistries though so you wouldn’t want to use it as broadly representing all molecules.

DLM: Do we have to worry about diversity? If you had the choice of many molecules which would give you good coverage you would prefer the ones which give you diversity.

DLM: Other sets to look at: Bayer patent collection (all of the molecules they have patented). Su has checked it and tried to generate a torsion dataset from it, but it doesn’t cover much; it covers less than the coverage set.

Agreed big picture goals for dataset selection:

Achieving any coverage of missing torsions
Adequate coverage of all torsions — at least five molecules per torsion
Chemical diversity in coverage of parameters: The set of molecules using a given parameter is chemically diverse in the other portions of the molecule.

DLM: We should use any means necessary to achieve the first two goals (any coverage of missing torsions, then adequate coverage of all torsions), even if it means drawing or designing molecules manually.

Bayer set might be useful for generating diversity of molecules we already cover.

DLM: Possible two pass algorithm, one pass going for diversity and the other for coverage (order unclear).
DLM: We should be improving coverage even for bonded parameters, bringing the minimum counts higher.

DLM went over work he’d done for benchmarking in https://github.com/openforcefield/release-1-benchmarking/blob/master/QM_molecule_selection/divide_sets.ipynb

Rare parameters: those which occur in fewer than N molecules in training. In our first training there were 64 rare parameters with N=3.
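
As an illustration of the rare-parameter definition, here is a small sketch that flags parameters used by fewer than N molecules. The function name and the counts are hypothetical, and treating parameters with zero occurrences as rare is an assumption (they could equally be reported separately as coverage gaps).

```python
def rare_parameters(counts, all_parameter_ids, n=3):
    """Return the parameter IDs used by fewer than n molecules.

    counts maps parameter ID -> number of training molecules using it.
    Assumption: parameters absent from counts are treated as used by 0 molecules.
    """
    return sorted(p for p in all_parameter_ids if counts.get(p, 0) < n)

# Hypothetical per-parameter molecule counts.
example_counts = {"t1": 12, "t2": 2, "t3": 3, "b5": 1}
example_ids = ["t1", "t2", "t3", "t4", "b5"]

print(rare_parameters(example_counts, example_ids, n=3))
```

Raising `n` (for example to 5, matching the "at least five molecules per torsion" goal) gives the list of parameters that still need more molecules.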

Should we do reverse of that — generate diverse clusters, then pick training set molecules from all of the clusters?

Unclear what that would mean for coverage.

Lee-Ping: Possible way to deal with coverage: start with a very big list of molecules. Build a list of the molecules which use each parameter (being careful that, for torsions, they are exocyclic torsions), then cluster within that list by chemical similarity and pick diverse molecules which use that parameter (e.g. the five most diverse).
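
A rough sketch of this idea, under several assumptions: fingerprints are represented as plain Python sets of "on" bits, similarity is Tanimoto, and greedy max-min picking stands in for a real clustering step. All names (`molecules_by_parameter`, `pick_diverse`) and the toy data are hypothetical, and the exocyclic-torsion filter is not shown.

```python
from collections import defaultdict

def molecules_by_parameter(labeled_molecules):
    """Invert molecule -> parameter IDs into parameter ID -> list of molecules."""
    index = defaultdict(list)
    for mol, params in labeled_molecules.items():
        for pid in sorted(set(params)):
            index[pid].append(mol)
    return dict(index)

def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def pick_diverse(candidates, fingerprints, k=5, exclude=()):
    """Greedy max-min picking: repeatedly add the candidate whose highest
    similarity to the already-picked molecules is lowest."""
    excluded = set(exclude)
    pool = [m for m in candidates if m not in excluded]
    if not pool:
        return []
    picked = [pool[0]]  # seed with the first candidate (an arbitrary choice)
    while len(picked) < min(k, len(pool)):
        remaining = [m for m in pool if m not in picked]
        best = min(
            remaining,
            key=lambda m: max(tanimoto(fingerprints[m], fingerprints[p]) for p in picked),
        )
        picked.append(best)
    return picked

# Hypothetical toy data: four molecules that all use torsion "t1".
labels = {"A": {"t1"}, "B": {"t1", "t2"}, "C": {"t1"}, "D": {"t1"}}
fps = {"A": {1, 2, 3}, "B": {1, 2, 4}, "C": {7, 8, 9}, "D": {1, 2, 3, 4}}

index = molecules_by_parameter(labels)
print(pick_diverse(index["t1"], fps, k=3))
```

The `exclude` argument lets molecules already chosen for another parameter be skipped, which matters once the procedure is run over every parameter in turn.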

We discussed taking the Bayer and NCI250K sets plus “all” of data we currently have, then pick from there using protocol Lee-Ping just described.

Lee-Ping: Do we apply the procedure to each dataset separately, or to the aggregate dataset? If we do it for the aggregate dataset, it may bias towards the larger datasets, which may be bad since each dataset has its own internal similarities, etc. We want to make sure we get good coverage from various datasets, so let’s do it separately for each dataset. For those datasets which don’t cover some parameters, we’ll just skip those parameters.

Conclusion: Apply to each dataset separately. An additional plus of this approach is that it provides the beginning of an algorithm for what we do with new datasets which come in.
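
One way the per-dataset application could look, as a sketch only. It assumes each dataset has already been indexed as parameter ID -> molecules using it, takes the actual picking rule as a pluggable `select` function, and uses a trivial "first k" selector in the demo. The dataset names, molecule names, and function name are hypothetical.

```python
def select_per_dataset(datasets, parameter_ids, select, k=5):
    """Apply a selection rule to each dataset separately.

    datasets maps dataset name -> {parameter ID: [molecules using it]}.
    select(candidates, k) returns the chosen molecules for one parameter.
    Molecules already chosen anywhere are never chosen again.
    """
    already_picked = set()
    selection = {}
    for name, index in datasets.items():
        for pid in parameter_ids:
            candidates = [m for m in index.get(pid, []) if m not in already_picked]
            if not candidates:
                continue  # this dataset does not cover the parameter: skip it
            chosen = select(candidates, k)
            selection[(name, pid)] = chosen
            already_picked.update(chosen)
    return selection

# Hypothetical toy data: two datasets, three parameters.
toy_datasets = {
    "set_A": {"t1": ["m1", "m2", "m3"], "t2": ["m2", "m4"]},
    "set_B": {"t1": ["m1", "m5"]},
}
first_k = lambda candidates, k: candidates[:k]  # placeholder selection rule

result = select_per_dataset(toy_datasets, ["t1", "t2", "t3"], first_k, k=2)
for key, mols in result.items():
    print(key, mols)
```

Because the exclusion set is shared across datasets, a new dataset that arrives later can be processed with the same loop, seeded with the molecules already selected.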

Exception: “heterocyclic aromatic rings of the future”. Let’s just save that for the future.

Do we patch up current coverage gaps, or begin “independently”? One concern may be computational cost. Avoid submitting something 1000x bigger than Roche set for example. 100x bigger probably upper limit, and good to keep it lower than that — same order of magnitude as Roche set.

End conclusion was to attempt to do it independently rather than fixing coverage gaps.

Lee-Ping: We need to build a master spreadsheet of our molecule sets somewhere (on Confluence!). Then once we have it we should be able to apply this dataset generation procedure to any set on that list. (Need to avoid duplication across sets and across parameters though.)

Optimized geometries and vibrational frequencies — same molecules. Covers bonds and angles.

Can we complete this procedure, create the master list of molecule sets, and also set up and run calculations by the early May meeting (with time to fit force fields)? We would want the data in good shape 2-4 weeks before the meeting itself; we will be in a good spot if we’re just waiting for QM calculations to finish at that point. Jessica and Hyesu to come up with more details of the plan and timeline and report back with what they think they can manage, so we’ll know if it’s going to be ready in time.

Plan:

Enumerate datasets we’re going to potentially draw from

Implement selection procedure to be used for each dataset which would:

Identify molecules using each parameter
For the set of molecules using some parameter, cluster by chemical similarity and then pick five molecules from the five most diverse clusters (making sure they don’t duplicate molecules we already picked for some other parameter) and then generate requisite QC inputs

We would do that procedure separately for each dataset (again watching for duplicates)
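
The cluster-then-pick step in the plan could be prototyped along these lines. This is a sketch under stated assumptions: fingerprints are sets of on-bits, clustering is a simple leader algorithm with a Tanimoto threshold, and "most diverse clusters" is approximated by visiting clusters from largest to smallest. None of these choices was settled in the meeting, and all names and data are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def leader_cluster(molecules, fingerprints, threshold=0.5):
    """Leader clustering: a molecule joins the first cluster whose leader
    (first member) it resembles at or above the threshold; otherwise it
    starts a new cluster."""
    clusters = []
    for mol in molecules:
        for cluster in clusters:
            if tanimoto(fingerprints[mol], fingerprints[cluster[0]]) >= threshold:
                cluster.append(mol)
                break
        else:
            clusters.append([mol])
    return clusters

def pick_from_clusters(clusters, k=5, exclude=frozenset()):
    """Take one not-yet-used molecule from each cluster, largest clusters first
    (an assumed ordering), until k molecules have been picked."""
    picks = []
    for cluster in sorted(clusters, key=len, reverse=True):
        choice = next((m for m in cluster if m not in exclude), None)
        if choice is not None:
            picks.append(choice)
        if len(picks) == k:
            break
    return picks

# Hypothetical toy data for the molecules using one parameter.
toy_fps = {
    "a": {1, 2, 3, 4}, "b": {1, 2, 3, 5},
    "c": {10, 11}, "d": {10, 11, 12},
    "e": {20},
}
toy_clusters = leader_cluster(["a", "b", "c", "d", "e"], toy_fps)
print(toy_clusters)
print(pick_from_clusters(toy_clusters, k=2, exclude={"a"}))
```

Passing the molecules already selected for other parameters as `exclude` handles the duplicate check mentioned in the plan.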

That would be our new fitting data

DLM: We likely also ought to run all of the Bayer set as benchmark data (lower priority). (Whatever of it we don’t use for fitting.)

Division of labor:

Who writes which code: Jessica clustering; Su parameter usage.

Who pulls together list of molecules: Jessica and Su, enlisting David as needed

Su & Jessica talk to Jeff about architecture

Su checking about how to bypass semiempirical calculation while enumerating protonation states and tautomers.

There’s a slow step: generating the JSON file from SDF files. Fragmenter is slow at doing that. Instead, read the SDF file directly and label parameters in it. We still might need to enumerate protonation states and tautomers; check with Chaya if needed. Avoid the semiempirical calculation, but enumerate protonation states and tautomers.

A lot of the dataset selection code can be done already, e.g. get prototype working on Roche set.

Jessica Maat to take DLM’s Jupyter notebook and produce a prototype example which takes a specific parameter, looks at where it occurs, then clusters as described above.

DLM to create overall Confluence “area” for this and put notes there and then Su and Jessica can create sub-pages as needed.

Talk with Jeff: Class structure? script? Where does it live?

Action items

Hyesu Jang to create a Confluence page listing all available datasets (with Jessica Maat, enlisting David Mobley as needed)
Jessica Maat to develop a prototype notebook which takes a FF, a set of molecules, and a target parameter (ID) and picks the five most diverse molecules using that parameter. It should also take an optional argument which is a list of molecules to exclude (so that molecules which have already been used in other sets can be skipped)
Jessica Maat and Hyesu Jang to reach out to Jeffrey Wagner to discuss the architecture of the tools to be constructed, the plan for sustainability, and where they should live. [Scheduled this meeting for Wednesday March 4 10 am -JM]
Hyesu Jang to determine how to enumerate protonation states and tautomers without doing semiempirical calculations (to speed set prep), talking to Chaya Stern if needed, or, if it can’t be done via that route, getting back to David Mobley for help with ideas
Jessica Maat and Hyesu Jang to come up with their goal timeline

Decisions

Decided to develop a systematic approach for selecting molecules for QM data generation & fitting given a target dataset; this will be applied dataset-by-dataset to select new molecules for use in fitting
Will attempt to select/redesign a new QM dataset for fitting rather than simply extending our prior QM dataset
Decided on tentative algorithm for molecule selection approach