2020-02-25 Chemical Perception meeting notes
Date
Feb 25, 2020
Participants
@David Mobley
@Lee-Ping Wang
@Hyesu Jang
@Jessica Maat (Deactivated)
Goals
Determine new procedure for selecting QM datasets for fitting (potentially for May meeting release, if ready in time)
Divide up work to accomplish that
Notes:
Coverage of current QM dataset (Su)
127 of 163 torsions are currently covered; that's the largest issue with the current sets
Bonds & angles pretty good; main gaps for *-I bonds
Looking at the NCI250K dataset; filtered to remove molecules with more than 20 heavy atoms, which leaves 135k SMILES.
So check that to see what it covers, and check the coverage of other valence parameters from that data.
DLM: Other available datasets: note the "discrepancies" set of molecules with geometries which disagree across FFs. It is overenriched in certain chemistries though, so you wouldn't want to use it as broadly representing all molecules.
DLM: Do we have to worry about diversity? If you had the choice of many molecules which would give you good coverage you would prefer the ones which give you diversity.
DLM: Other sets to look at: Bayer patent collection (all of the molecules they have patented). Su: She's checked them and tried to generate a torsion dataset, but it doesn't cover that much. Covers less than the coverage set.
Agreed big picture goals for dataset selection:
1. Achieving any coverage of missing torsions
2. Adequate coverage of all torsions: at least five molecules per torsion
3. Chemical diversity in coverage of parameters: The set of molecules using a given parameter is chemically diverse in the other portions of the molecule.
DLM: We should go to any means necessary to achieve goals 1-2 (any coverage of missing torsions, adequate coverage of all torsions), even if it means drawing or coming up with molecules manually.
Bayer set might be useful for generating diversity of molecules we already cover.
DLM: Possible two pass algorithm, one pass going for diversity and the other for coverage (order unclear).
DLM: Even for bonded parameters, we should be improving coverage. We should be bringing the minimum counts higher.
DLM went over work he'd done for benchmarking in release-1-benchmarking/QM_molecule_selection/divide_sets.ipynb (master branch of openforcefield/release-1-benchmarking)
Rare parameters: those that occur in fewer than N molecules in training. In our first training there were 64 rare parameters with N=3.
Should we do the reverse of that: generate diverse clusters, then pick training set molecules from all of the clusters?
Unclear what that would mean for coverage.
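The rare-parameter bookkeeping discussed above (parameters used by fewer than N molecules, plus the "at least five molecules" target) can be sketched roughly as follows. This is a minimal illustration, not the code in DLM's notebook: the molecule-to-parameter labels are made-up toy data, the parameter IDs ("t1", "t2", ...) are hypothetical, and treating "missing" (zero molecules) separately from "rare" is an assumption.

```python
from collections import defaultdict


def parameter_usage(labels):
    """Map each parameter ID to the set of molecules that use it.

    `labels` maps a molecule identifier (e.g. a SMILES string) to the
    collection of parameter IDs the force field assigns to it.
    """
    usage = defaultdict(set)
    for molecule, params in labels.items():
        for pid in params:
            usage[pid].add(molecule)
    return usage


def coverage_report(labels, all_parameters, n_rare=3, n_target=5):
    """Classify parameters by how many molecules in `labels` use them."""
    usage = parameter_usage(labels)
    counts = {pid: len(usage.get(pid, ())) for pid in all_parameters}
    return {
        # Not used by any molecule at all.
        "missing": sorted(p for p, c in counts.items() if c == 0),
        # Used, but by fewer than n_rare molecules.
        "rare": sorted(p for p, c in counts.items() if 0 < c < n_rare),
        # Used, but by fewer than the target number of molecules.
        "below_target": sorted(p for p, c in counts.items() if 0 < c < n_target),
    }


# Toy data: hypothetical parameter assignments, for illustration only.
labels = {
    "CCO": {"t1", "t2"},
    "CCN": {"t1"},
    "CCC": {"t1"},
    "c1ccccc1": {"t3"},
}
all_parameters = ["t1", "t2", "t3", "t4"]
report = coverage_report(labels, all_parameters)
```

In a real run the labels would come from whatever tool assigns force field parameters to each molecule; only the counting logic is shown here.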
Lee-Ping: Possible way to deal with coverage: Start with a very big list of molecules. Build a list of molecules which use each parameter (being careful that for torsions they are exocyclic torsions), then do clustering within that list. Chemical similarity clustering for all molecules which use each parameter, then pick diverse molecules which use that parameter (e.g. five most diverse).
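As a rough sketch of the "pick the most diverse molecules which use a parameter" step: the notes do not fix a clustering method, so the example below swaps in a greedy MaxMin selection on Tanimoto similarity as a simple stand-in. The feature sets are hypothetical placeholders for real fingerprints, the function names are illustrative, and the exocyclic-torsion check is not handled here.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pick_diverse(candidates, features, n_pick=5):
    """Greedy MaxMin selection.

    Start from the first candidate (sorted order, for reproducibility),
    then repeatedly add the candidate whose highest similarity to any
    already-picked molecule is the lowest.
    """
    pool = sorted(candidates)
    if not pool:
        return []
    picked = [pool.pop(0)]
    while pool and len(picked) < n_pick:
        best = min(
            pool,
            key=lambda m: max(tanimoto(features[m], features[p]) for p in picked),
        )
        picked.append(best)
        pool.remove(best)
    return picked


# Toy data: hypothetical feature sets standing in for fingerprints of the
# molecules that use one particular parameter.
features = {
    "A": {1, 2, 3},
    "B": {1, 2, 4},
    "C": {7, 8, 9},
    "D": {1, 2, 3, 4},
    "E": {7, 8},
}
users_of_parameter = ["A", "B", "C", "D", "E"]
diverse = pick_diverse(users_of_parameter, features, n_pick=3)
```

MaxMin avoids choosing a similarity cutoff, but it is sensitive to the starting molecule; a true clustering step (as proposed in the meeting) would be an alternative.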
We discussed taking the Bayer and NCI250K sets plus "all" of the data we currently have, then picking from there using the protocol Lee-Ping just described.
Lee-Ping: Do we apply the procedure to each dataset separately, or to the aggregate dataset? Doing it for the aggregate dataset may bias towards the larger datasets, which may be bad since each dataset has its own internal similarities, etc. We want to make sure we get good coverage from various datasets, so let's do it separately for each dataset. For those datasets which don't cover parameters, we'll just skip those parameters.
Conclusion: Apply to each dataset separately. An additional plus of this approach is that it provides the beginning of an algorithm for what we do with new datasets which come in.
Exception: "heterocyclic aromatic rings of the future". Let's just save that for the future.
Do we patch up current coverage gaps, or begin "independently"? One concern may be computational cost. Avoid submitting something 1000x bigger than the Roche set, for example. 100x bigger is probably the upper limit, and it would be good to keep it lower than that, within the same order of magnitude as the Roche set.
End conclusion was to attempt to do it independently rather than fixing coverage gaps.
Lee-Ping: We need to build a master spreadsheet of our molecule sets somewhere (on Confluence!). Then once we have it we should be able to apply this dataset generation procedure to any set on that list. (Need to avoid duplication across sets and across parameters though.)
Optimized geometries and vibrational frequencies: same molecules. Covers bonds and angles.
Can we complete this procedure, create the master list of molecule sets, and also set up and run calculations by the early May meeting (with time to fit force fields)? Would want data in good shape 2-4 weeks before the meeting itself; we will be in a good spot if we're just waiting for QM calculations to finish at that point. Jessica and Hyesu to come up with more details of plan and timeline and report back with what they think they can manage, so we'll know if it's going to be ready in time.
Plan:
Enumerate datasets we're going to potentially draw from
Implement selection procedure to be used for each dataset which would:
Identify molecules using each parameter
For the set of molecules using some parameter, cluster by chemical similarity and then pick five molecules from the five most diverse clusters (making sure they don't duplicate molecules we already picked for some other parameter) and then generate requisite QC inputs
We would do that procedure separately for each dataset (again watching for duplicates)
That would be our new fitting data
DLM: We likely also ought to run all of the Bayer set as benchmark data (lower priority). (Whatever of it we don't use for fitting.)
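The plan above (per-dataset selection, one molecule per cluster, no duplicates across parameters or datasets, leftovers as benchmark candidates) could be prototyped along these lines. This is a sketch under stated assumptions rather than the agreed implementation: greedy leader clustering with a 0.6 Tanimoto threshold is an arbitrary stand-in for the clustering step, "most diverse clusters" is approximated by taking clusters in discovery order, and all molecule names, parameter IDs, and feature sets are hypothetical toy data.

```python
from collections import defaultdict


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def leader_clusters(molecules, features, threshold=0.6):
    """Greedy leader clustering: a molecule joins the first cluster whose
    leader it resembles at or above `threshold`, else starts a new one."""
    clusters = []  # each cluster is a list; element 0 is its leader
    for mol in sorted(molecules):
        for cluster in clusters:
            if tanimoto(features[mol], features[cluster[0]]) >= threshold:
                cluster.append(mol)
                break
        else:
            clusters.append([mol])
    return clusters


def select_for_dataset(labels, features, already_picked, per_param=5):
    """Pick up to `per_param` molecules per parameter from one dataset,
    at most one per cluster, skipping molecules already selected."""
    usage = defaultdict(set)
    for mol, params in labels.items():
        for pid in params:
            usage[pid].add(mol)
    selection = {}
    for pid in sorted(usage):
        chosen = []
        for cluster in leader_clusters(usage[pid], features):
            fresh = [m for m in cluster if m not in already_picked]
            if fresh and len(chosen) < per_param:
                chosen.append(fresh[0])
                already_picked.add(fresh[0])  # shared across parameters/datasets
        selection[pid] = chosen
    return selection


def select_across_datasets(datasets, features, per_param=5):
    """Apply the selection to each dataset separately, sharing one
    'already picked' set so no molecule is selected twice."""
    picked = set()
    selection = {
        name: select_for_dataset(labels, features, picked, per_param)
        for name, labels in datasets.items()
    }
    return selection, picked


# Toy data, for illustration only.
features = {
    "m1": {1, 2, 3}, "m2": {1, 2, 3, 4}, "m3": {7, 8, 9},
    "m4": {5, 6}, "m5": {5, 6, 7},
}
datasets = {
    "set_a": {"m1": {"t1", "b1"}, "m2": {"t1"}, "m3": {"t1"}},
    "set_b": {"m3": {"t1"}, "m4": {"t1"}, "m5": {"t1"}},
}
selection, picked = select_across_datasets(datasets, features, per_param=2)
# Molecules of a dataset not used for fitting: benchmark candidates.
benchmark_candidates = set(datasets["set_b"]) - picked
```

Because the "already picked" set is shared, the order in which datasets and parameters are processed affects the result; that ordering is a design choice the notes leave open.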
Division of labor:
Who writes which code: Jessica clustering; Su parameter usage.
Who pulls together list of molecules: Jessica and Su, enlisting David as needed
Su & Jessica talk to Jeff about architecture
Su checking about how to bypass semiempirical calculation while enumerating protonation states and tautomers.
There's a slow step: generating the JSON file from SDF files. Fragmenter is slow doing that. Instead, read the SDF file directly and label parameters in it. Still, might need to enumerate protonation states and tautomers; check with Chaya if needed. Avoid the semiempirical calculation, but enumerate protonation states and tautomers.
A lot of the dataset selection code can be done already, e.g. get prototype working on Roche set.
@Jessica Maat (Deactivated) to take DLM's Jupyter notebook and produce a prototype example which takes a specific parameter, looks at where it occurs, then clusters as described above.
DLM to create overall Confluence "area" for this and put notes there, and then Su and Jessica can create sub-pages as needed.
Talk with Jeff: Class structure? Script? Where does it live?
Action items
The states module in fragmenter generates reasonable protonation / tautomer states. It uses quacpac and does not need AM1 calculations, so it is fast: https://github.com/openforcefield/fragmenter/blob/master/fragmenter/states.py
Decisions
- Decided to make systematic approach for selecting molecules for QM data generation & fitting given a target dataset; this will be applied dataset-by-dataset to select new molecules for use in fitting
- Will attempt to select/redesign a new QM dataset for fitting rather than simply extending our prior QM dataset
- Decided on tentative algorithm for molecule selection approach