Date

04 Mar 2020

Participants

Time

Item

Presenter

Notes

Dataset filtering

Maat and Jang need to write code to generate training data for the next generation of fitting
HJ – This will be based on taking a large dataset of compounds, clustering them, and picking molecules that give a lot of coverage
JM – Open to a lot of different options for how to do this
JH – Filtering could become a component in the submission workflow
JM – This dataset would be used for the Sage fit
JW – Are we approaching a limit on how much data we can feed into ForceBalance?
- JW + HJ – Let’s assume no upper limit for now
HJ – We will use the Roche set, the coverage set, and a new set
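
As a rough illustration of the "pick molecules with a lot of coverage" step, here is a minimal sketch. It is not the code Maat and Jang will write: it skips the clustering step entirely and swaps in a simple greedy set-cover heuristic. The input format (a dict from a molecule identifier to the set of torsion parameter IDs it exercises), the function name, and the `scans_per_param` argument are all hypothetical.

```python
from collections import defaultdict


def select_for_coverage(mol_params, scans_per_param=5):
    """Greedily pick molecules until every torsion parameter has enough scans.

    mol_params: dict mapping a molecule identifier to the set of torsion
    parameter IDs that molecule exercises (hypothetical input format).
    Returns the selected molecules (in pick order) and the per-parameter counts.
    """
    counts = defaultdict(int)
    all_params = set().union(*mol_params.values()) if mol_params else set()
    remaining = dict(mol_params)
    selected = []

    while remaining:
        # Parameters that still have fewer scans than requested.
        needed = {p for p in all_params if counts[p] < scans_per_param}
        if not needed:
            break
        # Pick the molecule covering the most under-covered parameters.
        # Iterating in sorted order makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda m: len(remaining[m] & needed))
        gain = remaining[best] & needed
        if not gain:
            break  # no remaining molecule adds coverage
        selected.append(best)
        for p in gain:
            counts[p] += 1
        del remaining[best]

    return selected, dict(counts)


if __name__ == "__main__":
    # Toy data: molecule names and parameter IDs are made up.
    toy = {"a": {"t1", "t2"}, "b": {"t1"}, "c": {"t3"}, "d": {"t2", "t3"}}
    print(select_for_coverage(toy, scans_per_param=1))
```

A greedy pass like this does not guarantee the smallest possible set, but it is easy to reason about and makes gaps in coverage visible through the returned counts.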

What will the dataset look like?

We have 200 torsion terms in our FF, and we'd want 5 scans for each torsion, which comes to 1000 torsiondrives

QCArchive submission

Timeline

HJ – Running an optimization takes ~1 day, and it sometimes needs to be run 5 times because of data trouble.

JW – Assuming 2 days per torsion, times 1000 torsions, that is about 2000 days of compute in total.
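
The back-of-the-envelope numbers above can be written out as a small sketch. The 200 torsion terms, 5 scans per term, and 2 days per torsion come from the notes; the function name and the `n_parallel` argument (how many torsiondrives run at once) are assumptions added for illustration, and no actual level of parallelism was stated in the meeting.

```python
def estimate_workload(n_terms, scans_per_term, days_each, n_parallel=1):
    """Return (number of torsiondrives, total compute-days, wall-clock days).

    n_parallel is a hypothetical knob: the number of torsiondrives assumed
    to run concurrently. With n_parallel=1 wall-clock days equal compute-days.
    """
    n_drives = n_terms * scans_per_term
    compute_days = n_drives * days_each
    wall_days = compute_days / n_parallel
    return n_drives, compute_days, wall_days


if __name__ == "__main__":
    # Figures from the notes: 200 torsion terms, 5 scans each, 2 days per torsion.
    drives, compute_days, wall_days = estimate_workload(200, 5, 2)
    print(drives, compute_days, wall_days)
```

Changing `n_parallel` shows how much concurrent capacity would be needed to meet a given submission-to-completion window.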

Jeffrey Wagner will tell John Chodera NOT to submit the 50k dataset, or to submit it at LOW PRIORITY. We will need bandwidth for this submission.
Jessica Maat and Hyesu Jang will coordinate to work together on this in the coming weeks. Target date for submission is March 20th.
Joshua Horton will make a checklist for pre-submission (bond orders requested, CMILES attached)

Ensure all submissions have CMILES attached; the most important identifier is the mapped hydrogen SMILES.
Ensure the WBO is requested for all submissions. This should be included in the scf properties list using the flag wiberg_lowdin_indices.
If any calculations are to be redone from another collection, re-use the old input (coordinates, atom ordering, etc.). This avoids running the calculation again: it just creates new references in the database to the old results, which should help keep the cost of the calculations down.
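
The first two checklist items could be checked automatically before submission. The sketch below works on plain dictionaries and does not call any QCArchive API. The entry layout (`"cmiles"` and `"scf_properties"` keys) and the function name are hypothetical, and the CMILES key name `canonical_isomeric_explicit_hydrogen_mapped_smiles` is assumed to be the identifier meant by "mapped hydrogen SMILES".

```python
# Assumed CMILES key for the mapped hydrogen SMILES.
REQUIRED_CMILES_KEY = "canonical_isomeric_explicit_hydrogen_mapped_smiles"
# Flag named in the notes for requesting Wiberg bond orders.
WBO_FLAG = "wiberg_lowdin_indices"


def check_entry(entry):
    """Return a list of checklist problems for one submission entry.

    entry is a plain dict with hypothetical keys "cmiles" (dict of
    identifiers) and "scf_properties" (list of requested properties).
    An empty list means the entry passes both checks.
    """
    problems = []
    cmiles = entry.get("cmiles") or {}
    if not cmiles.get(REQUIRED_CMILES_KEY):
        problems.append("missing mapped hydrogen SMILES")
    scf_props = entry.get("scf_properties") or []
    if WBO_FLAG not in scf_props:
        problems.append("WBO not requested")
    return problems


if __name__ == "__main__":
    # Example entry with a made-up mapped SMILES for H2.
    example = {
        "cmiles": {REQUIRED_CMILES_KEY: "[H:1][H:2]"},
        "scf_properties": ["dipole", WBO_FLAG],
    }
    print(check_entry(example))
    print(check_entry({"cmiles": {}, "scf_properties": ["dipole"]}))
```

Returning a list of problems rather than raising on the first failure lets a whole dataset be checked in one pass and the issues reported together.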