General updates and discussion on projects
- Torsion splitting in 2.2
  - Still where it was before the OMSF workshop
  - Torsion shapes next steps? (still here, waiting on the item below)
- Fragment dataset curation code
  - Trying to get all torsion parameters (~65 total) covered (see the coverage-check sketch after this item)
  - Found a couple of issues with the database – fragments didn't match any parameters. Went back to matching against full molecules
  - Have a set of SMILES and torsions (specified in SMILES) that match about half of the parameters not previously covered (see the point below)
  - Current stage: for parameters that didn't match any molecules in ChEMBL, will make those by hand and possibly expand with fragments in the database (~28). Anticipate ~a day of work
  - LW: if a parameter takes more than 20–30 minutes, there might not be any good molecules for it, and it may be worth skipping
  - BW: have gone through this already and trimmed out a few parameters that I couldn't match
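A minimal sketch of the coverage check described above, assuming the OpenFF toolkit is installed and using `openff-2.2.0.offxml` as the force field; the SMILES list is a placeholder, not the curated set itself.

```python
# Coverage-check sketch (assumptions: OpenFF toolkit available,
# "openff-2.2.0.offxml" as the force field, placeholder SMILES list).
from collections import defaultdict

from openff.toolkit import ForceField, Molecule

force_field = ForceField("openff-2.2.0.offxml")
candidate_smiles = ["CCO", "c1ccccc1C(=O)N"]  # placeholder molecules

coverage = defaultdict(list)
for smiles in candidate_smiles:
    molecule = Molecule.from_smiles(smiles, allow_undefined_stereo=True)
    # label_molecules returns one dict per molecule, keyed by handler name
    labels = force_field.label_molecules(molecule.to_topology())[0]
    for parameter in labels["ProperTorsions"].values():
        coverage[parameter.id].append(smiles)

all_ids = {p.id for p in force_field.get_parameter_handler("ProperTorsions").parameters}
uncovered = sorted(all_ids - set(coverage))
print(f"{len(uncovered)} of {len(all_ids)} torsion parameters still uncovered: {uncovered}")
```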
- LW: are you currently matching one molecule per torsion, or multiple?
- BW: also checked molecules for charging issues, and am running into Omega conformer generation issues (e.g. [S3+]); a rough sketch of this sanity filter follows this list
- The ChEMBL database is ~12 GB (includes a table of just fragments and a table of molecules; fragments take up most of the space)
- How long does it take to query for parameters?
- eMolecules has offered two datasets: the 50 M "eco-lite" set and the full 3.9 trillion set
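A rough sketch of that sanity filter, going through the OpenFF toolkit wrappers rather than calling Omega directly, so treat it as an approximation of the actual check rather than the curation code itself:

```python
# Approximate sanity filter: keep only molecules that survive conformer
# generation and charge assignment (the toolkit calls out to OpenEye/Omega
# or AmberTools depending on what is installed).
from openff.toolkit import Molecule


def passes_sanity_checks(smiles: str) -> bool:
    try:
        molecule = Molecule.from_smiles(smiles, allow_undefined_stereo=True)
        molecule.generate_conformers(n_conformers=1)  # where e.g. [S3+] species fail
        molecule.assign_partial_charges("am1bcc")
    except Exception:
        return False
    return True


kept = [s for s in ["CCO", "CC(=O)[O-]"] if passes_sanity_checks(s)]
```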
- BW: the database has a number of fields, e.g. elements, to make querying faster (a toy pre-filter sketch follows this list). It also handles duplicates, so if there are a lot of them we can save space – but 3.9 T is a big jump
- BW: it took 24+ hours to fragment and store ChEMBL itself
- LW: does this include throwing out invalid molecules, e.g. ones with radicals?
- BW: no, staying in RDKit for this, so everything is accepted
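A toy illustration of the element-field idea, assuming a hypothetical SQLite layout (a `molecules` table with `smiles` and a space-separated `elements` column); the real database schema may differ.

```python
# Hypothetical element pre-filter: only rows whose stored element set contains
# every element required by the query SMARTS get the (slow) substructure match.
import sqlite3

from rdkit import Chem

torsion_smarts = "[#6:1]-[#6:2]-[#8:3]-[#6:4]"  # placeholder torsion SMARTS
pattern = Chem.MolFromSmarts(torsion_smarts)
required = {a.GetSymbol() for a in pattern.GetAtoms() if a.GetAtomicNum() > 0}

connection = sqlite3.connect("fragments.sqlite")  # hypothetical file name
hits = []
for smiles, elements in connection.execute("SELECT smiles, elements FROM molecules"):
    if not required.issubset(elements.split()):
        continue  # cheap filter: skip rows missing a required element
    molecule = Chem.MolFromSmiles(smiles)
    if molecule is not None and molecule.HasSubstructMatch(pattern):
        hits.append(smiles)
```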
- LW: planning to ask the advisory board for ideas on expanding the benchmark dataset
- BW: would it be worth pulling this out into a package for others to use? Sent Lexie some stuff, but maybe we should pull it out
- LW: let's plan projects and effort allocation over the next year
  - Overall budget: 40 weeks
  - 20% infrastructure: 8 weeks
  - 80% science: 32 weeks
- BW: areas I could work on:
  - Infrastructure: probably best equipped to work on yammbs and qcsubmit, possibly some RDKit stuff in the toolkit. I've also read a lot of ForceBalance code but haven't implemented any features
  - Torsion multiplicity project
  - PDB fraction: I've been interested in trying to quantify which of our parameters are good, which I think relates pretty closely to this
  - besmarts
  - FB replacement + alternative functional forms
  - Dataset curation (~7 weeks)
  - Standardized benchmarking: like I mentioned in the meeting the other day, I'm really interested in a setup where we basically push a button and it tells us whether a new force field is good or not (a toy sketch of such a gate appears at the end of these notes)
    - LW: science or technical side? e.g. automation vs. developing new benchmarks?
    - BW: more technical – e.g. passing a force field through "CI" and getting back results
    - LW: meaning, on the roadmap – YAMMBS; developing new benchmarks, e.g. dimer benchmarking, Chapin's NMR benchmarks, condensed-phase properties, solvation free energies
    - BW: interested in both the science and the technical side. Science side: 3 weeks
- Proposed allocation (40 weeks total):
  - Infrastructure: 8 weeks
  - Torsion multiplicity: 6 weeks
  - PDB fraction: 8 weeks
  - besmarts: 4 weeks
  - smee/alternative functional forms: 4 weeks
  - Dataset curation: 7 weeks
  - Standardized benchmarking: 3 weeks
- Plans for next week?
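As a concrete illustration of the push-button benchmarking idea discussed above, here is a toy benchmark gate. The metrics file name, metric names, and thresholds are all hypothetical placeholders, not an existing YAMMBS interface; a real gate would be wired to whatever summary a benchmarking run actually produces.

```python
# Toy "CI" gate: read summary metrics from a benchmarking run and fail the
# job if any metric regresses past its threshold. All names and values below
# are hypothetical placeholders, not an existing YAMMBS interface.
import json
import sys

THRESHOLDS = {  # hypothetical acceptance limits
    "ddE_rmse_kcal_mol": 2.0,
    "rmsd_angstrom": 0.4,
    "tfd": 0.2,
}

with open("benchmark-summary.json") as handle:  # hypothetical output file
    metrics = json.load(handle)

failures = []
for name, limit in THRESHOLDS.items():
    value = metrics.get(name, float("inf"))  # a missing metric counts as a failure
    if value > limit:
        failures.append(f"{name}: {value:.3f} > {limit}")

if failures:
    print("Force field failed the benchmark gate:\n  " + "\n  ".join(failures))
    sys.exit(1)
print("Force field passed the benchmark gate")
```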