General updates and discussion on projects
- Torsion splitting in 2.2
  - Still where it was before the OMSF workshop
  - Torsion shapes next steps? (still here, waiting on the item below)
- Fragment dataset curation code
  - Trying to get all torsion parameters (~65 total) covered (see the coverage-check sketch after this item)
  - Found a couple of issues with the database – fragments didn't match any parameters. Went back to matching against full molecules
  - Have a set of SMILES and torsions (specified in SMILES) that match about half of the parameters not previously covered (see the point below)
  - Current stage: for parameters that didn't match any molecules in ChEMBL, will make those by hand and possibly expand with fragments in the database (~28). Anticipate ~a day of work
  - LW: if a parameter takes more than 20–30 minutes, there might not be any good molecules for it, and it may be worth skipping
  - BW: have gone through this already and trimmed out a few parameters that I couldn't match
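A minimal sketch of the coverage check described above, assuming the OpenFF toolkit is installed and using `openff-2.2.0.offxml` as the force field; the SMILES list is a placeholder, not the curated set itself.

```python
# Coverage-check sketch (assumptions: OpenFF toolkit available,
# "openff-2.2.0.offxml" as the force field, placeholder SMILES list).
from collections import defaultdict

from openff.toolkit import ForceField, Molecule

force_field = ForceField("openff-2.2.0.offxml")
candidate_smiles = ["CCO", "c1ccccc1C(=O)N"]  # placeholder molecules

coverage = defaultdict(list)
for smiles in candidate_smiles:
    molecule = Molecule.from_smiles(smiles, allow_undefined_stereo=True)
    # label_molecules returns one dict per molecule, keyed by handler name
    labels = force_field.label_molecules(molecule.to_topology())[0]
    for parameter in labels["ProperTorsions"].values():
        coverage[parameter.id].append(smiles)

all_ids = {p.id for p in force_field.get_parameter_handler("ProperTorsions").parameters}
uncovered = sorted(all_ids - set(coverage))
print(f"{len(uncovered)} of {len(all_ids)} torsion parameters still uncovered: {uncovered}")
```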
- LW: are you currently matching one molecule per torsion, or multiple?
- BW: also checked molecules for charging issues, and am running into Omega conformer generation issues (e.g. [S3+]); a rough sketch of this sanity filter follows this list
- The ChEMBL database is ~12 GB (includes a table of just fragments and a table of molecules; fragments take up most of the space)
- How long does it take to query for parameters?
- eMolecules has offered two datasets: the 50 M "eco-lite" set and the full 3.9 trillion set
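A rough sketch of that sanity filter, going through the OpenFF toolkit wrappers rather than calling Omega directly, so treat it as an approximation of the actual check rather than the curation code itself:

```python
# Approximate sanity filter: keep only molecules that survive conformer
# generation and charge assignment (the toolkit calls out to OpenEye/Omega
# or AmberTools depending on what is installed).
from openff.toolkit import Molecule


def passes_sanity_checks(smiles: str) -> bool:
    try:
        molecule = Molecule.from_smiles(smiles, allow_undefined_stereo=True)
        molecule.generate_conformers(n_conformers=1)  # where e.g. [S3+] species fail
        molecule.assign_partial_charges("am1bcc")
    except Exception:
        return False
    return True


kept = [s for s in ["CCO", "CC(=O)[O-]"] if passes_sanity_checks(s)]
```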
- BW: the database has a number of fields, e.g. elements, to make querying faster (a toy pre-filter sketch follows this list). It also handles duplicates, so if there are a lot of them we can save space – but 3.9 T is a big jump
- BW: it took 24+ hours to fragment and store ChEMBL itself
- LW: does this include throwing out invalid molecules, e.g. ones with radicals?
- BW: no, staying in RDKit for this, so everything is accepted
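A toy illustration of the element-field idea, assuming a hypothetical SQLite layout (a `molecules` table with `smiles` and a space-separated `elements` column); the real database schema may differ.

```python
# Hypothetical element pre-filter: only rows whose stored element set contains
# every element required by the query SMARTS get the (slow) substructure match.
import sqlite3

from rdkit import Chem

torsion_smarts = "[#6:1]-[#6:2]-[#8:3]-[#6:4]"  # placeholder torsion SMARTS
pattern = Chem.MolFromSmarts(torsion_smarts)
required = {a.GetSymbol() for a in pattern.GetAtoms() if a.GetAtomicNum() > 0}

connection = sqlite3.connect("fragments.sqlite")  # hypothetical file name
hits = []
for smiles, elements in connection.execute("SELECT smiles, elements FROM molecules"):
    if not required.issubset(elements.split()):
        continue  # cheap filter: skip rows missing a required element
    molecule = Chem.MolFromSmiles(smiles)
    if molecule is not None and molecule.HasSubstructMatch(pattern):
        hits.append(smiles)
```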
- LW: planning to ask the advisory board for ideas on expanding the benchmark dataset
- BW: would it be worth pulling this out into a package for others to use? Sent Lexie some stuff, but maybe we should pull it out
- LW: let's plan projects and effort allocation over the next year
  - Overall budget: 40 weeks
  - 20% infrastructure: 8 weeks
  - 80% science: 32 weeks
- BW: areas I could work on:
  - Infrastructure: probably best equipped to work on yammbs and qcsubmit, possibly some RDKit stuff in the toolkit. I've also read a lot of ForceBalance code but haven't implemented any features
  - Torsion multiplicity project
  - PDB fraction: I've been interested in trying to quantify which of our parameters are good, which I think relates pretty closely to this
  - besmarts
  - FB replacement + alternative functional forms
  - Dataset curation (~7 weeks)
  - Standardized benchmarking: like I mentioned in the meeting the other day, I'm really interested in a setup where we basically push a button and it tells us whether a new force field is good or not (a toy sketch of such a gate appears at the end of these notes)
    - LW: science or technical side? e.g. automation vs. developing new benchmarks?
    - BW: more technical – e.g. passing a force field through "CI" and getting back results
    - LW: meaning, on the roadmap – YAMMBS; developing new benchmarks, e.g. dimer benchmarking, Chapin's NMR benchmarks, condensed-phase properties, solvation free energies
    - BW: interested in both the science and the technical side. Science side: 3 weeks
- Proposed allocation (40 weeks total):
  - Infrastructure: 8 weeks
  - Torsion multiplicity: 6 weeks
  - PDB fraction: 8 weeks
  - besmarts: 4 weeks
  - smee/alternative functional forms: 4 weeks
  - Dataset curation: 7 weeks
  - Standardized benchmarking: 3 weeks
- Plans for next week?
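As a concrete illustration of the push-button benchmarking idea discussed above, here is a toy benchmark gate. The metrics file name, metric names, and thresholds are all hypothetical placeholders, not an existing YAMMBS interface; a real gate would be wired to whatever summary a benchmarking run actually produces.

```python
# Toy "CI" gate: read summary metrics from a benchmarking run and fail the
# job if any metric regresses past its threshold. All names and values below
# are hypothetical placeholders, not an existing YAMMBS interface.
import json
import sys

THRESHOLDS = {  # hypothetical acceptance limits
    "ddE_rmse_kcal_mol": 2.0,
    "rmsd_angstrom": 0.4,
    "tfd": 0.2,
}

with open("benchmark-summary.json") as handle:  # hypothetical output file
    metrics = json.load(handle)

failures = []
for name, limit in THRESHOLDS.items():
    value = metrics.get(name, float("inf"))  # a missing metric counts as a failure
    if value > limit:
        failures.append(f"{name}: {value:.3f} > {limit}")

if failures:
    print("Force field failed the benchmark gate:\n  " + "\n  ".join(failures))
    sys.exit(1)
print("Force field passed the benchmark gate")
```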