2024-06-11 BW/LW Check-in

Participants

  • @Brent Westbrook (Unlicensed)

  • @Lily Wang

Goals

  •  

Discussion topics

Item

Presenter

Notes

General updates and discussion on projects

 

  • Torsion splitting in 2.2

    • Still where it was before the OMSF workshop

    • Torsion shapes next steps?

      • Comparing functional form to QM data

    • (still at this stage, waiting on the fragment dataset curation below)

  • Fragment dataset curation code

    • Trying to get all torsion parameters (~65 total) covered

    • Found a couple of issues with the database: fragments didn’t match any parameters, so went back to matching against full molecules

    • Have a set of SMILES, and the torsions within those SMILES, that match ~half of the parameters; the rest are not covered yet (see next point)

    • Current stage: for the parameters that didn’t match any molecules in ChEMBL (~28), will make molecules by hand and possibly expand with fragments in the database

      • Anticipate ~1 day

      • LW: if this takes more than 20-30 min for a parameter, there might not be any good molecules for it, so possibly skip it

      • BW: have gone through this already and trimmed out a few parameters that I couldn’t match

    • LW: are you currently matching one molecule per torsion or multiple?

      • BW: currently 1, except in a few cases where the molecules it was matching looked quite different

    • BW: also checked molecules for charging issues, and am running into OMEGA conformer generation issues (e.g. [S3+])

    • ChEMBL database is ~12 GB (includes a table of just fragments and a table of molecules; fragments take up most of the space)

      • InChIKeys are computed for the molecules

    • How long does it take to query for parameters?

      • 1-1.5 hours

      • ~2 million molecules or so

    • eMolecules has offered two datasets:

      • 50 M “eco-lite” set

      • full 3.9 trillion set

    • BW: the database has a number of fields, e.g. elements, to make querying faster. It also handles duplicates, so if there are lots, we can save space, but 3.9 T is a big jump

    • BW: it took 24+ hours to fragment and store ChEMBL itself.

      • LW: does this include throwing out invalid molecules e.g. with radicals?

      • BW: no, staying in RDKit for this, so everything is accepted

    • LW: planning to ask advisory board for ideas on expanding the benchmark dataset

    • BW: would it be worth pulling this out into a package for others to use? Sent Lexie some stuff but maybe we should pull it out

      • #ff-fitting?

  • LW: let’s plan projects and effort allocation over the next year

    • Overall budget: 40 weeks

      • 20% infrastructure: 8 weeks

      • 80% science: 32 weeks

** infrastructure
probably best equipped to work on
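
For reference, the database points above (an element field to make querying faster, duplicate handling, InChIKeys on molecules) could look roughly like the sketch below. This is a minimal illustration using Python's built-in sqlite3, not BW's actual code: the table layout, the ELEMENTS list, the bitmask encoding, and the function names are all assumptions, and the InChIKey and element set are assumed to be computed upstream (e.g. with RDKit) and passed in.

```python
import sqlite3

# Hypothetical element list; bit i of the mask is set if ELEMENTS[i] is present.
ELEMENTS = ["H", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"]
BIT = {symbol: 1 << i for i, symbol in enumerate(ELEMENTS)}


def element_mask(symbols):
    """Pack a collection of element symbols into an integer bitmask."""
    mask = 0
    for symbol in symbols:
        mask |= BIT[symbol]
    return mask


def create_store(path=":memory:"):
    """Open (or create) the store; the schema here is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS molecules ("
        " inchikey TEXT PRIMARY KEY,"
        " smiles TEXT NOT NULL,"
        " elements INTEGER NOT NULL)"
    )
    return conn


def add_molecule(conn, inchikey, smiles, symbols):
    """Insert a molecule; return False if its InChIKey was already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO molecules VALUES (?, ?, ?)",
        (inchikey, smiles, element_mask(symbols)),
    )
    return cur.rowcount == 1


def candidates(conn, required_symbols):
    """Return SMILES of molecules containing all required elements.

    This is only a cheap prefilter; the expensive substructure match
    against a torsion parameter would run on the returned rows.
    """
    need = element_mask(required_symbols)
    rows = conn.execute(
        "SELECT smiles FROM molecules WHERE (elements & ?) = ? ORDER BY smiles",
        (need, need),
    )
    return [smiles for (smiles,) in rows]
```

The primary key on the InChIKey is what drops duplicates on insert, and the element mask lets a query for, say, a sulfur-containing torsion skip every molecule without sulfur before any pattern matching is attempted.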

 

 

 

Action items

Decisions