2024-12-04 Mitchell/Wagner Check-in meeting notes

Participants

  • @Josh Mitchell

  • @Jeffrey Wagner

Discussion topics

Item

Notes

Proteinbenchmark

 

General updates

  • JW

    • MT is working with LW on getting evaluator running smoothly on Kubernetes, and I’ve told him to reach out to you if he has questions.

  • JM

    • Offline next week

    • Thinking about the experience of discovering OpenFF - I’d like to prominently display a list/table of what we can simulate. E.g., directly say “if you have an organic small molecule, we can simulate it”, and have that link to the relevant datasets and benchmarks. Possibly have a piece of metadata associated with each FF that describes what it’s recommended for, e.g. conjugated systems.

      • JW – Good idea, not sure if it’s top priority compared to everything else but I’ll add to team backlog.

      • JM – Even communicating that we have a product would be an improvement.

Loading census

  • JM – Tested first 1,000 and MDAnalysis can handle about 25% of them. Almost all errors were too-many-bonds (Hs and Cs). Others were toolkit choking on radicals. Might be that it’s a challenging test set (rough initial geometries from heavy atom replacement and hydrogen addition). Given that 75% had error-raising problems, I wonder how much of the 25% we can trust. So I’m updating my loader to have the reference (protonation states, etc) to compare against.

  • JW – Hm, so even if we get rid of 90% of the loading errors by minimizing/doing process optimizations, we’ll still have 7.5% errors, which is way too high. Do you know what % the toolkit could load (ie, how many are totally vanilla proteins)?

    • JM – Haven’t checked yet.

  • (General) – The MDA loader isn’t a clear winner here.

  • JW – I need to figure out:

    • How much of the census we want to continue doing/what to report to boards

    • How to structure/prioritize the josh-loader

      • Start with project page

      • Behavior specification/benchmarks/tests/performance expectations

      • Will this offset workshops before annual meeting? Is that worth it?

    • JM – Target API is:

      class Topology:
          def from_pdb(
              file: PathLike | TextIO,
              use_canonical_names: bool = False,
              unique_molecules: list[Molecule] = [],
              residue_database: Mapping[
                  str, list[ResidueDefinition]
              ] = CCD_RESIDUE_DEFINITION_CACHE,
          ) -> Topology:

      class ResidueDefinition:
          def from_smiles(
              resname: str,
              mapped_smiles: str,
              atom_names: Mapping[int, str],
          ) -> ResidueDefinition
          def from_molecule()
          def from_capped_molecule()  # For AAs, nucleotides, etc

    • JW – I’ll need to think a little bit about the ResidueDefinition creation pathway - not sure what I’d expect users to have on hand for their NCAAs (non-canonical amino acids).

    • Test set design

      • The PDBFixer set is good

      • Be sure to include strained conformations

      • Include PDBs from as many different software packages as possible (Amber, GROMACS, PyMOL, OpenMM, …)

      • Capped and uncapped (both neutral and charged) AAs

    • Behavior/specification

      • Support residue/atom name “synonyms”? Eg loading AMBER atom/residue names?

      • Pre-populated residue library (to prevent airgapped computers from having connectivity issues?) Or would this get too big (hundreds of MB)?

        • JM – It’s very likely to be too large, though we could ship a common subset.

        • JW – Could print an error message in this case.

      • mmCIF and PDBx

  • All edge cases:

    • Missing atoms compared to all residue definitions for resname

      • If they’re leaving atoms

        • If it’s a linking residue template

          • If the residues on both sides of the bond are missing their linking leaving atoms and have matching linking type (_chem_comp.type matches)

            • Succeed by linking residues without changing formal charges

          • … NOT …

            • Error

        • If it’s NOT a linking residue template

          • Error

      • If they’re NOT leaving atoms

        • Error

    • Extra atoms compared to all residue definitions

      • Error

    • No match for residue name

      • NO CONECT records

        • Default error

        • Nice to have but not initial goal: user opt-in guess

      • If there are CONECT records

        • If there’s an isomorphic (ignoring bond order) molecule graph in unique_molecules

          • Succeed

        • (Nice to have, otherwise can make all this a separate API point) If additional substructures can assign bond orders and formal charges to ALL atoms and bonds that weren’t assigned by resname

          • If ALL atoms+bonds that overlap with previous assignment by resname get identical info

            • Succeed

          • If ANY atoms+bonds disagree or are left unassigned

            • Error (or maybe in the future fall back to the guesser)

        • Else

          • Default error

          • Nice to have but not initial goal: user opt-in (better) guess

    • Stereo is unspecified in residue template

      • ALWAYS ignore template stereo and assign from 3D coordinates (RDKit can do this)

    • Residue template has aromatic bond(s)

      • Error (components.cif doesn’t appear to actually use this)

    • Residue name and atom names match the template, but formal charge, element, or CONECT record does not match

      • Error as informatively as possible

    • Multiple residue definitions match a residue (like, two identical ALAs are defined)

      • JM – Would prefer to take first one (this is quick because it stops the search once any match is found)

      • JW – Maybe raise an error or ensure that no two identical graphs can be defined for a resname?

      • We’ll pick back up here
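The missing-atom branch of the edge cases above is the most deeply nested, so here is a minimal Python sketch of it as a single decision function. Everything in it is hypothetical and for illustration only: `Outcome`, `MissingAtomCase`, `resolve_missing_atoms`, and all field names are made up here and are not part of the target API. It also assumes that "matching linking type" means the two `_chem_comp.type` strings are equal, which may be stricter than what we end up wanting.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    """Possible results for a residue that is missing template atoms."""

    LINK = "link residues without changing formal charges"
    ERROR = "error"


@dataclass(frozen=True)
class MissingAtomCase:
    """Hypothetical summary of one residue that lacks atoms from its template."""

    # True when every missing atom is flagged as a leaving atom in the template.
    all_missing_are_leaving_atoms: bool
    # True when the matched template is a linking residue template.
    template_is_linking: bool
    # True when the residue across the bond is also missing its linking leaving atoms.
    neighbor_missing_leaving_atoms: bool
    # _chem_comp.type of this residue and of the bonded neighbor.
    chem_comp_type: str
    neighbor_chem_comp_type: str


def resolve_missing_atoms(case: MissingAtomCase) -> Outcome:
    """Walk the missing-atom decision tree from the notes, top to bottom."""
    if not case.all_missing_are_leaving_atoms:
        return Outcome.ERROR
    if not case.template_is_linking:
        return Outcome.ERROR
    # Assumption: "matching linking type" is modeled as plain string equality.
    types_match = case.chem_comp_type == case.neighbor_chem_comp_type
    if case.neighbor_missing_leaving_atoms and types_match:
        return Outcome.LINK
    return Outcome.ERROR


if __name__ == "__main__":
    peptide = "L-PEPTIDE LINKING"
    mid_chain = MissingAtomCase(True, True, True, peptide, peptide)
    print(resolve_missing_atoms(mid_chain).value)
```

Only one path succeeds and every other path errors, which fits the "error as informatively as possible" preference elsewhere in the list. A real implementation would presumably carry enough context in each error to name the residue and the offending atoms.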

Trello

https://trello.com/b/dzvFZnv4/infrastructure

Action items

Decisions