2024-12-04 Mitchell/Wagner Check-in meeting notes

Participants

Discussion topics

Item	Notes
Proteinbenchmark
General updates	JW
	- MT is working with LW on getting evaluator running smoothly on Kubernetes, and I’ve told him to reach out to you if he has questions.
	JM
	- Offline next week
	- Thinking about the experience of discovering OpenFF: I’d like to prominently display a list/table of what we can simulate.
	  - Ex: directly say “if you have an organic small molecule, we can simulate it”, and have that link to the datasets and benchmarks and stuff.
	  - Possibly have a piece of metadata associated with each FF that describes what it’s recommended for, etc. Ex: conjugated systems
	  - JW – Good idea, not sure if it’s top priority compared to everything else but I’ll add to team backlog.
	  - JM – Even communicating that we have a product would be an improvement.
Loading census	JM – Tested first 1,000 and MDAnalysis can handle about 25% of them. Almost all errors were too-many-bonds (Hs and Cs). Others were toolkit choking on radicals. Might be that it’s a challenging test set (rough initial geometries from heavy atom replacement and hydrogen addition). Given that 75% had error-raising problems, I wonder how much of the 25% we can trust. So I’m updating my loader to have the reference (protonation states, etc) to compare against.
	JW – Hm, so even if we get rid of 90% of the loading errors by minimizing/doing process optimizations, we’ll still have 7.5% errors, which is way too high. Do you know what % the toolkit could load (ie, how many are totally vanilla proteins)?
	JM – Haven’t checked yet.
	(General) – The MDA loader isn’t a clear winner here.
	JW – I need to figure out:
	- How much of the census we want to continue doing/what to report to boards
	- How to structure/prioritize the josh-loader
	  - Start with project page
	    - Behavior specification/benchmarks/tests/performance expectations
	  - Will this offset workshops before annual meeting? Is that worth it?
	JM – Target API is:
	    class Topology:
	        def from_pdb(
	            file: PathLike | TextIO,
	            use_canonical_names: bool = False,
	            unique_molecules: list[Molecule] = [],
	            residue_database: Mapping[
	                str, list[ResidueDefinition]
	            ] = CCD_RESIDUE_DEFINITION_CACHE,
	        ) -> Topology:

	    class ResidueDefinition:
	        def from_smiles(
	            resname: str,
	            mapped_smiles: str,
	            atom_names: Mapping[int, str],
	        ) -> ResidueDefinition
	        def from_molecule()
	        def from_capped_molecule()  # For AAs, nucleotides, etc
	JW – I’ll need to think a little bit about the residuedefinition creation pathway - Not sure what I’d expect users to have for their NCAA.
	Test set design
	- The PDBFixer set is good
	- Be sure to include strained conformations
	- Include pdbs from as many different software packages as possible (Amber, GROMACS, pymol, OpenMM, …)
	- Capped and uncapped (both neutral and charged) AAs
	Behavior/specification
	- Support residue/atom name “synonyms”?
	  - Eg loading AMBER atom/residue names?
	- Pre-populated residue library (to prevent airgapped computers from having connectivity issues?)
	  - Or would this get too big (hundreds of MB)?
	  - JM – It’s very likely to be too large, though we could ship a common subset.
	  - JW – Could print an error message in this case.
	- mmCIF and PDBx
	All edge cases:
	- Missing atoms compared to all residue definitions for resname
	  - If they’re leaving atoms
	    - If it’s a linking residue template
	      - If the residues on both sides on the bond are missing their linking leaving atoms and have matching linking type (`_chem_comp.type` matches)
	        - Succeed by linking residues without changing formal charges
	      - … NOT …
	        - Error
	    - If it’s NOT a linking residue template
	      - Error
	  - If they’re NOT leaving atoms
	    - Error
	- Extra atoms compared to all residue definitions
	  - Error
	- No match for residue name
	  - NO CONECT records
	    - Default error
	    - Nice to have but not initial goal: user opt-in guess
	  - If there are CONECT records
	    - If there’s an isomorphic (ignoring BO) molecule graph in `unique_molecules`
	      - Succeed
	    - (Nice to have, otherwise can make all this a separate API point) If additional substructures can assign bond orders and formal charges to ALL atoms and bonds that weren’t assigned by resname
	      - If ALL atoms+bonds that overlap with previous assignment by resname get identical info
	        - Succeed
	      - If ANY atoms+bonds disagree or are left unassigned
	        - Error (or maybe in the future fall back to the guesser)
	    - Else
	      - Default error
	      - Nice to have but not initial goal: user opt-in (better) guess
	- Stereo is unspecified in residue template
	  - ALWAYS ignore template stereo and assign from 3d (RDKit can do this)
	- Residue template has aromatic bond(s)
	  - Error (components.cif doesn’t appear to actually use this)
	- Residue name and atom name matches template, but formal charge, element, or CONECT record does not match
	  - Error as informatively as possible
	- Multiple residue definitions match a residue (like, two identical ALAs are defined)
	  - JM – Would prefer to take first one (this is quick because it stops the search once any match is found)
	  - JW – Maybe raise an error or ensure that no two identical graphs can be defined for a resname?
	- … We’ll pick back up here
Trello	https://trello.com/b/dzvFZnv4/infrastructure
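
Illustrative sketch of the missing/extra atom branch of the edge cases listed under Loading census. This was not written in the meeting and is not the OpenFF API: `ResidueTemplate`, `ResidueMatchError`, `check_residue_atoms`, and `can_link` are hypothetical names, and the `ALA` atom sets are toy data chosen only to exercise each branch. It assumes a template knows its atom names, which of them are leaving atoms, and a linking type that mirrors `_chem_comp.type`.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


class ResidueMatchError(ValueError):
    """Hypothetical error type for residues that fail template matching."""


@dataclass(frozen=True)
class ResidueTemplate:
    """Toy stand-in for a residue definition (not the real OpenFF class)."""

    resname: str
    atom_names: frozenset[str]
    leaving_atoms: frozenset[str] = frozenset()
    # Mirrors `_chem_comp.type`; None marks a non-linking template.
    linking_type: Optional[str] = None


def check_residue_atoms(observed: set[str], template: ResidueTemplate) -> frozenset[str]:
    """Return the missing leaving atoms if the residue is acceptable, else raise."""
    # Extra atoms compared to the definition: error.
    extra = set(observed) - template.atom_names
    if extra:
        raise ResidueMatchError(f"{template.resname}: extra atoms {sorted(extra)}")

    missing = template.atom_names - set(observed)
    if not missing:
        return frozenset()
    # Missing atoms that are NOT leaving atoms: error.
    if not missing <= template.leaving_atoms:
        not_leaving = sorted(missing - template.leaving_atoms)
        raise ResidueMatchError(
            f"{template.resname}: missing non-leaving atoms {not_leaving}"
        )
    # Missing leaving atoms on a template that is NOT a linking template: error.
    if template.linking_type is None:
        raise ResidueMatchError(
            f"{template.resname}: atoms missing from a non-linking template"
        )
    return frozenset(missing)


def can_link(
    left_missing: frozenset[str],
    left: ResidueTemplate,
    right_missing: frozenset[str],
    right: ResidueTemplate,
) -> bool:
    """Both neighbors must lack leaving atoms and share the same linking type."""
    return (
        bool(left_missing)
        and bool(right_missing)
        and left.linking_type is not None
        and left.linking_type == right.linking_type
    )


# Toy template: heavy atoms only, with made-up leaving-atom flags.
ALA = ResidueTemplate(
    resname="ALA",
    atom_names=frozenset({"N", "CA", "C", "O", "CB", "OXT", "H2"}),
    leaving_atoms=frozenset({"OXT", "H2"}),
    linking_type="L-PEPTIDE LINKING",
)

mid_chain_missing = check_residue_atoms({"N", "CA", "C", "O", "CB"}, ALA)
```

The "first match vs. error" question for duplicate definitions is left out here, since that point was still open when the discussion paused.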

Action items

Decisions