2021-05-13 API Refactor check-in Meeting notes

Date

13 May 2021

Participants

Goals

Advance the API refactor: perceiving hierarchies and substructure matching.

Discussion topics

Item

Notes

Subgraph matching resources

SDF → perceive_residues

Essential bits covered in the amber_ff_porting scripts that we’ve made in previous meetings.
To apply the substructure library, I suggest Molecule.chemical_environment_matches → This does SMARTS matching and should be pretty directly suitable for assigning atom metadata
- For SMARTS matching with substructures that may overlap, see LibraryCharges.create_force
Differentiating caps? CYX vs. CYS?
- In general, there will always be atoms that could be matched by multiple substructures. So we need to somehow order/prioritize the matches from least to most specific (like how we do in the force field).
  - Could do this manually
  - Could automate this
    - Alanine looks like a methyl cap
      - put methyl cap BEFORE alanine in substructure dictionary
    - CYX is a substructure of CYS
      - CYX should go BEFORE CYS
    - General rule could be “smaller substructure always comes before larger substructure”
      - Could sort by number of atoms → Let’s do this initially
      - Could determine ordering by doing a substructure search of every substructure in every other substructure – If substructure 1 is inside of substructure 2, then 1 must come before 2 in the dictionary
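The two ordering options above can be sketched as follows. This is a minimal illustration, not toolkit code: the library entries and atom counts are made up, and the containment pairs are assumed to come from some earlier substructure search that is not shown here.

```python
from graphlib import TopologicalSorter

# Hypothetical substructure library: name -> number of atoms in the pattern.
# Names and counts are illustrative only.
atom_counts = {"ALA": 10, "CYS": 11, "CYX": 10, "methyl_cap": 4}


def order_by_size(counts):
    """Option 1: smallest substructure first; ties broken by name for determinism."""
    return sorted(counts, key=lambda name: (counts[name], name))


def order_by_containment(names, contained_in):
    """Option 2: every substructure comes before any substructure that contains it.

    `contained_in` is an iterable of (small, large) pairs, assumed to be the
    result of searching every substructure inside every other substructure.
    """
    sorter = TopologicalSorter({name: set() for name in names})
    for small, large in contained_in:
        # Declare `small` as a predecessor of `large`, so it is emitted first.
        sorter.add(large, small)
    return list(sorter.static_order())


size_order = order_by_size(atom_counts)
containment_order = order_by_containment(
    atom_counts, [("methyl_cap", "ALA"), ("CYX", "CYS")]
)
```

Sorting by size is cheap but only approximates containment; the topological sort respects it exactly, at the cost of the pairwise searches.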
In the long run, it may be good to harvest these substructures from CCD, but it’s not clear how easy that will be.
- components.cif has alanine (comp_id=ALA), but it is capped with hydrogens – Is there some pattern we could use to find all the amino acids in their “main chain” form and extract their substructure dictionary from the CCD?
- Also, is there a general substructure dictionary format that’s standard in the field? If so, we should use that.

PDB → OFFMol w/ bond orders and charges

Existing Topology.from_openmm uses networkx to do complete structure matching – Can modify this to do SUBstructure matching
- Topology.from_openmm:
- Molecule.are_isomorphic:
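To illustrate the difference between complete-structure matching and substructure matching, here is a small dependency-free sketch. It is a stand-in for the networkx-based matching mentioned above, not the toolkit's implementation; the function name and the graph representation (element lists plus bond index pairs) are assumptions, and it compares elements and connectivity only, ignoring bond orders.

```python
def find_substructure_match(query_elements, query_bonds, target_elements, target_bonds):
    """Return one mapping {query_atom: target_atom}, or None if nothing matches.

    The query may be smaller than the target, which is what distinguishes
    substructure matching from complete-structure matching.
    """

    def adjacency(n_atoms, bonds):
        adj = {i: set() for i in range(n_atoms)}
        for a, b in bonds:
            adj[a].add(b)
            adj[b].add(a)
        return adj

    q_adj = adjacency(len(query_elements), query_bonds)
    t_adj = adjacency(len(target_elements), target_bonds)

    def extend(mapping):
        if len(mapping) == len(query_elements):
            return dict(mapping)
        q = len(mapping)  # query atoms are mapped in index order
        for t in range(len(target_elements)):
            if t in mapping.values() or target_elements[t] != query_elements[q]:
                continue
            # Every bond from q to an already-mapped query atom must exist in the target.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                result = extend(mapping)
                if result is not None:
                    return result
                del mapping[q]  # backtrack
        return None

    return extend({})


# Query: a C-O fragment. Target: the heavy atoms of ethanol (C-C-O).
match = find_substructure_match(["C", "O"], [(0, 1)], ["C", "C", "O"], [(0, 1), (1, 2)])
no_match = find_substructure_match(["C", "S"], [(0, 1)], ["C", "C", "O"], [(0, 1), (1, 2)])
```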
Should we ever trust residue/atom names that are being read from the PDB file?
- JW – I think we shouldn’t. This would invite a lot of complexity and guessing around user intent.
- LW – could do what OpenMM does, which is raise an error like “I don’t know what this thing is but it almost looks like ARG”
  - JW – Should we match PDB substructures based on connectivity, or metadata/names?
    - When we load from PDB, two molecule representations will be made:
      - a placeholder OFFMol will be created, with one atom for each atom in the PDB. These OFFMol atoms will immediately gain all of the metadata from the PDB atoms (atom name, residue name, residue number)
      - A networkX graph of the entire PDB will be created, and matched to our substructure library to figure out the elements, bond orders, stereochemistry, and formal charges. The information from this matching will then go and modify the OFFMol from the last bullet point with the new info, and this networkX graph of the protein will be deleted.
    - We will try to match PDBs using their connectivity
- IP – Will users from other communities have PDB files that don’t have e.g. residues? For example, do materials scientists have PDBs with monoatomic ions where the PDB data does not fully encode their meaning?
  - In these cases, they’ll need to do TypedMolecule.from_file('xxx.pdb')
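The two-representation loading flow described above could look roughly like this. It is a sketch under assumptions: the class and function names are hypothetical, the match mapping is taken as given (it would come from the connectivity matching step), and only formal charges and bond orders are filled in, with stereochemistry left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlaceholderAtom:
    # Metadata copied straight from the PDB record.
    name: str
    residue_name: str
    residue_number: int
    # Chemistry that is unknown until a substructure has been matched.
    formal_charge: Optional[int] = None


def apply_match(atoms, bond_orders, template_charges, template_bond_orders, mapping):
    """Copy chemistry from a matched template onto the placeholder molecule.

    `mapping` maps template atom index -> placeholder atom index.
    `bond_orders` maps a sorted (i, j) tuple -> bond order, None while unknown.
    """
    for template_index, charge in enumerate(template_charges):
        atoms[mapping[template_index]].formal_charge = charge
    for (a, b), order in template_bond_orders.items():
        key = tuple(sorted((mapping[a], mapping[b])))
        bond_orders[key] = order


# Placeholder built from PDB metadata only.
atoms = [
    PlaceholderAtom("CA", "ALA", 1),
    PlaceholderAtom("C", "ALA", 1),
    PlaceholderAtom("O", "ALA", 1),
]
bond_orders = {(0, 1): None, (1, 2): None}

# Hypothetical two-atom carbonyl template, matched onto placeholder atoms 1 and 2.
apply_match(atoms, bond_orders, [0, 0], {(0, 1): 2}, {0: 1, 1: 2})
```

Once the template information has been copied over, the temporary graph used for matching can be discarded, and the PDB metadata on the placeholder atoms is untouched.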
What if there are multiple molecules in the PDB file?
- Molecule.from_pdb_file(source) → List[Molecule]
  - If ANY molecules in the PDB fail to load, then nothing is returned
- If they’re both proteins:
  - Return a list of molecules
- If it’s a protein and a ligand:
  - This will raise an UnrecognizedSubstructureError by default. If the user really wants to be dangerous they could add the ligand to the substructure library, but we wouldn’t guarantee correctness.
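The all-or-nothing behaviour described above can be sketched like this. The helper name `load_all_or_nothing` and the `perceive` callable are hypothetical; `perceive` stands in for whatever substructure matching ends up being used, and is assumed to return None when a molecule cannot be recognized.

```python
class UnrecognizedSubstructureError(Exception):
    """Raised when part of the input matches nothing in the substructure library."""


def load_all_or_nothing(records, perceive):
    """Return one molecule per record, or raise if ANY record fails to load."""
    molecules = []
    for index, record in enumerate(records):
        molecule = perceive(record)
        if molecule is None:
            # Raising here means no partial list ever reaches the caller.
            raise UnrecognizedSubstructureError(
                f"molecule {index} contains an unrecognized substructure"
            )
        molecules.append(molecule)
    return molecules


# Toy stand-in for the substructure library: known inputs map to loaded molecules.
known = {"protein_a": "mol_a", "protein_b": "mol_b"}
two_proteins = load_all_or_nothing(["protein_a", "protein_b"], known.get)
```

Under this sketch, a user who adds their ligand to the library would make `perceive` succeed for it, which matches the "dangerous but possible" escape hatch in the notes.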
What if the user WANTS to load a modified AA, and can tell us about it?
- Users could have their own libraries/extend our library
  - Big question here is which format they’d enter their new substructures in
  - IP will do another research cycle on reading substructures from CCD (basically “can we populate our canonical AA substructures from something in CCD?”)
In general, what should happen if a user runs Molecule.from_pdb_file and there’s an unrecognized substructure in the input?
- Options could include:
  - An error like OpenMM – At least saying “there’s an unrecognized substructure that is made of these atom indices”, and even better would be “there’s an unrecognized substructure, but it almost matches XYZ”
  - ~~A warning (and then it returns ???)~~
  - ~~Could just load a TypedMolecule~~
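One way to produce the "it almost matches XYZ" hint is sketched below. Comparing element compositions is an assumption made for illustration; it is not how OpenMM builds its message, and a real implementation would more likely compare connectivity. The library here holds heavy-atom compositions for two residues only.

```python
from collections import Counter

# Illustrative library: residue name -> heavy-atom element counts.
library = {
    "ARG": {"C": 6, "N": 4, "O": 1},
    "GLY": {"C": 2, "N": 1, "O": 1},
}


def closest_entry(fragment_elements, library):
    """Return (name, n_differing_atoms) for the most similar library entry."""
    fragment = Counter(fragment_elements)

    def distance(name):
        entry = Counter(library[name])
        # Counter subtraction keeps only positive counts, so add both directions.
        return sum((fragment - entry).values()) + sum((entry - fragment).values())

    best = min(sorted(library), key=distance)
    return best, distance(best)


def unrecognized_message(atom_indices, fragment_elements, library):
    name, n_diff = closest_entry(fragment_elements, library)
    return (
        f"Unrecognized substructure made of atom indices {atom_indices}; "
        f"it almost matches {name} ({n_diff} atom(s) differ)"
    )


# A fragment that looks like arginine with one nitrogen missing.
fragment = ["C"] * 6 + ["N"] * 3 + ["O"]
message = unrecognized_message(list(range(10)), fragment, library)
```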
Can the substructure dictionary above help us go from PDB → SDF?
- In general, should our design only have one substructure database, or two?
- What if we had a protein with a mix of D and L amino acids? The cheminformatics substructure would know the difference, but the PDB substructures wouldn’t.
- How will we tell the difference between serine and cysteine? Only difference is O → S
- How will we tell the difference between methionine and selenomethionine (S → Se)?
- LW – could mandate that element is required.
  - LW – I tried loading a element-less PDB file using OpenMM and it correctly guessed the elements (probably using atom name)
  - (LW tried loading https://files.rcsb.org/download/6CPZ.pdb using the same method, to see whether it distinguishes between Se and S in the selenomethionine)
    - If we remove the element column, OpenMM cannot tell the difference between sulfur and selenium
- So, we need the element column to be required.
- The best outcome will be reading PDB/PDBx/mmCIF using RDKit (and possibly also OpenEye)
  - For now let’s aim to load PDB using RDKit. The initial plan will be to not support mmCIF until it’s in RDKit, though if we have extra dev time, we can try making MDAnalysis/MDTraj a soft dependency and allow users to try loading mmCIF with those.
- The next best outcome would be to use OpenMM/MDAnalysis/MDTraj
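The decision above to require the element column can be sketched as a check on the fixed-width PDB record, where the element symbol occupies columns 77-78. The function and exception names are hypothetical, and this hand-rolled parser is only an illustration; the plan in the notes is to read PDB files through RDKit.

```python
class MissingElementError(ValueError):
    """Raised when an ATOM/HETATM record has a blank element column."""


def parse_atom_record(line):
    """Parse a few fixed-width fields from a PDB ATOM/HETATM record.

    Column numbers below are 1-based, as in the PDB format description.
    """
    element = line[76:78].strip()  # columns 77-78
    if not element:
        raise MissingElementError(
            "element column (77-78) is blank; elements are not guessed from atom names"
        )
    return {
        "atom_name": line[12:16].strip(),     # columns 13-16
        "residue_name": line[17:20].strip(),  # columns 18-20
        "residue_number": int(line[22:26]),   # columns 23-26
        "element": element.capitalize(),      # "SE" -> "Se"
    }
```

With the element read explicitly, sulfur in methionine and selenium in selenomethionine stay distinguishable even though the surrounding connectivity is identical.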

Next steps

Beginning implementation
- Should start building on a branch of the openff toolkit (not a fork)
  - This can follow the “major feature branch” development pattern, where we target PRs into a branch instead of into master. This way, we can review the changes in small bits (and skip review when Jeff’s busy), and not have to review the final product in its entirety before merging to master.
  - Eg:
- Squash merge!! each PR into the feature branch. This is so that, if we decide to make major design changes midway through, but don’t want to throw everything away, we can cherry-pick entire PRs into wherever we start working next.

Action items

Iván Pulido Do another search on reading substructures from CCD.
