Progress on the API refactor: perceiving hierarchies and matching substructures.
Discussion topics
Item
Notes
Subgraph matching resources
SDF → perceive_residues
Essential bits covered in the amber_ff_porting scripts that we’ve made in previous meetings.
To apply the substructure library, I suggest Molecule.chemical_environment_matches → this does SMARTS matching and should be fairly directly suitable for assigning atom metadata
For SMARTS matching with substructures that may overlap, see LibraryCharges.create_force
Differentiating caps? CYX vs. CYS?
In general, there will always be atoms that could be matched by multiple substructures. So we need to somehow order/prioritize the matches from least to most specific (like how we do in the force field).
Could do this manually
Could automate this
Alanine looks like a methyl cap
put methyl cap BEFORE alanine in substructure dictionary
CYX is a substructure of CYS
CYX should go BEFORE CYS
General rule could be “smaller substructure always comes before larger substructure”
Could sort by number of atoms → Let’s do this initially
Could determine ordering by doing a substructure search of every substructure in every other substructure – If substructure 1 is inside of substructure 2, then 1 must come before 2 in the dictionary
In the long run, it may be good to harvest these substructures from the CCD, but it’s not clear how easy that will be.
components.cif has alanine (comp_id=ALA), but it is capped with hydrogens – Is there some pattern we could use to find all the amino acids in their “main chain” form and extract their substructure dictionary from the CCD?
Also, is there a general substructure dictionary format that’s standard in the field? If so, we should use that.
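The ordering rule discussed above (smaller substructures before larger ones, or more strictly, any substructure before the substructures that contain it) could be sketched roughly as follows. This is only a minimal sketch, not toolkit code: the `library` dictionary, its atom counts, and the hand-written `contained_in` relation are hypothetical placeholders. In practice the containment relation would come from an actual substructure search of every entry against every other entry.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy library: name -> number of atoms in the substructure.
# The counts are illustrative placeholders, not a real residue library.
library = {"CYS": 11, "ALA": 10, "CYX": 10, "methyl_cap": 4}

# Hypothetical containment relation: key -> substructures found inside it.
# A real implementation would compute this with a substructure search.
contained_in = {"ALA": {"methyl_cap"}, "CYS": {"CYX"}}


def order_by_size(lib):
    """Initial plan from the notes: sort by atom count (name breaks ties)."""
    return sorted(lib, key=lambda name: (lib[name], name))


def order_by_containment(names, contains):
    """Stricter rule: if A is inside B, then A must come before B."""
    graph = {name: contains.get(name, set()) for name in names}
    # TopologicalSorter treats the mapped values as predecessors, so every
    # contained substructure is emitted before the one that contains it.
    return list(TopologicalSorter(graph).static_order())


print(order_by_size(library))
print(order_by_containment(library, contained_in))
```

Sorting by size is cheap and covers the two cases raised in the notes, but it does not order two unrelated substructures of equal size in any meaningful way; the containment-based ordering only constrains pairs that actually overlap.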
PDB → OFFMol w/ bond orders and charges
Existing Topology.from_openmm uses networkx to do complete structure matching – Can modify this to do SUBstructure matching
Topology.from_openmm:
Molecule.are_isomorphic:
Should we ever trust residue/atom names that are being read from the PDB file?
JW – I think we shouldn’t. This would invite a lot of complexity and guessing around user intent.
LW – could do what OpenMM does, which is raise an error like “I don’t know what this thing is but it almost looks like ARG”
JW – Should we match PDB substructures based on connectivity, or metadata/names?
When we load from PDB, two molecule representations will be made:
A placeholder OFFMol will be created, with one atom for each atom in the PDB. These OFFMol atoms will immediately gain all of the metadata from the PDB atoms (atom name, residue name, residue number)
A NetworkX graph of the entire PDB will be created and matched to our substructure library to figure out the elements, bond orders, stereochemistry, and formal charges. The information from this matching will then be used to update the placeholder OFFMol from the previous bullet point, and the NetworkX graph of the protein will be deleted.
We will try to match PDBs using their connectivity
IP – Will users from other communities have PDB files that don’t have e.g. residues? For example, do materials scientists have PDBs with monoatomic ions, where the PDB data does fully encode their meaning?
In these cases, they’ll need to do TypedMolecule.from_file('xxx.pdb')
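The two-representation loading flow described above might look something like the sketch below. Everything here is an assumption made for illustration: `PlaceholderAtom`, `find_match`, `apply_template`, and the ammonium `TEMPLATE` are hypothetical names, plain adjacency dictionaries stand in for the NetworkX graph, and elements are used as node labels during matching. The matcher is a brute-force backtracking search that only suits very small graphs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlaceholderAtom:
    # Metadata copied straight from the PDB atom record.
    name: str
    residue_name: str
    residue_number: int
    element: str
    # Chemistry that is only known after template matching.
    formal_charge: Optional[int] = None


def find_match(pdb_elements, pdb_adj, tmpl_elements, tmpl_adj):
    """Return {template index: PDB index} or None (brute-force backtracking)."""
    def extend(mapping):
        t = len(mapping)  # next template atom to place
        if t == len(tmpl_elements):
            return dict(mapping)
        used = set(mapping.values())
        for p, element in enumerate(pdb_elements):
            if p in used or element != tmpl_elements[t]:
                continue
            # Every already-placed template neighbour must be bonded to p.
            if all(mapping[u] in pdb_adj[p] for u in tmpl_adj[t] if u in mapping):
                mapping[t] = p
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[t]
        return None
    return extend({})


# Hypothetical template entry: bonds are (atom i, atom j, bond order).
TEMPLATE = {
    "elements": ["N", "H", "H", "H", "H"],
    "bonds": [(0, 1, 1), (0, 2, 1), (0, 3, 1), (0, 4, 1)],
    "formal_charges": [1, 0, 0, 0, 0],
}


def apply_template(atoms, pdb_adj, template):
    """Copy formal charges onto the placeholder atoms; return typed bonds."""
    tmpl_adj = {i: set() for i in range(len(template["elements"]))}
    for i, j, _order in template["bonds"]:
        tmpl_adj[i].add(j)
        tmpl_adj[j].add(i)
    elements = [atom.element for atom in atoms]
    mapping = find_match(elements, pdb_adj, template["elements"], tmpl_adj)
    if mapping is None:
        raise ValueError("no template matched the PDB connectivity")
    for t, p in mapping.items():
        atoms[p].formal_charge = template["formal_charges"][t]
    return [(mapping[i], mapping[j], order) for i, j, order in template["bonds"]]


# Toy input: an ammonium ion whose atoms are not in template order.
records = [("H1", "H"), ("H2", "H"), ("N", "N"), ("H3", "H"), ("H4", "H")]
atoms = [PlaceholderAtom(name, "NH4", 1, element) for name, element in records]
pdb_adj = {0: {2}, 1: {2}, 2: {0, 1, 3, 4}, 3: {2}, 4: {2}}

bonds = apply_template(atoms, pdb_adj, TEMPLATE)
del pdb_adj  # the temporary connectivity graph is discarded after matching
print(atoms[2], bonds)
```

The PDB metadata (atom name, residue name, residue number) is never touched by the matching step, which is the point of keeping the two representations separate.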
What if there are multiple molecules in the PDB file?
Molecule.from_pdb_file(source) → List[Molecule]
If ANY molecules in the PDB fail to load, then nothing is returned
If they’re both proteins:
Return a list of molecules
If it’s a protein and a ligand:
This will raise an UnrecognizedSubstructureError by default. If the user really wants to be dangerous they could add the ligand to the substructure library, but we wouldn’t guarantee correctness.
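The all-or-nothing behaviour discussed above could be sketched like this. It is a hedged illustration only: `load_molecules` and `DEFAULT_LIBRARY` are hypothetical names, and each molecule is represented as a list of fragment labels that merely stand in for the outcome of connectivity-based matching. Only the exception name comes from the discussion.

```python
class UnrecognizedSubstructureError(Exception):
    """Raised when part of a molecule matches nothing in the library."""


# Hypothetical library, keyed by fragment label for the sake of the sketch.
DEFAULT_LIBRARY = frozenset({"ALA", "CYS", "CYX", "GLY"})


def load_molecules(molecules, library=DEFAULT_LIBRARY):
    """Return one entry per molecule, or raise if ANY molecule fails."""
    loaded = []
    for index, fragments in enumerate(molecules):
        unknown = sorted(set(fragments) - set(library))
        if unknown:
            # Raising discards `loaded`, so a single failure returns nothing.
            raise UnrecognizedSubstructureError(
                f"molecule {index} has unrecognized substructures: {unknown}"
            )
        loaded.append(list(fragments))
    return loaded


two_proteins = [["ALA", "GLY"], ["CYS", "ALA"]]
protein_and_ligand = [["ALA", "GLY"], ["LIG"]]

print(len(load_molecules(two_proteins)))

try:
    load_molecules(protein_and_ligand)
except UnrecognizedSubstructureError as error:
    print(error)

# A user who accepts the risk can extend the library with their own entry.
extended = DEFAULT_LIBRARY | {"LIG"}
print(len(load_molecules(protein_and_ligand, library=extended)))
```

Passing the library as an argument keeps the default behaviour strict while leaving room for user-supplied substructures, with correctness then being the user's responsibility.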
What if the user WANTS to load a modified AA, and can tell us about it?
Users could have their own libraries/extend our library
Big question here is which format they’d enter their new substructures in
IP will do another research cycle on reading substructures from CCD (basically “can we populate our canonical AA substructures from something in CCD?”)
In general, what should happen if a user runs Molecule.from_pdb_file and there’s an unrecognized substructure in the input?
Options could include:
An error like OpenMM – At least saying “there’s an unrecognized substructure that is made of these atom indices”, and even better would be “there’s an unrecognized substructure, but it almost matches XYZ”
A warning (and then it returns ???)
Could just load a TypedMolecule
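One way the "it almost matches XYZ" error message could be produced is to compare element compositions against each library entry and report the nearest one. This is a rough sketch under stated assumptions: `closest_template` and `describe_unrecognized` are hypothetical helpers, the compositions are toy placeholders, and a composition-based distance is only a stand-in for whatever graph-based similarity would really be used.

```python
from collections import Counter

# Hypothetical toy library: name -> element list (illustrative compositions).
templates = {
    "ARG": ["C"] * 6 + ["N"] * 4 + ["O"] + ["H"] * 13,
    "LYS": ["C"] * 6 + ["N"] * 2 + ["O"] + ["H"] * 13,
}


def closest_template(elements, library):
    """Return (name, distance), where distance counts mismatched atoms."""
    target = Counter(elements)

    def distance(name):
        reference = Counter(library[name])
        # Counter subtraction keeps only positive counts, so adding both
        # directions gives the number of extra plus missing atoms.
        return sum(((target - reference) + (reference - target)).values())

    best = min(library, key=lambda name: (distance(name), name))
    return best, distance(best)


def describe_unrecognized(atom_indices, elements, library):
    """Build an error message that names the nearest known substructure."""
    name, mismatch = closest_template(elements, library)
    return (
        f"Unrecognized substructure made of atom indices {list(atom_indices)}; "
        f"it almost matches {name} ({mismatch} atom(s) differ)"
    )


# Toy query: the ARG composition with one hydrogen missing.
query = ["C"] * 6 + ["N"] * 4 + ["O"] + ["H"] * 12
print(describe_unrecognized(range(len(query)), query, templates))
```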
Can the substructure dictionary above help us go from PDB → SDF?
In general, should our design only have one substructure database, or two?
What if we had a protein with a mix of D and L amino acids? The cheminformatics substructure would know the difference, but the PDB substructures wouldn’t.
How will we tell the difference between serine and cysteine? Only difference is O → S
How will we tell the difference between methionine and selenomethionine (S → Se)?
LW – could mandate that element is required.
LW – I tried loading an element-less PDB file using OpenMM and it correctly guessed the elements (probably using atom name)
If we remove the element column, OpenMM cannot tell the difference between sulfur and selenium
So, we need the element column to be required.
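Requiring the element column could be enforced with a check along these lines. This is a minimal sketch: `read_elements` and `atom_line` are hypothetical helpers, and the only fact relied on is that the fixed-width PDB format stores the element symbol right-justified in columns 77-78 and the atom name in columns 13-16.

```python
def read_elements(pdb_lines):
    """Return the element of every ATOM/HETATM record, refusing to guess."""
    elements = []
    for line_number, line in enumerate(pdb_lines, start=1):
        if not line.startswith(("ATOM  ", "HETATM")):
            continue
        element = line[76:78].strip()  # columns 77-78, zero-based slice
        if not element:
            atom_name = line[12:16].strip()  # columns 13-16
            raise ValueError(
                f"line {line_number}: atom {atom_name!r} has no element; "
                "the element column is required"
            )
        elements.append(element.capitalize())
    return elements


def atom_line(serial, name, res_name, element, record="ATOM"):
    """Hypothetical helper that builds a fixed-width record for the demo."""
    return (
        f"{record:<6}{serial:>5} {name:<4} {res_name:>3} A{1:>4}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{0.0:6.2f}"
        f"{'':10}{element:>2}"
    )


with_elements = [
    atom_line(1, "SD", "MET", "S"),
    atom_line(2, "SE", "MSE", "SE", record="HETATM"),
]
print(read_elements(with_elements))

try:
    read_elements([atom_line(1, "SD", "MET", "")])
except ValueError as error:
    print(error)
```

Failing loudly on a blank element avoids having to infer sulfur versus selenium from atom names.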
The best outcome will be reading PDB/PDBx/mmCIF using RDKit (and possibly also OpenEye)
For now let’s aim to load PDB using RDKit. The initial plan will be to not support mmCIF until it’s in RDKit, though if we have extra dev time, we can try making MDAnalysis/MDTraj a soft dependency and allow users to try loading mmCIF with those.
The next best outcome would be to use OpenMM/MDAnalysis/MDTraj
Next steps
Beginning implementation
Should start building on a branch of the openff toolkit (not a fork)
This can follow the “major feature branch” development pattern, where we target PRs into a branch instead of into master. This way, we can review the changes in small bits (and skip review when Jeff’s busy), and not have to review the final product in its entirety before merging to master.
Eg:
Squash-merge each PR into the feature branch. This is so that, if we decide to make major design changes midway through but don’t want to throw everything away, we can cherry-pick entire PRs into wherever we start working next.
Action items
Iván Pulido: Do another search on reading substructures from the CCD.
Decisions