Subgraph matching resources | SDF → perceive_residues: The essential bits are covered in the amber_ff_porting scripts that we’ve made in previous meetings. To apply the substructure library, I suggest Molecule.chemical_environment_matches. This does SMARTS matching and should be directly suitable for assigning atom metadata. How do we differentiate caps? CYX vs. CYS? In the long run, it may be good to harvest these substructures from the CCD, but it’s not clear how easy that will be. components.cif has alanine (comp_id=ALA), but it is capped with hydrogens. Is there some pattern we could use to find all the amino acids in their “main chain” form and extract their substructure dictionary from the CCD?
Also, is there a general substructure dictionary format that’s standard in the field? If so, we should use that.
PDB → OFFMol w/ bond orders and charges: The existing Topology.from_openmm uses networkx to do complete structure matching. We can modify this to do SUBstructure matching. Topology.from_openmm:
Molecule.are_isomorphic:
Should we ever trust residue/atom names that are being read from the PDB file? JW – I think we shouldn’t. This would invite a lot of complexity and guessing around user intent. LW – We could do what OpenMM does, which is raise an error like “I don’t know what this thing is but it almost looks like ARG”. IP – Will users from other communities have PDB files that don’t have e.g. residues? For example, do materials scientists have PDBs with monoatomic ions where the PDB data does fully encode their meaning?
What if there are multiple molecules in the PDB file? Molecule.from_pdb_file(source) → List[Molecule]
If they’re both proteins: If it’s a protein and a ligand:
What if the user WANTS to load a modified AA, and can tell us about it? In general, what should happen if a user runs Molecule.from_pdb_file and there’s an unrecognized substructure in the input? Options could include: (1) An error like OpenMM’s, at least saying “there’s an unrecognized substructure that is made of these atom indices”, and even better would be “there’s an unrecognized substructure, but it almost matches XYZ”. (2) A warning (and then it returns ???). (3) Could just load a TypedMolecule.
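A minimal sketch of the error option, to make the discussion concrete. Everything here is an assumption for illustration and not existing toolkit API: the `UnrecognizedSubstructureError` class, the `closest_template` helper, and the idea of ranking "almost matches" by heavy-atom element counts (chosen because the notes lean toward not trusting residue/atom names).

```python
from collections import Counter


class UnrecognizedSubstructureError(ValueError):
    """Hypothetical error for atoms that match no known substructure."""

    def __init__(self, atom_indices, closest_match=None):
        self.atom_indices = list(atom_indices)
        self.closest_match = closest_match
        message = (
            "Unrecognized substructure made of atom indices "
            f"{self.atom_indices}"
        )
        if closest_match is not None:
            message += f", but it almost matches {closest_match}"
        super().__init__(message)


def closest_template(elements, templates):
    """Return the template name whose element counts differ least.

    `templates` maps a residue name to a list of element symbols
    (heavy atoms only in this sketch).
    """
    target = Counter(elements)

    def distance(name):
        reference = Counter(templates[name])
        # Counter subtraction drops non-positive counts, so the two
        # differences together count atoms missing from either side.
        return sum(((target - reference) + (reference - target)).values())

    return min(templates, key=distance)


# Illustrative heavy-atom compositions for two residues.
templates = {
    "ARG": ["C"] * 6 + ["N"] * 4 + ["O"],
    "LYS": ["C"] * 6 + ["N"] * 2 + ["O"],
}
# An arginine-like fragment carrying one extra oxygen.
observed = ["C"] * 6 + ["N"] * 4 + ["O"] * 2
error = UnrecognizedSubstructureError(
    atom_indices=range(40, 52),
    closest_match=closest_template(observed, templates),
)
```

Element counts alone cannot see connectivity or stereochemistry, so a real implementation would more likely rank candidates by how much of each template's graph matches.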
Can the substructure dictionary above help us go from PDB → SDF? In general, should our design only have one substructure database, or two? What if we had a protein with a mix of D and L amino acids? The cheminformatics substructure would know the difference, but the PDB substructures wouldn’t. How will we tell the difference between serine and cysteine? The only difference is O → S. How will we tell the difference between methionine and selenomethionine (S → Se)? LW – We could mandate that the element is required. LW – I tried loading an element-less PDB file using OpenMM and it correctly guessed the elements (probably using atom names). (LW tried loading https://files.rcsb.org/download/6CPZ.pdb using the same method, to see whether it distinguishes between Se and S in the selenomethionine.)
So, we need the element column to be required. The best outcome will be reading PDB/PDBx/mmCIF using RDKit (and possibly also OpenEye). For now, let’s aim to load PDB using RDKit. The initial plan will be to not support mmCIF until it’s in RDKit, though if we have extra dev time, we can try making MDAnalysis/MDTraj a soft dependency and allow users to try loading mmCIF with those.
The next best outcome would be to use OpenMM/MDAnalysis/MDTraj
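Independent of which reader ends up being used, the "element column is required" rule can be sketched in plain Python. This is only an illustration under stated assumptions: `read_pdb_elements` is a hypothetical helper, not part of any toolkit, and it relies on the fixed-width PDB layout (atom name in columns 13-16, residue name in columns 18-20, element symbol in columns 77-78).

```python
def read_pdb_elements(pdb_text):
    """Return (atom_name, residue_name, element) for each ATOM/HETATM record.

    Raises ValueError when a record has a blank element field, instead of
    guessing the element from the atom name.
    """
    atoms = []
    for line_number, line in enumerate(pdb_text.splitlines(), start=1):
        if not line.startswith(("ATOM  ", "HETATM")):
            continue
        atom_name = line[12:16].strip()
        residue_name = line[17:20].strip()
        # Columns 77-78 (0-based slice 76:78) hold the element symbol.
        element = line[76:78].strip()
        if not element:
            raise ValueError(
                f"line {line_number}: element column (77-78) is empty for "
                f"atom {atom_name!r} in residue {residue_name!r}"
            )
        # Normalize case so "SE" and "Se" compare equal.
        atoms.append((atom_name, residue_name, element.capitalize()))
    return atoms
```

With the element read explicitly, the SD atom of methionine (S) and the SE atom of selenomethionine (Se) are distinguished without relying on atom or residue names.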
|
Next steps | Beginning implementation: We should start building on a branch of the openff toolkit (not a fork). This can follow the “major feature branch” development pattern, where we target PRs into a branch instead of into master. This way, we can review the changes in small bits (and skip review when Jeff’s busy), and not have to review the final product in its entirety before merging to master. Eg:
Always squash merge each PR into the feature branch. That way, if we decide to make major design changes midway through but don’t want to throw everything away, we can cherry-pick entire PRs into wherever we start working next.
|