Checked out CCTBX. It DOES assign residue names and similar annotations, but it DOESN’T assign bond orders or formal charges. It is largely aimed at crystallographers with messy data. One module handles PDB, another handles mmCIF.
IP – Kinda tricky because it pulls in a bunch of dependencies
JW – Licensing issues?
(General) – It’s MIT or BSD3 licensed
Next steps
What loading pathways do we WANT to offer?
Load from Element + bond existence
Element PDB w/ CONECT
mmCIF w/ bonds
Load from atomtyped representation with atom names matching a known typing scheme
Atomtyped PDB w/o CONECT (Perses entry point)
mmCIF w/o bonds
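For reference, the "Element + bond existence" pathway listed above could start from something like the sketch below. It reads only the element column of ATOM/HETATM records and the CONECT records from fixed-column PDB text, using the Python standard library. The function names `parse_pdb_elements_and_bonds` and `atom_line` are hypothetical and not part of any toolkit; CONECT records are treated as bond existence only, so no bond orders or formal charges are inferred.

```python
def parse_pdb_elements_and_bonds(pdb_text):
    """Return ({serial: element}, {frozenset((i, j)), ...}) from PDB text.

    Hypothetical helper: only bond existence is recovered, not bond order.
    """
    elements = {}
    bonds = set()
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record in ("ATOM", "HETATM"):
            serial = int(line[6:11])
            # The element symbol sits in columns 77-78 (0-based slice 76:78).
            elements[serial] = line[76:78].strip().capitalize()
        elif record == "CONECT":
            serial = int(line[6:11])
            # Up to four bonded serials follow in 5-character fields.
            for start in range(11, 31, 5):
                field = line[start:start + 5].strip()
                if field:
                    bonds.add(frozenset((serial, int(field))))
    return elements, bonds


def atom_line(serial, name, element, resname="HOH"):
    """Build a fixed-column HETATM line (demo helper, coordinates zeroed)."""
    return (
        f"HETATM{serial:>5} {name:<4} {resname:>3} A{1:>4}{'':4}"
        f"{0.0:>8.3f}{0.0:>8.3f}{0.0:>8.3f}{1.0:>6.2f}{0.0:>6.2f}"
        f"{'':10}{element:>2}"
    )


# A single water molecule with explicit connectivity.
water_pdb = "\n".join([
    atom_line(1, "O", "O"),
    atom_line(2, "H1", "H"),
    atom_line(3, "H2", "H"),
    f"CONECT{1:>5}{2:>5}{3:>5}",
    f"CONECT{2:>5}{1:>5}",
    "END",
])

elements, bonds = parse_pdb_elements_and_bonds(water_pdb)
print(elements)
print(sorted(sorted(b) for b in bonds))
```

Storing each bond as a `frozenset` means a bond listed from both ends in CONECT (as the O-H1 bond is here) is only counted once.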
What loading pathways CAN we offer?
(slow) Molecule.from_pdb (matching to residue templates)
Element PDB w/ CONECT → OFFMol
(would require more work) mmCIF with bonds → OFFMol
CCTBX
Element PDB → Atomtyped PDB w/ CONECT
What prep method will people have used beforehand?
AMBER tleap protein prep → SDF that can probably be fixed
IP – We could speed up subgraph matching by splitting at peptide bonds.
JW – Agree. But how do we empower users to handle their own corner cases?
IP – Could let users add new residue SMILES
IP – Could let people match only a range of atoms for complex molecules
JW – This could work well, but we may end up with partially annotated molecules, and that could get really messy if people assign chemical information in separate steps. When they try to convert to a full OFFMol, it’ll be hard to communicate which parts didn’t get bonds + formal charges.
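A rough sketch of the "split at peptide bonds" idea from the discussion above, using only element and bond-existence information. This is not the actual prototype; the function names and the detection heuristic are assumptions made for illustration. A C-N bond is treated as a peptide bond when the carbon has a terminal oxygen neighbor and the nitrogen has at least two carbon neighbors, which skips side-chain amides such as Asn/Gln. Each resulting fragment could then be matched against residue templates independently, so the subgraph search runs on residue-sized pieces instead of the whole protein.

```python
from collections import defaultdict


def find_peptide_bonds(elements, bonds):
    """Return the set of bonds (as frozensets) that look like peptide bonds.

    Heuristic assumption: C-N bond where C carries a terminal O and
    N is bonded to at least two carbons.
    """
    neighbors = defaultdict(set)
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)

    peptide = set()
    for a, b in bonds:
        for c, n in ((a, b), (b, a)):
            if elements[c] != "C" or elements[n] != "N":
                continue
            has_terminal_o = any(
                elements[x] == "O" and len(neighbors[x]) == 1
                for x in neighbors[c]
            )
            carbon_count = sum(elements[x] == "C" for x in neighbors[n])
            if has_terminal_o and carbon_count >= 2:
                peptide.add(frozenset((c, n)))
    return peptide


def split_at_peptide_bonds(elements, bonds):
    """Cut peptide bonds and return connected fragments as sorted index lists."""
    cut = find_peptide_bonds(elements, bonds)
    neighbors = {i: set() for i in elements}
    for a, b in bonds:
        if frozenset((a, b)) not in cut:
            neighbors[a].add(b)
            neighbors[b].add(a)

    fragments, seen = [], set()
    for start in sorted(elements):
        if start in seen:
            continue
        # Depth-first walk over the remaining (uncut) bonds.
        stack, fragment = [start], set()
        while stack:
            i = stack.pop()
            if i in fragment:
                continue
            fragment.add(i)
            stack.extend(neighbors[i] - fragment)
        seen |= fragment
        fragments.append(sorted(fragment))
    return fragments


# Gly-Gly, heavy atoms only: N1 CA2 C3 O4 | N5 CA6 C7 O8 OXT9
gly_gly_elements = {1: "N", 2: "C", 3: "C", 4: "O",
                    5: "N", 6: "C", 7: "C", 8: "O", 9: "O"}
gly_gly_bonds = [(1, 2), (2, 3), (3, 4), (3, 5),
                 (5, 6), (6, 7), (7, 8), (7, 9)]

print(split_at_peptide_bonds(gly_gly_elements, gly_gly_bonds))
```

Note that only peptide bonds are cut in this sketch, so residues joined some other way (for example disulfide-bridged cysteines) stay in one fragment and would need their own handling.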
JW will plan for TypedMolecules to optionally hold element, formal charge, stereo, and bond info, and potentially allow them to be upscaled to OFFMols if all info is present.
IP will try to speed up subgraph matching by splitting at peptide bonds. This will provide a prototype and early users to start providing feedback and finding corner cases.
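To make the TypedMolecule plan above more concrete, here is a minimal sketch of a container whose chemical fields are all optional, with a check that reports exactly which atoms and bonds still lack information. Apart from the name `TypedMolecule`, everything here (`TypedAtom`, `missing_info`, `can_upscale`, the field layout) is an assumption for illustration and not the planned API. The actual conversion to an OFFMol is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TypedAtom:
    name: str                             # atom name from the typing scheme
    residue: str                          # residue name
    element: Optional[str] = None         # None means "not assigned yet"
    formal_charge: Optional[int] = None   # None means "not assigned yet"
    # Stereo is stored but not checked below: this sketch cannot tell
    # "unknown" apart from "not a stereocenter".
    stereo: Optional[str] = None


@dataclass
class TypedMolecule:
    atoms: List[TypedAtom]
    # Maps (i, j) atom index pairs to a bond order, or None if unknown.
    bond_orders: Dict[Tuple[int, int], Optional[int]] = field(default_factory=dict)

    def missing_info(self) -> List[str]:
        """List every atom or bond that still lacks chemical information."""
        problems = []
        for i, atom in enumerate(self.atoms):
            label = f"atom {i} ({atom.residue} {atom.name})"
            if atom.element is None:
                problems.append(f"{label}: element")
            if atom.formal_charge is None:
                problems.append(f"{label}: formal charge")
        for (i, j), order in sorted(self.bond_orders.items()):
            if order is None:
                problems.append(f"bond {i}-{j}: bond order")
        return problems

    def can_upscale(self) -> bool:
        """True only when nothing is missing, i.e. upscaling could be attempted."""
        return not self.missing_info()


# A partially annotated fragment: one formal charge and one bond order unknown.
partial = TypedMolecule(
    atoms=[
        TypedAtom("O", "HOH", element="O", formal_charge=0),
        TypedAtom("H1", "HOH", element="H"),
    ],
    bond_orders={(0, 1): None},
)

for problem in partial.missing_info():
    print(problem)
print(partial.can_upscale())
```

Returning a per-atom, per-bond list instead of a single pass/fail flag is one possible way to address the concern about communicating which parts didn't get bonds and formal charges.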
IP – I spoke with DHahn the other day. I’ll be contributing a bit to the PLBenchmarks repo, and will probably also be involved in the continuous benchmarking efforts. I’m thinking about making a PLBenchmarks conda package.
JW – PLBenchmarks has a bunch of protein structures prepared in Schrödinger, so that will be a great source of example input data.
IP – CCTBX refused to read these; they probably violate the PDB spec in some way.
JW – I don’t think the biopolymer stuff will be in a major OpenFF Toolkit release in 2021. Instead, we should either direct people to do development builds from the branch, or I can make omnia conda packages from the topology-refactor branch.