IP – I’ve made substructures that fully match T4 lysozyme based on AMBER substructures
IP – There's a problem with excessive runtime to do all this substructure matching.
JW – This is hard to avoid because we can’t reduce the protein as we label parts of it, and we will likely have to search through the entire SMARTS list because a real protein will have at least one of each amino acid.
IP – It would be very helpful to have the caching implementation done
JW – I could merge the incomplete implementation into the biopolymer topology feature branch and leave a failing test for it, so that we know to fix it before we merge.
Let’s do this – Could even try during today’s working session
JW – Alternatively, instead of using find_smarts_matches, which runs to_rdkit on every call, we could add a find_multi_smarts_matches that takes a LIST of SMARTS as input and runs to_rdkit only once for the whole list.
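(Sketch) – A minimal illustration of the "convert once, match many" idea, not the toolkit's implementation. The signature of find_multi_smarts_matches is an assumption. The convert and match_one callables are hypothetical stand-ins for to_rdkit and a per-SMARTS substructure search, and the toy versions below work on plain strings so the example runs without any chemistry toolkit.

```python
from typing import Callable, Dict, List, Sequence, Tuple, TypeVar

Backend = TypeVar("Backend")
Match = Tuple[int, ...]


def find_multi_smarts_matches(
    molecule: object,
    smarts_list: Sequence[str],
    convert: Callable[[object], Backend],
    match_one: Callable[[Backend, str], List[Match]],
) -> Dict[str, List[Match]]:
    """Match every pattern in smarts_list while converting the molecule once."""
    backend_mol = convert(molecule)  # the expensive step, done a single time
    return {smarts: match_one(backend_mol, smarts) for smarts in smarts_list}


# Toy stand-ins (hypothetical): the "molecule" is a string of atom symbols
# and a "pattern" is a substring of it.
conversion_calls = 0


def toy_convert(molecule: str) -> str:
    global conversion_calls
    conversion_calls += 1
    return molecule.upper()


def toy_match(backend_mol: str, pattern: str) -> List[Match]:
    matches = []
    start = backend_mol.find(pattern)
    while start != -1:
        matches.append(tuple(range(start, start + len(pattern))))
        start = backend_mol.find(pattern, start + 1)
    return matches


results = find_multi_smarts_matches("nccocc", ["CC", "O"], toy_convert, toy_match)
print(results, conversion_calls)
```

With a real backend, convert would wrap the molecule conversion and match_one would wrap the substructure search, so the conversion cost no longer scales with the length of the SMARTS list.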
IP – When matching, if multiple substructures match the same atoms, I take the largest one.
JW – This is a pretty good idea. But what if their residue database has one residue that looks like alanine and another that looks like alanine plus a neighboring backbone? How would this know what to do with an ALA-ALA sequence?
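(Sketch) – A toy version of the "largest match wins" rule, to make the ALA-ALA concern concrete. The greedy strategy, the Match type, and the atom indices are all assumptions for illustration. This is not necessarily how IP's code resolves overlaps.

```python
from typing import Dict, List, NamedTuple, Tuple


class Match(NamedTuple):
    label: str
    atoms: Tuple[int, ...]


def assign_largest_first(matches: List[Match]) -> Dict[int, str]:
    """Label atoms greedily, largest matches first, skipping any overlap."""
    assigned: Dict[int, str] = {}
    for match in sorted(matches, key=lambda m: len(m.atoms), reverse=True):
        if any(atom in assigned for atom in match.atoms):
            continue  # a larger match already claimed some of these atoms
        for atom in match.atoms:
            assigned[atom] = match.label
    return assigned


# Hypothetical ALA-ALA: atoms 0-4 are residue 1, atoms 5-9 are residue 2.
# "ALA+backbone" covers residue 1 plus two backbone atoms of residue 2.
matches = [
    Match("ALA", (0, 1, 2, 3, 4)),
    Match("ALA", (5, 6, 7, 8, 9)),
    Match("ALA+backbone", (0, 1, 2, 3, 4, 5, 6)),
]
assigned = assign_largest_first(matches)
unlabeled = [atom for atom in range(10) if atom not in assigned]
print(assigned, unlabeled)
```

In this toy case the larger pattern claims atoms 0-6, both plain ALA matches are rejected for overlapping it, and atoms 7-9 end up with no label, which is the kind of ambiguity JW is asking about.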
from openff.toolkit.topology import Molecule
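# NOTE: mol is assumed to be an existing Molecule for alanine; its construction was not recorded in these notes
# Will need to fix carboxylate bond orders here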
mol.to_file('ALA.sdf', file_format='sdf')
Where else could protein SDFs come from? IP will ask on developers channel, possibly also directly ask perses devs.
IP – I’ve also tried out some mmCIF parsers. None of them are particularly friendly, but I think Biopython is the best.
(IP gave demo of using mmcif to read T4 lysozyme and show iterators)
JW – Can it read components.cif?
IP – It doesn’t do a very good job. It seems to overwrite a lot of what it reads because it doesn’t understand the multi-entry format of components.cif.
(General) – We could either chunk up this file ourselves using Python’s readlines, or keep looking into the Biopython API docs to see if it offers a different kind of iterator.
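(Sketch) – One way the "chunk it up ourselves" option could look, relying only on the fact that each CIF data block starts with a data_<name> line. The function name iter_cif_blocks is hypothetical. Known limitations of this sketch: it assumes a lowercase data_ keyword at the start of a line, and it would split incorrectly if a multi-line text field happened to contain such a line.

```python
from typing import Iterable, Iterator, List, Tuple


def iter_cif_blocks(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (block_name, block_lines) for each data_ block in a CIF file."""
    name = None
    block: List[str] = []
    for line in lines:
        if line.startswith("data_"):
            if name is not None:
                yield name, block  # emit the block that just ended
            name = line[len("data_"):].strip()
            block = [line]
        elif name is not None:
            block.append(line)  # lines before the first data_ are ignored
    if name is not None:
        yield name, block  # emit the final block


# Tiny in-memory stand-in for a multi-entry file such as components.cif.
sample = (
    "data_ALA\n"
    "_chem_comp.id ALA\n"
    "_chem_comp.name ALANINE\n"
    "data_GLY\n"
    "_chem_comp.id GLY\n"
    "_chem_comp.name GLYCINE\n"
)
blocks = dict(iter_cif_blocks(sample.splitlines(keepends=True)))
print(list(blocks))
```

For the real file, the same generator could be fed an open file handle, which iterates line by line without loading everything into memory. Each block's lines could then be joined and handed to whichever parser is chosen, one entry at a time.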
RDKit deterministic confs PR
The reduction in the number of generated conformers is not due to the canonical ordering – It’s all due to changes in the 2021.03 RDKit release
OpenEye tests started failing when we set omega.SetCanonOrder(True). This is the OPPOSITE of what we’d expect.
IP will revert the OE change in the PR
IP will contact support@eyesopen.com with a reproducing example of the behavior – We shouldn’t even need to run AM1 calculations, just the omega conformer generation