2021-05-20 Topology Working Session Meeting notes

Date

May 20, 2021

Participants

  • @Jeffrey Wagner

  • @Iván Pulido

  • @Lily Wang

Discussion topics

Notes

  • IP – I’ve made substructures that fully match T4 lysozyme based on AMBER substructures

    • IP – There's a problem with excessive runtime to do all this substructure matching.

    • JW – This is hard to avoid because we can’t reduce the protein as we label parts of it, and we will likely have to search through the entire SMARTS list because a real protein will have at least one of each amino acid.

    • IP – It would be very helpful to have the caching implementation done

    • JW – I could merge the incomplete implementation into the biopolymer topology feature branch and leave a failing test for it, so that we know to fix it before we merge.

      • Let’s do this – Could even try during today’s working session

    • JW – Instead of using find_smarts_matches, which runs to_rdkit on every call, we could add find_multi_smarts_matches, which takes a LIST of SMARTS as input and runs to_rdkit only once for the whole list.
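A minimal sketch of the "convert once, match many" idea behind the proposed find_multi_smarts_matches. This is not the toolkit's implementation: the `convert` and `match_one` parameters, and the toy stand-ins `toy_convert` and `toy_match`, are hypothetical placeholders for to_rdkit and a real SMARTS matcher, and plain strings stand in for molecules and patterns.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Match = Tuple[int, ...]


def find_multi_smarts_matches(
    molecule: str,
    smarts_list: Iterable[str],
    convert: Callable[[str], str],
    match_one: Callable[[str, str], List[Match]],
) -> Dict[str, List[Match]]:
    """Convert `molecule` once, then match every pattern against the result."""
    converted = convert(molecule)  # the expensive step, hoisted out of the loop
    return {smarts: match_one(converted, smarts) for smarts in smarts_list}


# Toy stand-ins (hypothetical): NOT to_rdkit and NOT a real SMARTS matcher.
conversion_calls = {"count": 0}


def toy_convert(sequence: str) -> str:
    """Pretend conversion that records how often it is called."""
    conversion_calls["count"] += 1
    return sequence.upper()


def toy_match(converted: str, pattern: str) -> List[Match]:
    """Return the index tuple of every (possibly overlapping) occurrence."""
    hits: List[Match] = []
    start = converted.find(pattern)
    while start != -1:
        hits.append(tuple(range(start, start + len(pattern))))
        start = converted.find(pattern, start + 1)
    return hits


matches = find_multi_smarts_matches("gaag", ["A", "AA", "G"], toy_convert, toy_match)
print(matches)
print(conversion_calls["count"])  # conversions performed for three patterns
```

With a per-pattern function the conversion would run once per SMARTS; here it runs once per molecule, however long the pattern list is.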

  • IP – When matching, if multiple substructures match the same atoms, I take the largest one.

    • JW – This is a pretty good idea. But what if their residue database has one residue that looks like alanine and another that looks like alanine plus a neighboring backbone? How would this know what to do with an ALA-ALA sequence?
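One possible reading of the "take the largest match" rule, sketched as a greedy pass that accepts matches from largest to smallest and skips any that overlap atoms already assigned. The greedy ordering, the names `SubstructureMatch` and `pick_largest_matches`, and the atom indices are all assumptions for illustration. The toy data mimics the ALA-ALA concern: ten atoms, five per residue, with a second database entry that covers alanine plus two atoms of the neighboring backbone.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class SubstructureMatch(NamedTuple):
    name: str
    atoms: Tuple[int, ...]


def pick_largest_matches(matches: Iterable[SubstructureMatch]) -> List[SubstructureMatch]:
    """Greedily keep the largest matches that do not overlap earlier picks."""
    assigned: Set[int] = set()
    chosen: List[SubstructureMatch] = []
    # Largest first; name and atom indices break ties deterministically.
    for match in sorted(matches, key=lambda m: (-len(m.atoms), m.name, m.atoms)):
        if assigned.isdisjoint(match.atoms):
            chosen.append(match)
            assigned.update(match.atoms)
    return chosen


# Hypothetical ALA-ALA: atoms 0-4 are residue 1, atoms 5-9 are residue 2.
candidates = [
    SubstructureMatch("ALA", (0, 1, 2, 3, 4)),
    SubstructureMatch("ALA", (5, 6, 7, 8, 9)),
    SubstructureMatch("ALA_plus_backbone", (0, 1, 2, 3, 4, 5, 6)),
]

chosen = pick_largest_matches(candidates)
covered = {index for match in chosen for index in match.atoms}
unassigned = sorted(set(range(10)) - covered)

print([match.name for match in chosen])
print(unassigned)
```

In this toy case the larger entry wins first and blocks both plain alanine matches, so part of the second residue is left unlabeled. That is the ambiguity raised above; the sketch does not resolve it.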

  • Other sources of protein structures for testing:

    • PDBs:

      • https://github.com/MCompChem/fep-benchmark/

    • SDFs:

      • To get SDF from PDB:

        • in tleap:

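          source leaprc.protein.ff14SB
          mol = loadPdb ALA.pdb
          saveMol2 mol "ALA.mol2" 0

        • in python:

          from openff.toolkit.topology import Molecule
          # Assumed step (not in the original notes): load the mol2 written by tleap
          mol = Molecule.from_file('ALA.mol2')
          # Will need to fix carboxylate bond orders here
          mol.to_file('ALA.sdf', file_format='sdf')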

      • Where else could protein SDFs come from? IP will ask on developers channel, possibly also directly ask perses devs.

  • IP – I’ve also tried out some mmCIF parsers. None of them are particularly friendly, but I think Biopython is the best.

    • (IP gave demo of using mmcif to read T4 lysozyme and show iterators)

    • JW – Can it read components.cif?

    • IP – It doesn’t do a very good job. It seems to overwrite a lot of what it reads because it doesn’t understand the multi-entry format of components.cif.

      • (General) – We could either chunk up this file ourselves using Python’s readlines, or keep looking into the Biopython API docs to see if it offers a different kind of iterator.
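A rough sketch of the "chunk it up ourselves" option, assuming each entry in components.cif starts with a `data_<ID>` line, as CIF data blocks do. The function name `iter_cif_blocks` and the tiny in-memory sample are hypothetical. The sketch also assumes that no line inside a multi-line text field happens to start with `data_`; a real implementation would need to guard against that.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def iter_cif_blocks(handle: TextIO) -> Iterator[Tuple[str, List[str]]]:
    """Yield (block_id, lines) for each `data_` block in a multi-entry CIF file."""
    block_id = None
    lines: List[str] = []
    for line in handle:
        if line.startswith("data_"):
            # A new block begins: emit the previous one, if any.
            if block_id is not None:
                yield block_id, lines
            block_id = line[len("data_"):].strip()
            lines = [line]
        elif block_id is not None:
            lines.append(line)
        # Lines before the first data_ header are ignored.
    if block_id is not None:
        yield block_id, lines


# Tiny made-up sample in place of the real components.cif.
sample = io.StringIO(
    "data_ALA\n"
    "_chem_comp.id ALA\n"
    "_chem_comp.name ALANINE\n"
    "data_GLY\n"
    "_chem_comp.id GLY\n"
    "_chem_comp.name GLYCINE\n"
)

blocks = dict(iter_cif_blocks(sample))
print(list(blocks))
print(len(blocks["ALA"]))
```

Because the file is streamed line by line, only one entry is held in memory at a time, and each chunk could be handed to whichever single-entry parser is chosen.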

  • RDKit deterministic confs PR

    • The reduction in the number of generated conformers is not due to the canonical ordering – It’s all due to changes in the 2021.03 RDKit release

    • OpenEye tests started failing when we set omega.SetCanonOrder(True). This is the OPPOSITE of what we’d expect.

      • IP will revert the OE change in the PR

      • IP will contact support@eyesopen.com with a reproducing example of the behavior – We shouldn’t even need to run AM1 calculations, just the omega conformer generation

      • (did more digging, more details in PR comments)

Action items

Decisions