...

QCA dataset submission details/timeline

  • DD – For our previous big dataset (Industry Benchmark), we had issues submitting a dataset containing 70k molecules adding up to about 500k “entities”. Due to some technical debt in metadata handling, a large data blob needs to pass through several interfaces, and some interfaces have hit size limits. So until this process gets improved, it may make sense to split the dataset into separate blobs and stitch them together later.

  • PE – There will be conceptual divisions in the dataset, and we can add technical divisions as needed. Some datasets will only be ~10k conformations, but others will be based on Enamine and have >1 million conformations.

  • DD – For now the safe thing to do would be to stay under 250k entities per dataset.

  • PE – Next will be DrugBank, likely of a similar size. Then there’s DS370k, which will also be large.

  • DD – SBoothroyd had suggested making a “test set” to test out the submission process without having a large resource requirement.

  • PE – My current dipeptide submission will be under a thousand molecules, just a few confs each.

  • DD – Sounds good. It’s just important to stay under 250k entities, and remember that different compute specs will multiply the number of entities.

  • PE – This submission should only be a few tens of thousands of entities.

  • PE – Most of my datasets moving forward will be in the tens of thousands, some in the hundreds, and one or two in the millions. But I’m happy to split those up for technical reasons. What’s the timeline for a fix?

  • DD – The earliest we could deploy the fixes would be around the end of 2021, but the timeline is fairly uncertain, and this work is happening against the backdrop of a more significant refactor.

  • PE – Happy to start with smaller datasets and scale up over time, so this may line up with that schedule anyway.
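The sizing discussion above can be captured as a quick back-of-envelope check: entity counts scale multiplicatively with conformations and compute specs, and datasets over the cap get split into separate blobs. This is a minimal sketch; the 250k cap comes from the notes above, but the helper names and example numbers are illustrative, not an actual submission API.

```python
# Back-of-envelope sizing for a QCA submission, per the discussion above.
# Entity count scales multiplicatively with compute specs, so a dataset
# that fits under the cap with one spec may exceed it with several.

ENTITY_CAP = 250_000  # current "safe" per-dataset limit from the notes above


def entity_count(n_molecules, confs_per_molecule, n_compute_specs=1):
    """Rough number of 'entities' a submission will create."""
    return n_molecules * confs_per_molecule * n_compute_specs


def n_blobs(total_entities, cap=ENTITY_CAP):
    """How many separate blobs a dataset must be split into to stay under the cap."""
    return -(-total_entities // cap)  # ceiling division


# Industry Benchmark-like case: ~500k entities total -> needs 2 blobs
print(n_blobs(500_000))            # -> 2
# Dipeptide-like submission: <1k molecules, a few confs each, one spec
print(entity_count(1_000, 5))      # -> 5000, well under the cap
```

Splitting by ceiling division is the simplest policy; in practice the conceptual divisions PE mentions would likely determine where the blob boundaries fall.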

Dipeptide submission working session

  • We need both a “full chemical graph” and a “correctly atom-typed PDB” for each molecule in the workflow. It’s fine if these are created using different methods, since we can match a full chemical graph to a PDB when needed. Using OpenMM, we can go from sequence (26AA) → correctly atom-typed PDB. But how can we go from sequence (26AA) → full chemical graph?

  • xyz2mol and OpenBabel do a reasonable job of guessing bond orders and formal charges from PDB/XYZ, but they still have high single-digit error rates.

  • OpenMM HAS correct bond orders for the molecules it builds from sequence, which should in principle make it possible to deduce formal charges (and vice versa). But this would require rolling our own solution, and it may still run afoul of tricky cheminformatics problems. So the ideal solution will have a way to convey EXACTLY the chemical graph we intend, with no interpretation needed.

  • Can RDKit make the chemical graphs we need from sequence?

    • No, it can’t handle sidechain protonation variants.

      https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/Code/GraphMol/FileParsers/SequenceParsers.cpp#L608

  • Could we get the full chemical graph of dipeptides using the tleap method (as in amber-ff-porting)?

    • Maybe, but this would need to tiptoe around mangling atom names and reconciling different meanings of bond orders/formal charges.

  • OpenBabel perceives bond orders and formal charges for “properly-formatted” PDBs, e.g. obabel -ipdb disulfide.pdb -osdf -O disulfide.sdf

  • We’ll try OpenMM → PDB → OpenBabel → SDF → OpenFF
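The pipeline in the last bullet can be sketched as a small driver. This is only an illustration: the helper names and file paths are hypothetical, the OpenBabel command is taken verbatim from the bullet above, and actually running `convert` assumes obabel is installed and on PATH.

```python
import subprocess


def obabel_pdb_to_sdf_cmd(pdb_path, sdf_path):
    """Build the OpenBabel call from the notes above:
    obabel -ipdb disulfide.pdb -osdf -O disulfide.sdf
    OpenBabel perceives bond orders and formal charges during this step."""
    return ["obabel", "-ipdb", pdb_path, "-osdf", "-O", sdf_path]


def convert(pdb_path, sdf_path):
    # The PDB -> SDF step of the planned
    # OpenMM -> PDB -> OpenBabel -> SDF -> OpenFF pipeline.
    # Requires obabel on PATH; raises CalledProcessError on failure.
    subprocess.run(obabel_pdb_to_sdf_cmd(pdb_path, sdf_path), check=True)


# Show the exact command that would be run (no obabel needed for this):
print(obabel_pdb_to_sdf_cmd("disulfide.pdb", "disulfide.sdf"))
```

The remaining steps (OpenMM writing the PDB from sequence, and OpenFF loading the resulting SDF) would wrap this in the same fashion; the SDF is the point where the perceived chemical graph still needs to be checked against the intended one, per the error rates noted above.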

...