General updates | JM – Summary of PDB loading so far: I pulled out every PDB ID that contains only the 20 canonical AAs, about 20k entries. I tested my loader (which works directly from the CCD) on those 20k, and it failed on 8k. I think the difference is protonation state. I then trimmed down to PDB IDs with at least one hydrogen atom, about 10k entries. Topology.from_pdb raised an error on 5744 of those. The unloadable structures may have missing atoms. Many termini in the aa_variants file have capping/terminal chemistry that I’m not familiar with; I’m not sure whether they’re bad entries in the CCD or real.
JW – I’m surprised that, of 20k structures, more than a few had all Hs.
JM – Since I filtered to canonical AAs only, I think I wound up with mostly NMR structures.
JW – I’d like to avoid relying on the aa_variants file, since we’ll need to load all sorts of other variants.
JM – I plan to make different databases available to users, so there can be custom databases for loading from different sources, and the user decides which to use.
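For reference, the two filters JM describes (canonical AAs only, at least one hydrogen) could look roughly like the sketch below. This is not JM's actual script: the function names are hypothetical, it reads plain PDB-format text rather than the CCD, and it assumes waters and ligands count as non-canonical.

```python
# The 20 canonical amino acid residue names.
CANONICAL = frozenset({
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
})


def summarize_pdb(text):
    """Return (set of residue names, hydrogen count) for ATOM/HETATM records."""
    resnames = set()
    n_hydrogens = 0
    for line in text.splitlines():
        if not line.startswith(("ATOM  ", "HETATM")):
            continue
        resnames.add(line[17:20].strip())          # columns 18-20: residue name
        element = line[76:78].strip().upper()      # columns 77-78: element symbol
        if not element:
            # Fallback when the element column is absent: first letter of the
            # atom name after any leading digits (an approximation only).
            element = line[12:16].strip().lstrip("0123456789")[:1].upper()
        if element in ("H", "D"):
            n_hydrogens += 1
    return resnames, n_hydrogens


def is_canonical_only(text):
    """True if the file has atoms and every residue is one of the 20 canonical AAs."""
    resnames, _ = summarize_pdb(text)
    return bool(resnames) and resnames <= CANONICAL


def has_hydrogens(text):
    """True if at least one hydrogen (or deuterium) atom is present."""
    return summarize_pdb(text)[1] > 0
```

Applying `is_canonical_only` and then `has_hydrogens` to each entry's text would reproduce the two-stage trim described above, under the stated assumptions.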
Then I tried loading with (PDBFixer → Topology.from_pdb) and spent a few days optimizing it. It took 12 hours to get through 100 entries (I think because it runs a bunch of energy minimizations). It’s also possible that I was tripping a very rare error-collecting case in the OFF Toolkit (I log error data, and something there may be slow). The slowdown could also be in PDBFixer or from_pdb itself; I would need to look deeper. Overall, of the 100 I tried, 42 failed. The errors raised by the PDBFixer pathway were basically all UnassignedChemistryInPDBError.
JW – Any testing of the MDAnalysis loader?
JM – I haven’t tested with the MDA loader yet.
JW – Ok, it’d be good to test this before we invest too much in our own loader, since it may already be everything we need. We will probably still need PDBFixer to assign protonation, though.
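One way to get comparable failure counts and timings across loaders is a small harness like the sketch below. The name `tally_loaders` and the shape of the report are hypothetical, not JM's logging code. It counts outcomes by exception class name and records the slowest entry, which may help locate where the 12 hours went.

```python
import time
from collections import Counter


def tally_loaders(paths, loaders):
    """Run every loader on every path and summarize the outcomes.

    loaders: dict mapping a label to a callable that takes one path.
    Returns {label: {"outcomes": Counter, "seconds": float, "slowest": (path, s)}}.
    """
    report = {}
    for label, load in loaders.items():
        outcomes = Counter()
        total = 0.0
        slowest = (None, -1.0)
        for path in paths:
            start = time.perf_counter()
            try:
                load(path)
            except Exception as exc:
                # Group failures by exception class, e.g. "UnassignedChemistryInPDBError".
                outcomes[type(exc).__name__] += 1
            else:
                outcomes["ok"] += 1
            elapsed = time.perf_counter() - start
            total += elapsed
            if elapsed > slowest[1]:
                slowest = (path, elapsed)
        report[label] = {"outcomes": outcomes, "seconds": total, "slowest": slowest}
    return report
```

With the OpenFF Toolkit installed, a call might look like `tally_loaders(pdb_paths, {"from_pdb": Topology.from_pdb})` after `from openff.toolkit import Topology`. Other pathways, such as an MDAnalysis-based loader, could be added as further entries in the same dict.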
JW – So we’re looking at (Topology.from_pdb), (PDBFixer → Topology.from_pdb), (Josh’s loader), and (PDBFixer → MDAnalysis loader). I’m wondering whether to develop these workflows/datasets further or to move toward plots/numbers to report up.
JM – So far I’ve tested loading canonical-AA structures with explicit Hs using (PDBFixer → Topology.from_pdb) and (Josh’s loader), and reported the numbers above.
JW – Let’s look at it like the table below and think about whether we can fill it out.
JM – One tricky question is whether we can modify the topology. For example, things get much easier if we can revert NCAAs to their canonical versions, and PDBFixer supports this.
JW – Since our REAL final product will require users to come with all explicit atoms (including Hs), we should just use PDBFixer to get things to a state with all explicit atoms with minimal work. That probably means capping chains where there’s a missing loop.
JM – So maybe I should preprocess a representative sample of the PDB with PDBFixer so that we have that as our starting dataset. I could use access to larger compute for this.
JW – I’ll get you on NRP. The important thing there is not to waste compute, so be sure to use what you request. Start off with no more than 100 cores until you know what you’re doing.
(JW adds JM to NRP; will coordinate async from here on.)
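A minimal sketch of the PDBFixer preprocessing step discussed here, assuming `pdbfixer` and `openmm` are installed. The function names, the `_fixed.pdb` suffix, and the pH default are illustrative choices, not decisions from the meeting. Skipping missing loops leaves the chain break uncapped; capping would need a separate step that is not shown.

```python
from pathlib import Path


def fixed_path(in_path, out_dir):
    """Hypothetical naming scheme: <out_dir>/<input stem>_fixed.pdb."""
    return Path(out_dir) / (Path(in_path).stem + "_fixed.pdb")


def fix_to_explicit_atoms(in_path, out_dir, ph=7.0, build_missing_loops=False):
    """Write a copy of in_path with missing atoms and hydrogens added."""
    # Imported lazily so this module loads even without pdbfixer/openmm.
    from pdbfixer import PDBFixer
    from openmm.app import PDBFile

    fixer = PDBFixer(filename=str(in_path))
    fixer.findMissingResidues()
    if not build_missing_loops:
        # Do not model missing loops; the gaps stay in the structure.
        fixer.missingResidues = {}
    # Revert nonstandard residues (NCAAs) to their canonical equivalents.
    fixer.findNonstandardResidues()
    fixer.replaceNonstandardResidues()
    # Add missing heavy atoms, then hydrogens at the requested pH.
    fixer.findMissingAtoms()
    fixer.addMissingAtoms()
    fixer.addMissingHydrogens(ph)

    out_path = fixed_path(in_path, out_dir)
    with open(out_path, "w") as handle:
        PDBFile.writeFile(fixer.topology, fixer.positions, handle, keepIds=True)
    return out_path
```

Running this once over a representative sample and saving the outputs would give every downstream loader the same explicit-atom starting dataset, so differences in the results table reflect the loaders rather than the preprocessing.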
JW – I’m in Zurich for the RDKit UGM, so responses may be delayed, and I’m offline Thursday and Friday. You were a bit slow getting back to MT on two PRs, so he went ahead and merged them without your feedback. I also just saw that MT opened an issue on the failing vsites notebook in the docs. Please prioritize these collaborative/communication tasks more in the future. I think it’d be good to spend no more than 75% of your time on the PDB loading work and to prioritize things like timely PR feedback.
|