Been porting code I wrote in Python to Rust. Can load the CCD in about one minute. Determined it’s not worth keeping the whole CCD around; just download parts as you need them. Have a system that can stream from disk, so it can process the whole PDB without running out of memory. Has an API for accessing records in a structured way. Have a pathway to load a PDB file into records, process them, and finally arrive at Python.
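A minimal sketch of the streaming idea, in Python rather than the Rust being written. `AtomRecord` and `stream_atoms` are hypothetical names and not the actual loader API; the column slices assume the standard fixed-width ATOM/HETATM layout.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class AtomRecord:
    """Hypothetical structured view of one ATOM/HETATM line."""
    record: str
    serial: int
    name: str
    altloc: str
    resname: str
    chain: str
    resseq: int
    x: float
    y: float
    z: float


def stream_atoms(lines: Iterable[str]) -> Iterator[AtomRecord]:
    """Yield one record at a time so the whole file never sits in memory."""
    for line in lines:
        if not line.startswith(("ATOM  ", "HETATM")):
            continue  # other record types are skipped in this sketch
        # int()/float() raise ValueError on malformed fields, which mirrors
        # a strict loader that errors out rather than guessing.
        yield AtomRecord(
            record=line[0:6].strip(),
            serial=int(line[6:11]),
            name=line[12:16].strip(),
            altloc=line[16].strip(),
            resname=line[17:20].strip(),
            chain=line[21].strip(),
            resseq=int(line[22:26]),
            x=float(line[30:38]),
            y=float(line[38:46]),
            z=float(line[46:54]),
        )
```

Passing an open file handle (`for atom in stream_atoms(fh): ...`) keeps memory use flat, since file objects iterate line by line.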
The loader so far is pretty strict and errors out if anything ambiguous is found. So we’ll need to confront those issues at some point, e.g.:
Atom index/serial number being reused, which happens past 10(0?)k atoms and confounds CONECT records.
There’s also a charge column, but the semantics are weird - like ` ` and 0 mean the same thing, so there’s no way to represent “unknown”.
Also need to handle missing atoms/altlocs
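A rough sketch of two of these checks. `parse_charge` and `reused_serials` are hypothetical helpers, not part of the loader; the charge parser assumes the digit-then-sign convention (`2+`, `1-`) in the two-character charge field and rejects anything else.

```python
from collections import Counter
from typing import Iterable, List


def parse_charge(field: str) -> int:
    """Parse the two-character charge field, e.g. '2+' or '1-'.

    A blank field and a literal zero both come back as 0, so "unknown"
    cannot be told apart from "neutral".
    """
    text = field.strip()
    if text in ("", "0"):
        return 0
    if len(text) == 2 and text[0].isdigit() and text[1] in "+-":
        magnitude = int(text[0])
        return magnitude if text[1] == "+" else -magnitude
    # Strict by design: anything else is treated as ambiguous.
    raise ValueError(f"ambiguous charge field: {field!r}")


def reused_serials(serials: Iterable[int]) -> List[int]:
    """Return the serial numbers that occur more than once, sorted."""
    counts = Counter(serials)
    return sorted(s for s, n in counts.items() if n > 1)
```

Any serial returned by `reused_serials` makes a CONECT record that references it ambiguous, so a strict loader could refuse the file at that point.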
JW –
Let’s review big picture plans - Was under the impression that we wanted to show ad board what current loaders could do, and then use that as evidence that we want to make a new one. Also I don’t want us to ship rust code.
JM – I’m thinking the same thing, just haven’t been able to get the workings of a better implementation out of my head. Rust just seems like the right way to write a performant/optimized version of this, but the advantages of Rust don’t seem that impressive here, and Python would be fine.
JW – You and BW are really big on Rust - Maybe the employee growth assessments will be a good way to communicate this up to management - We’re supposed to be helping people grow and learn new skills and maybe that means adopting some Rust as an org. So growth assessments will be a good way to pass this up to project management.
JM – It’s an interesting language, but its benefits don’t come through in this case. If performance were super crucial then it could make sense, but that’s probably not the case here. Also deployment will be a little more complicated (though it’d be quite smooth if we deployed using pip)
Current status -
Workflow through PDBFixer - OFFTK is far too slow to do whole PDB so need to use a stride/some other sampling.
Haven’t tried using MDAnalysis guesser yet.
Pathways -
PDBFixer → Topology.from_pdb
PDBFixer → MDA guesser → RDKit mol → OFF Mol/top
New thing being written
JW – We should assume that we’re given an explicit protonation state (we don’t add Hs)
JM – Should we look up SMILES/SDF for small mols being loaded?
JW – No, in the OFF loader pathway this should just fail in the first run on this study.
JM – Is there a database of PDB files that aren’t straight from the PDB (and don’t have “correct” ligand names)?
JW – Our protein-ligand benchmark dataset, all from Schrödinger
JM – Next up, I’ll put the workflows above together and run the census on the PDB (or a representative slice thereof). I’ll try to get this to you for our next meeting.
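One way the census could be organised, as a sketch only. `run_census` is a hypothetical name; each pathway above would be wrapped in a callable that takes a file path and raises on failure, and `stride` gives the representative slice.

```python
from collections import Counter
from typing import Callable, Dict, Sequence


def run_census(
    paths: Sequence[str],
    pathways: Dict[str, Callable[[str], object]],
    stride: int = 1,
) -> Dict[str, Counter]:
    """Try every pathway on every stride-th path and tally the outcomes."""
    tallies: Dict[str, Counter] = {name: Counter() for name in pathways}
    for path in paths[::stride]:
        for name, load in pathways.items():
            try:
                load(path)
            except Exception as err:
                # Bucket failures by exception type so common causes stand out.
                tallies[name][type(err).__name__] += 1
            else:
                tallies[name]["ok"] += 1
    return tallies
```

Tallying by exception type keeps the output small enough to compare pathways side by side; per-file details could be logged separately if needed.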
JW –
Next Mon is a US holiday, then I’m in Europe the two Mondays following it (giving a talk at RDKit UGM in two weeks!). Could we skip next week’s and reschedule the following two?
Cancelled next week’s, rescheduled others to 5 PM Canberra / 9 AM Berlin
Any tips for hot new material for RDKit talk?
JM – Interchange run file exports + better interop (resnames and stuff)
Still working on FB debugging. Reported a bug to GROMACS and fix should be in upcoming release.
We currently use Docker Hub, but our Docker subscription lapsed, so I’ll need to move the QC worker Docker images and conda env YAMLs somewhere else - likely GitHub.