Atlassian uses cookies to improve your browsing experience, perform analytics and research, and conduct advertising. Accept all cookies to indicate that you agree to our use of cookies on your device.
Atlassian uses cookies to improve your browsing experience, perform analytics and research, and conduct advertising. Accept all cookies to indicate that you agree to our use of cookies on your device. Atlassian cookies and tracking notice, (opens new window)
Hi Hannah, Sorry to drop the ball on this. I just looked at this with Josh (cc'ed) and three things pop up - For Pablo to work:
Atoms in residues must follow an atom naming convention (since Pablo currently just looks up the bonding information from the residue+atom names)
The atoms in a given residue must be contiguous in the PDB file (in the polymer.pdb file, they aren't)
The ResidueDefinitions must have some atoms marked as leaving (Pablo expects all atoms that aren’t marked as leaving to be present, even the H's that leave when polymerization occursso if the molecule doesn’t have anything marked leaving it’s effectively a “capped” residue)
Some high level thoughts - If the purpose of using Pablo in this context is to get coordinates from the result of a GROMACS calculation, it's probably easiest to have an OpenFF Topology/Interchange made beforehand (made directly from the RDMols that come out of SwiftPol), then run the steps that will modify their coordinates (GROMACS minimization, if I recall correctly?), then read the coordinates that come out of that step separately and copy them into the Topology/Interchange (should be possible using OpenMM GROMACS/PDB file reader or Interchange's GROMACS loading support). I know you'd had problems with this route before (Josh suspects maybe the box size changed) but debugging the coordinate-copying route is likely to be far easier than having swiftpol support consistent atom naming across residues.
Anyway, I think it'd be best to work on this via screenshare. Do you have availability in the next week or so? I'm in California and Josh is in Australia, so we have some availability in your morning/afternoon. Maybe this Friday at 8 AM US Pacific/4 PM UK? If not, let us know when else you might be available.
JE - Put proposed next step first - Start with “I think the best way forward is screenshare, let’s coordinate a time with me and Josh”
MT – Agree, lead with “let’s have a meeting”, and put the dense ideas in a “here are the ideas I have”, soften the expectation that the recipient will be able to grok everything.
JE – Probably worth having the dense stuff in there, but it should go AFTER the “here’s what you should do and why”. The “this is really dense and complex” part is supporting the “we should meet” message
Matt’s
Chapin
Original (Slack message)
Following up on Tier 3 protein force field benchmarks. I've added new targets for 11 folded proteins and 2 disordered proteins to the main branch of the proteinbenchmark repo here: https://github.com/openforcefield/proteinbenchmark . I've also set up a Confluence page where we can track who is responsible for simulating each system: Protein benchmark systemsPreview .For the 11 folded proteins, I selected this subset from 40 proteins solved using both NMR and x-ray crystallography by the NorthEast Structural Genomics Consortium (NESG): https://pubs.acs.org/doi/10.1021/ja409845w . For the full set of 40 proteins, I classified the tertiary structure using the hierarchical CATH classifier: https://www.cathdb.info/wiki . There were 13 unique folds at the second level of the CATH hierarchy (which considers relative placement of secondary structures but not connectivity in primary sequence). Two of these folds were already covered in the Tier 2 folded proteins. For the remaining 11 folds, I selected the protein with the fewest residues for the new set of benchmark targets. The starting model is the NMR model deposited in the PDB by the NESG.For the 2 disordered proteins, I selected amyloid beta 1-40 and alpha-synuclein as common benchmark systems for IDP force fields. The naive extended starting state (all backbone dihedrals at 180 deg plus 1.4 nm solvent padding) that I've used for short peptides produces enormous water boxes (larger than GB3 by 40-fold for AB40 and 1000-fold for alpha-synuclein), so I'm going to generate more compact starting structures consistent with an estimated Rg using IDPConformerGenerator from Julie Forman-Kay: https://github.com/julie-forman-kay-lab/IDPConformerGenerator . This will take a bit of time to implement.
In the meantime, the folded proteins are ready to simulate. I think we should start running benchmarks with ff14sb-tip3p (TIP3P is co-tuned with ff14SB parameters), ff14sb-opc3 (OPC3 is co-tuned with our OpenFF NMR parameters), and null-0.0.3-pair-nmr-1e4-umbrella-1e3-4-opc3 (this is the best candidate, Null-US-UB-1E3-4, in my recent slides). I've checked that I can build and minimize the new targets in OpenMM on my Linux workstation.If you want to volunteer to run benchmark simulations of some of these targets, add your name to the Simulator column in the table for Tier 3 on the Confluence page. If you don't have access to edit the page, let me know what systems you want to run in a thread here. Run three replicas for each force field and water model combination. We decided that we're okay running targets in different MD engines, but we do want to simulate all force fields for a given target with the same engine. Volunteering for a target means you are also committing to running simulations for that target for other force fields in the future (e.g. future release candidates).To run the benchmark simulations, clone the proteinbenchmark repo above. Then run the following to install.
You can also use the proteinbenchmark-simulation.yaml environment to build a smaller environment without setup or analysis tools for a cluster where you will only run dynamics for an already built system. Then, using the attached python script, setup a benchmark simulation using python run-protein-benchmarks.py -f $FORCE_FIELD -t $TARGET setup $OUTPUT_DIR where $FORCE_FIELD is the force field name including water model, e.g. ff14sb-tip3p, $TARGET is the string in the "Short name" column on the Confluence page, e.g. ccr55, and $OUTPUT_DIR is the relative path to the output directory where results will be written. Run dynamics using python run-protein-benchmarks.py -f $FORCE_FIELD -t $TARGET -r $REPLICA run $OUTPUT_DIR where $REPLICA is an integer specifying an index for independent replicas with different random seeds. Run analysis using python run-protein-benchmarks.py -f $FORCE_FIELD -t $TARGET -r $REPLICA analyze $OUTPUT_DIR.Let me know if there are any questions about the new targets or about using the proteinbenchmark repo. cc: Anika Friedman and Julianne Hoeflich.
Revised Following up on Tier 3 protein force field benchmarks: the folded proteins are ready to simulate. At the last force field meeting, we decided that the following people would help run these (Anika Friedman and Julianne Hoeflich). I have prepared the inputs and you can find instructions on how to run them here: Protein benchmark systemsPreview . Anika and Julianne, do you need anything from me before you start running?
Josh’s
This speedup is great! I'm worried about hash collisions - this implementation does not guarantee that different molecules will compare unequal. I've done some testing (see below) and so I’m BLOCKING on either:
assigning _molecule_atom_indices in Molecule._add_atom() and then comparing tuples of the relevant data in _is_exactly_the_same_as (see proposed edits in other comments),
OR using a collision-resistant hash function like hashlib.sha3_224 (this is a one-line change)
Issues with builtin hash()
I think the approach of checking for equality by comparing Python builtin hash()es of the relevant information is fraught:
I think the chances of a hash collision are worth taking seriously - partly because the hash, on my machine, is only 64 bits long, much shorter than the string that would be fed into it, but more because it would be such an infuriating and embarrassing bug to run into in the unlikely case that it came up.
Testing Results
I've done some testing and it seems that the vast majority of the speedup comes from looping over the atoms and assigning their _molecule_atom_index attribute. Once that's done, the difference between comparing hashes or just comparing the big id string that gets hashed is basically nil. I've also found that comparing tuples of the attributes you want to compare is much faster than iterating over the actual attributes; in other words,
def _is_exactly_the_same_as(self, other):
self_id = (
tuple((atom.atomic_number, atom.formal_charge.magnitude, atom.stereochemistry) for atom in self.atoms),
tuple((bond.bond_order, bond.stereochemistry, bond.atom1_index, bond.atom2_index) for bond in self.bonds),
)
other_id = (
tuple((atom.atomic_number, atom.formal_charge.magnitude, atom.stereochemistry) for atom in other.atoms),
tuple((bond.bond_order, bond.stereochemistry, bond.atom1_index, bond.atom2_index) for bond in other.bonds),
)
return self_id == other_id
Is about 5x faster (in this case where the molecules are large and the same) than
def _is_exactly_the_same_as(self, other):
for atom1, atom2 in zip(self.atoms, other.atoms):
if (
(atom1.atomic_number != atom2.atomic_number)
or (atom1.formal_charge != atom2.formal_charge)
or (atom1.is_aromatic != atom2.is_aromatic)
or (atom1.stereochemistry != atom2.stereochemistry)
):
return False
for bond1, bond2 in zip(self.bonds, other.bonds):
if (
(bond1.atom1_index != bond2.atom1_index)
or (bond1.atom2_index != bond2.atom2_index)
or (bond1.is_aromatic != bond2.is_aromatic)
or (bond1.stereochemistry != bond2.stereochemistry)
):
return False
return True
In fact, that first implementation if _is_exactly_the_same_as() is just as fast as computing and comparing the hashes if the molecule atom indices are already assigned.
Shouts-out
@James Eastwood
JW – Shout out to the whole team for taking time off this summer! It’s great to see that things keep moving forward even when key people are offline.