Day 4 Notes

Attendees:

Eastwood

L Wang

Wagner

Mobley

Cavender

Gokey

Gilson

Kumar

W Wang

JH

J Clark

AF

Morales

Thompson

Cole

Bernat

Turney

F Clark

Finlay Clark: Protein-ligand binding free energies

PhD project: absolute binding free energies

a3fe:
automatically selects restraints which maximally restrict configurational space
- tends to find stable residues and relevant protein-ligand interaction points
has some magic to pick which lambda windows to sample more and which to sample less
- aside: Gelman-Rubin (Stat. Sci. 1992) helped to identify sampling issues in some lambda windows
also some work on “robust truncation point selection”
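
A minimal sketch of the Gelman-Rubin diagnostic mentioned above, assuming several equal-length repeat runs of one lambda window. It uses a common simplified form of R-hat (no degrees-of-freedom correction) and is not taken from a3fe; the synthetic data and variable names are illustrative only.

```python
import random
from statistics import mean, variance


def gelman_rubin(chains):
    """Simplified potential scale reduction factor (R-hat) for equal-length chains."""
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    chain_means = [mean(c) for c in chains]
    w = mean(variance(c) for c in chains)  # mean within-chain variance
    b = n * variance(chain_means)          # between-chain variance
    var_plus = (n - 1) / n * w + b / n     # pooled variance estimate
    return (var_plus / w) ** 0.5


rng = random.Random(0)
# Hypothetical data: four repeats sampling the same distribution...
agree = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(4)]
# ...and the same set with one repeat stuck in a different state.
stuck = agree[:3] + [[rng.gauss(3.0, 1.0) for _ in range(500)]]

r_agree = gelman_rubin(agree)
r_stuck = gelman_rubin(stuck)
```

Values close to 1 suggest the repeats agree; values well above 1 flag a window where at least one repeat samples something different.
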
MG: What are the error bars in slide 31?
- FC: Errors of errors, from bootstrapping
J Clark: What about bimodal distribution?
- FC: Probably picking a local minimum in very messy data. Hard to work around data that’s so noisy that the autocovariance function is itself noisy
J Clark: Do you ever detect a second equilibration time?
- FC: Can never really detect that
MG: We don’t do any subsampling before we estimate the free energy. Could use that for bootstrapping, but we don’t do that now.
MG – Slide 31, second case. PDE2a blue curve shows initial bias dropping off quickly. Why doesn’t autocorrelation tell you that the bias decayed within 3 ns? Why would it push you beyond that?
FC – We had an informative example where we ran this with no bias, and found that it discarded a lot of the sample. … A lot of this is because, if there’s a local minimum at the end of your data, then that will heavily affect the metrics of convergence/truncation time.
JW – Plans for future communications? Newcastle meeting will connect OpenFF and FClark. What’s a good way to keep in touch with OpenFE?
- DC – Good Q. Will start getting involved in alchemiscale as well as a bridge to OpenFE
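
A rough sketch of one generic way to pick a truncation (equilibration) point: scan candidate start times and keep the one that maximises the effective sample size of the remaining data. This is not necessarily the method FC presented. The first-non-positive cutoff for the autocorrelation sum, the function names, and the synthetic series are all assumptions made for illustration. It also shows why noisy autocovariance estimates matter, since the chosen point depends directly on them.

```python
import math
import random


def statistical_inefficiency(x):
    """Estimate g = 1 + 2 * sum(rho), stopping at the first non-positive rho."""
    n = len(x)
    mu = sum(x) / n
    d = [v - mu for v in x]
    c0 = sum(v * v for v in d) / n
    if c0 == 0.0:
        return 1.0
    g = 1.0
    for lag in range(1, n - 1):
        c = sum(d[i] * d[i + lag] for i in range(n - lag)) / (n - lag)
        rho = c / c0
        if rho <= 0.0:
            break
        g += 2.0 * rho * (1.0 - lag / n)
    return max(g, 1.0)


def pick_truncation(x, stride=30, min_keep=60):
    """Return (t0, n_eff) maximising the effective sample size of x[t0:]."""
    best = (0, 0.0)
    for t0 in range(0, len(x) - min_keep, stride):
        tail = x[t0:]
        n_eff = len(tail) / statistical_inefficiency(tail)
        if n_eff > best[1]:
            best = (t0, n_eff)
    return best


# Synthetic series: decaying initial bias plus correlated (AR(1)) noise.
rng = random.Random(1)
noise, series = 0.0, []
for t in range(600):
    noise = 0.5 * noise + rng.gauss(0.0, 1.0)
    series.append(10.0 * math.exp(-t / 40.0) + noise)

t0, n_eff = pick_truncation(series)
n_eff_all = len(series) / statistical_inefficiency(series)
```

Keeping the biased start inflates the estimated correlation time, so discarding it raises the effective sample size even though fewer frames remain.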

Hannah Turney: Formulations

Collaboration with J&J
Excipients are compounds that are not the active drug but are important for delivery etc
Polymer structure & properties affect drug delivery behavior
Built for automating polymer building with a bunch of knobs
Did a case study in PLGA
Software bottlenecks in “molecule” (polymer) handling make OpenFF intractable to use
JW: Software bottleneck is something OpenFF team should handle. Your workaround is probably something we shouldn’t need to do
DC: Interested in Espaloma result
MT – I see NAGL agrees better with AM1BCC than espaloma. But is there an experimental definition of “correct” that we can compare to?
- HT – Right, this is just an early exploration.
MG: These charges look surprising to me, I’d expect carbonyl charges to be <1
JW: Long-term, what sort of systems are you building? Put a ligand in a box with polymers …. what do you want to work toward?
- HT - Will look at diffusivity, solubility, and other properties of the drug molecule. Will also look at “points of hydrolysis”, the points at which polymers break down. Build a broken-down polymer and see how differently the API behaves
JW: Do we think our force fields will handle these crazy chemistries well (e.g. low pH of stomach acid)?
- MG: The idea seems to be to model the effect of the break down, not the process.
- HT: Yes
- MG (+ others): Would expect the force fields to do a good job. Haven’t looked at it specifically.
CC: How are you building these systems? Packmol?
- HT – Polyply, requires GROMACS inputs, but Interchange gets me those. Could be used for other purposes but would introduce a GROMACS dependency.
- MT – Polyply could be a good thing to investigate for packing in general, maybe could get folded into an example if it works well
J Clark: How do you control the tacticity? Can you control tacticity with respect to different copolymer components?
- HT: This is an input into swiftpol

Chapin: Protein force field update

We used to think we were going to use library charges and had some problems with them. I think now we should use NAGL and this problem is solved.
- Had some issues with charges on caps mucking with charges on residues a long way away from the caps
- NAGL does not do this. So I think we should ditch library charges and use NAGL in the protein force field.
  - Minor “issue” with NAGL is that some residues don’t have integral charges. But maybe this is okay since that’s really what happens in nature.
- JW: It’s nice to see (elsewhere?) that ff14SB and NAGL charges are usually pretty similar. But we have a philosophy of using high-quality QM for small molecules, is that really what we should be doing for proteins?
  - CC: (Mixed bag). Yes this is an issue
- JW: We messed around with implicit solvent but some of the QM data was bad.
  - DC: Issue was that we couldn’t run with a diffuse basis set (cloud of smoke). This appears to be resolved
    - PK: This is a Psi4 issue
  - DM: Could always use better theory, but we’re currently using good theory. Can always revisit in the future
  - MT: Are you saying “I am switching over from library charges to NAGL?” If you use NAGL and it works for the protein force field, then every future generation of NAGL has to maintain performance on the protein force field.
  - JW: In some ways this is de-risked by the recent decision to treat NAGL like any other charge provider, where we can change the implementation in a new version and it will break performance with older versions of the force field. So we are already going to pin a force field to a NAGL model.
  - MT: The sooner that can be encoded, the better, so we don’t get confused
- LW: Do you anticipate having issues with charges that fluctuate a little bit between residues (i.e. not fixed per-residue)?
  - CC: Could be an issue, but I think this is actually a GOOD thing
  - MG: I thought the variance was low?
    - LW: We looked at it and it was low. Only varies based on bond environment, not 3D structure
  - MG: Is this variance physically interpretable?
  - DM: Essentially NAGL has learned electronegativity and hardness
  - MG: Interesting, not necessarily scary. Unsure if it’s physically reasonable
    - LW: Variation is lower than AM1-BCC, higher (by definition) than template-based charges
  - CC: About generalizing from proteins to polymers. This could be an advantage of mixing charges in vacuum and implicit solvent as, e.g. in HT’s project one could tune or pick mixing parameters for particular applications
    - JW: sounds good as long as the FF is fit with a recommended mixing parameter and users can tweak that as necessary
  - JCl: We should benchmark any changes from recommended FFs. It might not be valid for the rest of the parameters
    - CC: agree. Might be of interest to the Matta group.
    - LW: think we’ll probably put out a FF with recommended settings and there will have to be a very high bar for users to validate their own projects before things change
    - JW: agree, …
    - CC: agree, just that our FFs will assume a particular mean polarisation aimed at our datasets of interest (primarily aqueous), this just allows people to fiddle around for their own custom data
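
A minimal sketch of what a tunable charge-mixing parameter could look like, assuming the mixing is a simple linear interpolation between per-atom vacuum and implicit-solvent charges. The notes do not specify the functional form, and the function name and example charges below are hypothetical.

```python
def mix_charges(q_vacuum, q_solvent, mix):
    """Interpolate per-atom charges: mix=0 gives vacuum, mix=1 gives solvent."""
    if len(q_vacuum) != len(q_solvent):
        raise ValueError("charge lists must have the same length")
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    return [(1.0 - mix) * qv + mix * qs for qv, qs in zip(q_vacuum, q_solvent)]


# Hypothetical three-site molecule; the values are placeholders.
q_vac = [-0.68, 0.34, 0.34]
q_sol = [-0.84, 0.42, 0.42]

recommended = mix_charges(q_vac, q_sol, 0.5)  # a fixed default shipped with the FF
custom = mix_charges(q_vac, q_sol, 0.8)       # a user-tuned value for other media
```

Because the interpolation is linear, the net charge is preserved whenever both input sets carry the same net charge.
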
CC: Issues with default ForceBalance settings when treating the protein just like any other molecule. Protein parameters didn’t really change much. Resolved this by heavily up-weighting protein data and plan to do this for the “first” protein force field. Think we should turn this off for subsequent force fields
- MT: How does smee relate to this?
- DC: Might see similar issues if using smee. Not sure. Lots of remaining questions about how this would work (data sets, weighting of different reference data)
- CC: Tried this out, performed okay. Found that dropping OpenMM minimization (FB TorsionProfile target to FB AbInitio target) was faster but less accurate, but using a custom objective function inspired by ff14SB was faster and equally accurate.
- Some issues with the parameter surface for polymers vs. small molecule QM fits
  - Topic for spin-off research: use different optimizers for these fits
- …
CC: Plausible to me that some decisions around periodicity that were made for ff14SB aren’t appropriate for OpenFF. One of the mutations in the genetic algorithm involved randomly changing periodicity. Another is changing phase from 0 to 180. Doing this would be interesting. Might help small molecule parameters as well
- TG: Is the trick that Amber only allows positive k?
- CC: That’s true for periodicity of 1. Not 2+
- DM: Does this relate to Danny’s observed issues with periodicities with smee?
  - DC: Kinda
- CC: We could start with many periodicities, do LASSO regression to prune terms that aren’t helpful, then do a second stage optimization with the non-zero LASSO terms.
- MT: …
- CC: was bringing this up because not sure how much the optimization strategy is adequately traversing the landscape. Any gradient based optimizer would likely have the same problem.
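
A toy sketch of the two-stage idea CC describes: start with many periodicities, use LASSO to prune, then refit only the surviving terms. It fits signed cosine coefficients to a synthetic 1-D torsion profile with a small coordinate-descent solver. It is not how ForceBalance or smee fit parameters, and the data, penalty strength, and function names are assumptions. A negative coefficient here is equivalent to a positive one with a phase of 180 degrees.

```python
import math
import random


def soft_threshold(z, alpha):
    return math.copysign(max(abs(z) - alpha, 0.0), z)


def lasso_cd(columns, y, alpha, sweeps=200):
    """Coordinate descent for (1/2N)||y - Xk||^2 + alpha * ||k||_1."""
    n = len(y)
    k = [0.0] * len(columns)
    resid = list(y)  # residual y - Xk, kept up to date
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            norm = sum(c * c for c in col) / n
            # Correlate column j with the residual that excludes term j.
            rho = sum(c * (r + c * k[j]) for c, r in zip(col, resid)) / n
            new = soft_threshold(rho, alpha) / norm
            delta = new - k[j]
            if delta != 0.0:
                resid = [r - c * delta for c, r in zip(col, resid)]
            k[j] = new
    return k


# Synthetic profile built from periodicities 1 and 3 plus noise.
rng = random.Random(0)
angles = [math.radians(a) for a in range(-180, 180, 15)]
periodicities = [1, 2, 3, 4, 5, 6]
true_k = {1: 1.5, 3: 0.8}
energies = [sum(k * math.cos(n * phi) for n, k in true_k.items())
            + rng.gauss(0.0, 0.05) for phi in angles]
offset = sum(energies) / len(energies)
y = [e - offset for e in energies]
columns = [[math.cos(n * phi) for phi in angles] for n in periodicities]

# Stage 1: LASSO prunes unhelpful periodicities.
stage1 = lasso_cd(columns, y, alpha=0.1)
kept = [n for n, k in zip(periodicities, stage1) if k != 0.0]

# Stage 2: unpenalised refit (alpha=0) using only the surviving terms.
stage2 = lasso_cd([columns[periodicities.index(n)] for n in kept], y, alpha=0.0)
refit = dict(zip(kept, stage2))
```

The second stage matters because the LASSO penalty shrinks the surviving coefficients toward zero; refitting without the penalty removes that bias.
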
CC: Benchmarking. My strategy is to look at bigger systems with more degrees of freedom. Drawback: More expensive to run simulations, get observables. Trying to shortcut this by running different tiers of benchmarks
- splitting out into “quicker” benchmarks that short-circuit large issues for triage but longer-running benchmarks for the “whole” benchmark set
- some larger proteins (MD is still slow)
- some smaller peptides (quicker?)
- Think this hierarchical setup is a good candidate for other polymer systems
JW: (recaps packmol discussion from yesterday) do you have a wishlist of features in system set up for benchmarking?
- CC: treatment of protonation states. I’m using pdb2pqr now, but it’s optimised for proteins. This matters for benchmarks. pKa prediction for small molecules (and large?) would be great.
- DM: that would be a huge project that would probably not be within OpenFF/E scope.
- JW: we can streamline use of an existing tool like pdb2pqr
- DM: I think OMSF is going to own pdb2pqr temporarily
- CC: system set up will be very system dependent.
- MT: what’s the audience for this? Public-facing or expert users? Can build specific stuff more easily for internal use
- MG/JW: clarification that protonation states must be loaded into the tool.
- DM: Our PDB loader hopefully makes it easier to use external protonation state tools since they do weird things with hydrogens and atom names. And users can define these themselves
  - JW: There will be some shims for very-not-in-the-spec things that users want to do
CC: General point: for protein benchmarks, need to run things as close to experimental settings as possible. Temperature, pressure, pH, salt concentration, (for membranes) surface tension
- JW: For you (CC) we know you can do the system prep well yourself. Not always the case with external users who need to do wider use cases. Hesitant to bake stuff in until it’s really part of a final study
- DM: Some other context-dependent parameters, like where exactly you put the ligand
- CC: I have some code that does some of this
  - Basically a big dictionary of metadata
  - JW: Wonder if this setup would be useful for Hannah? HT: Sure
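
A small sketch of what a "big dictionary of metadata" for benchmark systems might look like, combined with the quick/full tiers discussed above. CC's actual code was not shown, so every key, system name, and value here is a hypothetical placeholder.

```python
REQUIRED_CONDITIONS = ("temperature_K", "pressure_bar", "pH", "salt_molar")

# Hypothetical entries: system names and values are placeholders only.
BENCHMARKS = {
    "peptide_A": {"tier": "quick", "temperature_K": 298.15, "pressure_bar": 1.0,
                  "pH": 7.0, "salt_molar": 0.10},
    "protein_B": {"tier": "full", "temperature_K": 300.0, "pressure_bar": 1.0,
                  "pH": 6.5, "salt_molar": 0.15},
    "protein_C": {"tier": "full", "temperature_K": 310.0, "pH": 7.4},
}


def missing_conditions(entry):
    """List the experimental conditions an entry does not specify."""
    return [key for key in REQUIRED_CONDITIONS if key not in entry]


def select_tier(benchmarks, tier):
    """Return the sorted names of all systems in the requested tier."""
    return sorted(name for name, meta in benchmarks.items() if meta["tier"] == tier)


quick = select_tier(BENCHMARKS, "quick")
incomplete = {name: missing_conditions(meta)
              for name, meta in BENCHMARKS.items() if missing_conditions(meta)}
```

Checking for missing conditions up front catches systems that would otherwise be simulated under default settings rather than the experimental ones.
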
AF: Did benchmarking on a bunch of proteins. Kept seeing secondary structure issues. CC tried getting around this by fitting to NMR. We tried (something with the PDB). Developing hypothesis that the issue is training data, not getting enough helical turns and some issues with inter-residue contacts. Doesn’t seem like the issue is how things are being fit
- MG: Status of new QM data?
- AF: Waiting for new QM data to be finished.
- AF: We picked a few thousand structures that represented diversity in the PDB. Something about getting a bunch of torsional something. Looked for simplified sequences ?. Selected 200 unique 4mers with a good diversity in torsional space. Conformers centroids of torsional space. 1000 conformers from 200 sequences. Had to write code to cap without messing with torsions. This dataset is optimizations, not torsion drives.
- How long will this take? J Clark: 10 days
- MG: … seems excited. Some ideas about improving quality of QM data, how restraints are applied or something
- … timeline of adding this into a refit?
- CC: Could add them in immediately

Polymer discussion

JW: Slide(s) on defining common terms

Build topology/coordinates
Ingest into OpenFF
Parameterize & output to MD engines
Run simulations
Analyze
Build a polymer FF

AF: Should we start benchmarking polymers? Have somebody in Shirts group that could do this.
DM: General attitude is that we don’t have resources to work on stuff, but would be happy to see new data

JW: What’s going on with MuPT?

TB: Lots of stuff. Want to use atom-typed force fields, custom residues, …

JW: Can load polymers if you define monomers with residue name and atom name and some other stuff. This uses openff-pablo. …

JW: What’s the biggest bottleneck in polymer simulations?

JH: Equilibration, but only because dynamics are slow. This is a force field problem, not an infrastructure problem

HT: Molecule de-duplication issue I reported earlier

MRS: context for polymer discussion (since I’ll be teaching)

MuPT - funded collaboration for general tools for setting up polymer simulations

This covers all soft materials (including handling both CG and AA), but definitely includes polymer/protein interactions. Relevant touchpoints with OpenFF.

Box of proteins + polymers
Crosslinked proteins/polymers (PEGylation, attaching fatty acids)
Glycoproteins
Bringing in PDBs from however they are generated and parameterizing
Turney/Matta formulations project also shows utility of polymers in drug design
We want to support non-OpenFF force fields as well, via Foyer

A lot of these MuPT needs overlap with tool requirements for OpenFF. Proteins/nucleic acids/glycoproteins are a subset of polymers . . . What are those overlaps/opportunities for shared tooling? DISCUSS!