Attendees:

Eastwood

L Wang

Wagner

Mobley

Cavender

Gokey

Gilson

Kumar

W Wang

JH

J Clark

AF

Morales

Thompson

Cole

Bernat

Turney

F Clark


Finlay Clark: Protein-ligand binding free energies

PhD project: absolute binding free energies

a3fe:
automatically selects restraints which maximally restrict configurational space
- tends to find stable residues and relevant protein-ligand interaction points
has some magic to pick which lambda windows to sample more and which to sample less
- aside: Gelman-Rubin (Stat. Sci. 1992) helped to identify sampling issues in some lambda windows
also some work on “robust truncation point selection”
MG: What are the error bars in slide 31?
- FC: Errors of errors, from bootstrapping
J Clark: What about bimodal distribution?
- FC: Probably picking a local minimum in very messy data. Hard to work around data that’s so noisy that the autocovariance function is itself noisy
J Clark: Do you ever detect a second equilibration time?
- FC: Can never really detect that
MG:
We don’t do any subsampling before we estimate free energy. Could use that for bootstrapping, but we don’t do that now.
MG – Slide 31, second case. PDE2a blue curve shows initial bias dropping off quickly. Why doesn’t autocorrelation tell you that bias decayed off in 3ns? Why would it push you beyond that?
FC – We had an informative example where we ran this with no bias, and found that it discarded a lot of the sample. … A lot of this is because, if there’s a local minimum at the end of your data, then that will heavily affect the metrics of convergence/truncation time.
JW – Plans for future communications? Newcastle meeting will connect OpenFF and FClark. What’s a good way to keep in touch with OpenFE?
- DC – Good Q. Will start getting involved in alchemiscale as well as a bridge to OpenFE
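
A minimal sketch of the Gelman-Rubin diagnostic mentioned in the a3fe notes above, written with the Python standard library only. It computes the basic potential scale reduction factor from several equal-length series. Treating repeat runs of one lambda window as the "chains" is an assumption made for illustration, not a description of how a3fe applies it, and the function name and toy data are hypothetical.

```python
from statistics import mean, variance


def gelman_rubin(chains):
    """Basic potential scale reduction factor (R-hat) for equal-length chains.

    This is the simple form without the sampling-variability correction.
    """
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least 2 chains of equal length >= 2")

    # W: average of the within-chain sample variances
    within = mean(variance(c) for c in chains)
    if within == 0:
        raise ValueError("within-chain variance is zero")

    # B/n: sample variance of the chain means
    between_over_n = variance([mean(c) for c in chains])

    # Pooled variance estimate, then the ratio to the within-chain variance
    pooled = (n - 1) / n * within + between_over_n
    return (pooled / within) ** 0.5


# Hypothetical toy data: two runs that agree, and two that sit at different values
agreeing = [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
disagreeing = [[0.0, 1.0, 0.0, 1.0], [10.0, 11.0, 10.0, 11.0]]
print(gelman_rubin(agreeing), gelman_rubin(disagreeing))
```

Values well above 1 suggest the runs have not sampled the same distribution. Any cutoff (1.1 is a common rule of thumb) is a judgment call.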

Hannah Turney: Formulations

Collaboration with J&J
Excipients are compounds that are not the active drug but are important for delivery etc
Polymer structure & properties affect drug delivery behavior
Built a tool for automating polymer building with a bunch of knobs
Did a case study in PLGA
Software bottlenecks in “molecule” (polymer) handling make OpenFF intractable to use
JW: Software bottleneck is something OpenFF team should handle. Your workaround is probably something we shouldn’t need to do
DC: Interested in Espaloma result
MT – I see NAGL agrees better with AM1BCC than espaloma. But is there an experimental definition of “correct” that we can compare to?
- HT – Right, this is just an early exploration.
MG: These charges look surprising to me, I’d expect carbonyl charges to be <1
JW: Long-term, what sort of systems are you building? Put a ligand in a box with polymers …. what do you want to work toward?
- HT - Will look at diffusivity, solubility, properties of the drug molecule. And look at “points of hydrolysis”, i.e. points on polymers at which polymers break down. Build a broken-down polymer and see how differently the API behaves
JW: Do we think our force fields will handle these crazy chemistries well (e.g. low pH of stomach acid)?
- MG: The idea seems to be to model the effect of the breakdown, not the process.
- HT: Yes
- MG (+ others): Would expect the force fields to do a good job. Haven’t looked at it specifically.
CC: How are you building these systems? Packmol?
- HT – Polyply, requires GROMACS inputs, but Interchange gets me those. Could be used for other purposes but would introduce a GROMACS dependency.
- MT – Polyply could be a good thing to investigate for packing in general, maybe could get folded into an example if it works well
J Clark: How do you control the tacticity? Can you control tacticity with respect to different copolymer components?
- HT: This is an input into swiftpol

Chapin: Protein force field update

We used to think we were going to use library charges and had some problems with them. I think now we should use NAGL and this problem is solved.
- Had some issues with charges on caps mucking with charges on residues a long way away from the caps
- NAGL does not do this. So I think we should ditch library charges and use NAGL in the protein force field.
  - Minor “issue” with NAGL is that some residues don’t have integral charges. But maybe this is okay since that’s really what happens in nature.
- JW: It’s nice to see (elsewhere?) that ff14SB and NAGL charges are usually pretty similar. But we have a philosophy of using high-quality QM for small molecules, is that really what we should be doing for proteins?
  - CC: (Mixed bag). Yes this is an issue
- JW: We messed around with implicit solvent but some of the QM data was bad.
  - DC: Issue was that we couldn’t run with diffuse basis set (cloud of smoke) this appears to be resolved
    - PK: This is a Psi4 issue
  - DM: Could always use better theory, but we’re currently using good theory. Can always revisit in the future
  - MT: Are you saying “I am switching over from library charges to NAGL?” If you use NAGL and it works for the protein force field, then every future generation of NAGL has to maintain performance on the protein force field.
  - JW: In some ways this is de-risked by the recent decision to treat NAGL like any other charge provider, where we can change the implementation in a new version and it will break performance with older versions of the force field. So we are already going to pin a force field to a NAGL model.
  - MT: The sooner that can be encoded, the better, so we don’t get confused
- LW: Do you anticipate having issues with charges that fluctuate a little bit between residues (i.e. not fixed per-residue)?
  - CC: Could be an issue, but I think this is actually a GOOD thing
  - MG: I thought the variance was low?
    - LW: We looked at it and it was low. Only varies based on bond environment, not 3D structure
  - MG: Is this variance physically interpretable?
  - DM: Essentially NAGL has learned electronegativity and hardness
  - MG: Interesting, not necessarily scary. Unsure if it’s physically reasonable
    - LW: Variation is lower than AM1-BCC, higher (by definition) than template-based charges
  - CC: About generalizing from proteins to polymers. This could be an advantage of mixing charges in vacuum and implicit solvent as, e.g. in HT’s project one could tune or pick mixing parameters for particular applications
    - JW: sounds good as long as the FF is fit with a recommended mixing parameter and users can tweak that as necessary
  - JCl: We should benchmark any changes from recommended FFs. It might not be valid for the rest of the parameters
    - CC: agree. Might be of interest to the Matta group.
    - LW: think we’ll probably put out a FF with recommended settings and there will have to be a very high bar for users to validate their own projects before things change
    - JW: agree, …
    - CC: agree, just that our FFs will assume a particular mean polarisation aimed at our datasets of interest (primarily aqueous), this just allows people to fiddle around for their own custom data
CC: Issues with default ForceBalance settings when treating the protein just like any other molecule. Protein parameters didn’t really change much. Resolved this by heavily up-weighting protein data and plan to do this for the “first” protein force field. Think we should turn this off for subsequent force fields
- MT: How does smee relate to this?
- DC: Might see similar issues if using smee. Not sure. Lots of remaining questions about how this would work (data sets, weighting of different reference data)
- CC: Tried this out, performed okay. Found that dropping OpenMM minimization (FB TorsionProfile target to FB AbInitio target) was faster but less accurate, but using a custom objective function inspired by ff14SB was faster and equally accurate.
- Some issues with the parameter surface for polymers vs. small molecule QM fits
  - Topic for spin-off research: use different optimizers for these fits
- …
CC: Plausible to me that some decisions around periodicity that were made for ff14SB aren’t appropriate for OpenFF. One of the mutations in the genetic algorithm involved randomly changing periodicity. Another is changing phase from 0 to 180. Doing this would be interesting. Might help small molecule parameters as well
- TG: Is the trick that Amber only allows positive k?
- CC: That’s true for periodicity of 1. Not 2+
- DM: Does this relate to Danny’s observed issues with periodicities with smee?
  - DC: Kinda
- CC: We could start with many periodicities, do LASSO regression to prune terms that aren’t helpful, then do a second stage optimization with the non-zero LASSO terms.
- MT: …
- CC: was bringing this up because not sure how much the optimization strategy is adequately traversing the landscape. Any gradient based optimizer would likely have the same problem.
CC: Benchmarking. My strategy is to look at bigger systems with more degrees of freedom. Drawback: More expensive to run simulations, get observables. Trying to shortcut this by running different tiers of benchmarks
- splitting out into “quicker” benchmarks that short-circuit large issues for triage but longer-running benchmarks for the “whole” benchmark set
- some larger proteins (MD is still slow)
- some smaller peptides (quicker?)
- Think this hierarchical setup is a good candidate for other polymer systems
JW: (recaps packmol discussion from yesterday) do you have a wishlist of features in system set up for benchmarking?
- CC: treatment of protonation states. I’m using pdb2pqr now, but it’s optimised for proteins. This matters for benchmarks. pKa prediction for small molecules (and large?) would be great.
- DM: that would be a huge project that would probably not be within OpenFF/E scope.
- JW: we can streamline use of an existing tool like pdb2pqr
- DM: I think OMSF is going to own pdb2pqr temporarily
- CC: system set up will be very system dependent.
- MT: what’s the audience for this? Public-facing or expert users? Can build specific stuff more easily for internal use
- MG/JW: clarification that protonation states must be loaded into the tool.
- DM: Our PDB loader hopefully makes it easier to use external protonation state tools since they do weird things with hydrogens and atom names. And users can define these themselves
  - JW: There will be some shims for very-not-in-the-spec things that users want to do
CC: General point: for protein benchmarks, need to run things as close to experimental settings as possible. Temperature, pressure, pH, salt concentration, (for membranes) surface tension
- JW: For you (CC) we know you can do the system prep well yourself. Not always the case with external users who need to do wider use cases. Hesitant to bake stuff in until it’s really part of a final study
- DM: Some other context-dependent parameters, like where exactly you put the ligand
- CC: I have some code that does some of this
  - Basically a big dictionary of metadata
  - JW: Wonder if this setup would be useful for Hannah? HT: Sure
AF: Did benchmarking on a bunch of proteins. Kept seeing secondary structure issues. CC tried getting around this by fitting to NMR. We tried (something with the PDB). Developing hypothesis that the issue is training data, not getting enough helical turns and some issues with inter-residue contacts. Doesn’t seem like the issue is how things are being fit
- MG: Status of new QM data?
- AF: Waiting for new QM data to be finished.
- AF: We picked a few thousand structures that represented diversity in the PDB. Something about getting a bunch of torsional something. Looked for simplified sequences ?. Selected 200 unique 4mers with a good diversity in torsional space. Conformers centroids of torsional space. 1000 conformers from 200 sequences. Had to write code to cap without messing with torsions. This dataset is optimizations, not torsion drives.
- How long will this take? J Clark: 10 days
- MG: … seems excited. Some ideas about improving quality of QM data, how restraints are applied or something
- … timeline of adding this into a refit?
- CC: Could add them in immediately
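
A rough sketch of the two-stage idea CC raised above: start with many periodicities, use LASSO to prune terms that don't help, then refit only the surviving terms. Everything here is an assumption for illustration: the energy model is a plain sum of k_n cos(n phi) terms with phase 0 (a negative k_n is equivalent to a positive one with phase 180), the torsion scan is a uniform grid, the energies are synthetic, and the optimizer is a small pure-Python coordinate descent, not ForceBalance or smee.

```python
import math


def design_matrix(angles, periodicities):
    """One cos(n * phi) column per candidate periodicity."""
    return [[math.cos(n * phi) for n in periodicities] for phi in angles]


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def coordinate_descent(X, y, lam, active=None, sweeps=200):
    """Minimise 0.5 * ||y - X k||^2 + lam * ||k||_1 over the active columns."""
    n_cols = len(X[0])
    active = list(range(n_cols)) if active is None else active
    k = [0.0] * n_cols
    residual = list(y)
    for _ in range(sweeps):
        for j in active:
            col = [row[j] for row in X]
            norm = sum(c * c for c in col)
            # Column/residual overlap with column j's own contribution added back
            rho = sum(c * r for c, r in zip(col, residual)) + norm * k[j]
            new = soft_threshold(rho, lam) / norm
            delta = new - k[j]
            if delta:
                residual = [r - delta * c for r, c in zip(residual, col)]
                k[j] = new
    return k


def two_stage_fit(angles, energies, periodicities, lam):
    X = design_matrix(angles, periodicities)
    # Stage 1: LASSO over all candidate periodicities
    stage1 = coordinate_descent(X, energies, lam)
    kept = [j for j, kj in enumerate(stage1) if kj != 0.0]
    # Stage 2: unpenalised refit restricted to the terms LASSO kept
    stage2 = coordinate_descent(X, energies, 0.0, active=kept)
    return {periodicities[j]: stage2[j] for j in kept}


# Synthetic scan: only periodicities 1 and 3 contribute
angles = [2 * math.pi * i / 24 for i in range(24)]
energies = [1.5 * math.cos(phi) - 0.8 * math.cos(3 * phi) for phi in angles]
fit = two_stage_fit(angles, energies, [1, 2, 3, 4, 5, 6], lam=1.0)
print(fit)
```

The second stage matters because the LASSO penalty shrinks the surviving force constants toward zero. Refitting without the penalty removes that bias while keeping the pruned set of terms.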

Polymer discussion

JW: Slide(s) on defining common terms

Build topology/coordinates
Ingest into OpenFF
Parameterize & output to MD engines
Run simulations
Analyze
Build a polymer FF

AF: Should we start benchmarking polymers? Have somebody in Shirts group that could do this.
DM: General attitude is that we don’t have resources to work on stuff, but would be happy to see new data

JW: What’s going on with MuPT?

TB: Lots of stuff. Want to use atom-typed force fields, custom residues, …

JW: Can load polymers if you define monomers with residue name and atom name and some other stuff. This uses openff-pablo. …

JW: What’s the biggest bottleneck in polymer simulations?

JH: Equilibration, but only because dynamics are slow. This is a force field, not an infrastructure problem

HT: Molecule de-duplication issue I reported earlier

MRS: context for polymer discussion (since I’ll be teaching)

MuPT - funded collaboration for general tools for setting up polymer simulations

This covers all soft materials (including handling both CG and AA), but definitely includes polymer/protein interactions. Relevant touchpoints with OpenFF.

Box of proteins + polymers
Crosslinked proteins/polymers (PEGylation, attaching fatty acids)
Glycoproteins
Bringing in PDBs from however they are generated and parameterizing
Turney/Matta formulations project also shows utility of polymers in drug design
We want to support non-OpenFF force fields as well, via Foyer

A lot of these MuPT needs overlap with tool requirements for OpenFF. Proteins/nucleic acids/glycoproteins are a subset of polymers . . . What are those overlaps/opportunities for shared tooling? DISCUSS!

Afternoon

Meghan Osato: Partial Charge Variability

TG: When you report max partial charge diff, is that on a single atom?
- MO: Right, the value reported for the molecule is the highest value of any atom in the molecule
JC: What’s the history behind this problem?
- MO: We found several papers in the literature that documented similar problems, e.g. on sugars
- DM: Can you explain how ELF works?
- MO: (explains ELF) essentially it looks for conformers where the molecule is most spread out
- DM: CB had found that AM1-BCC doesn’t generally have much conformational dependence, but it can when there are strong internal electrostatic interactions.
CC: Why did NAGL have any variation at all in free energies?
- MO: That’s the baseline that comes from other sources
TG: Were you aware whether AT was using a threaded or single core library?
- MO: Not sure
- JW: The toolkit doesn’t change AT settings from default
- TG: You could set #threads to 1
DM: The big picture is that these are more reasons to use NAGL.
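
A small sketch of the per-molecule metric MO described in answer to TG: take a difference for each atom, then report the largest one for the molecule. The notes do not say how the per-atom difference is defined, so the max-minus-min spread across repeated charge assignments is an assumption, as are the function name and the toy charges.

```python
def max_charge_difference(charge_sets):
    """Largest per-atom spread across several charge assignments of one molecule.

    charge_sets: list of per-atom charge lists, one list per assignment
    (for example one per input conformer), all in the same atom order.
    """
    n_atoms = len(charge_sets[0])
    if any(len(charges) != n_atoms for charges in charge_sets):
        raise ValueError("all charge sets must cover the same atoms")

    # zip(*...) regroups the data so each tuple holds one atom's charges
    per_atom_spread = [max(atom) - min(atom) for atom in zip(*charge_sets)]

    # The molecule-level value is the worst atom
    return max(per_atom_spread)


# Hypothetical charges for a 3-atom molecule from two assignments
toy = [[-0.50, 0.25, 0.25], [-0.40, 0.20, 0.20]]
print(max_charge_difference(toy))
```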

Jen Clark: Dataset Archival

JE: Each entry in the list corresponds to a single molecule?
- JC: Correct
Proposal: Just store the SQLite file on Zenodo
(room consensus) That sounds great
CI: if QCA makes changes to spec, would you be able to import those directly? Would it have to be migrated? Also, is the idea to back it up or to make it easy for others to use? It’s trivial to dump and store sqlite files, but you could have another file format that’s easier for people to use.
- JW: we don’t have a lot of clarity of use cases on what format would be most useful to us today. But inevitably we would have to convert sqlite → qcfractal → endpoint.
- DM: our most obvious use case is that when we make a release, it relies on data in QCArchive that may not always be present and may get retired. We need some place to put it.
- LW:
CI: with the QCArchive records, these can more or less be dumped to a dictionary, instead of relying on internal schemas (which are just pydantic models). Maybe dumping this into dictionaries but could be driven by the code.
- JCl: the dict representation is what I was referring to with a JSON file. The sqlite contains this but with redundant info removed. There’s definitely a higher barrier to access this, but you don’t need anything from molssi to use this file, just an understanding of how databases work.
- JW: Given that Zenodo is cheap/free, could we add both representations?
- JCl: there’s still an issue of structure. Converting it to a JSON file meant I had to invent my own schema.
- JE: to understand the best way to use this we also need an understanding of the use case.
TG: I really like this. This is also language-agnostic, I could write my own queries with e.g. C, would be much faster. Will the sqlite database have the same representation as the postgres server that molssi actually uses or is this a simplified version?
- CI: I believe these are the internal representations – so they’re the QCFractal representations, so you would need QCFractal to understand them.
- JC: No, you can query it independently.
- JW: can you query the molecule info without QCFractal or do you get bytes?
- TG: you get arrays. The schemas are typed.
JCl ~2.47pm PT: awesome live demo
- TG: this looks much more friendly than the internal postgres
JW: The compelling case against sqlite and JSON would be that there’s no agnostic way to export to JSON…. is that correct?
- JCl: yes. MolSSI have schemas for individual records and objects, but how you piece those together with the metadata is not agnostic.
JE: is it correct that the problem we’re trying to solve is that data will not always be on QCArchive, i.e. may disappear?
- JW: yes, that’s one of the major problems about this
- LW: also, not having to install an entire software stack to use some data
- DM: options seem to be either download the software stack, or pull the data ourselves, or we put effort into fixing both and there’s no clear current use case
- JM: wondering about the BLOB types, are we sure we can decode the BLOBs without software?
- JN: that’s possible.
- TG: that looks like a binary blob.
- JW: if it is a blob and you need QCFractal to decode it, does that impact our conclusion?
- (General): is there a spec that would define how to decode the blobs?
  - JW: unlikely, this is a very recent feature
- JM: one solution is to store a Docker image
- MG: or a reader
- JW: a Docker image sounds like a good idea. There’s probably a whole discussion to be had about the best format, e.g. about the image. I propose we short-circuit that by just doing a Docker image and waiting for someone to complain
- MT: this reminds me of discussion of MD simulation reproducibility; it could go on forever.
- DM: we should do the simplest thing. If someone comes along later with a convincing argument, we can do that instead.
JM: …
JC: if they’re binary blobs, instead of doing a Docker image we could replace them with JSON strings
(General): ask Ben about blobs
CI: if the main goal is just that data isn’t stranded on QCA, publishing SQL on Zenodo seems to solve that, even if you need to install QCFractal. Other discussion of formats could be more of a user problem, adding examples would be helpful but you can’t solve everything for everyone.
JW: agree, this is a good option.
(General): agree the below sounds good:
- SQLite + Docker image
TG: I’ve been using the DES dimer dataset and that’s a big CSV file. They provide a helper Python script to get xyz files.
DM: sounds like it requires human time to understand the dataset. Using the existing format like QCSchema means the work is already done.
TG: CSVs are immediately readable
JW: CSVs sound good to me. If we were willing to put in the work, I think CSV would be the way to go. But sqlite is free
LW: Agree, pros/cons of CSV sound the same as JSON to me.
(General): anyone else released an sqlite database? how did they do it?
- (General): not that we know of immediately. It’s only recently people put effort into releasing data at all.
- CI: I’ve seen a lot. XYZ files, XYZ + CSV. Even CSV files, which are much easier to work with, once you add CSV with many other non-standard files, that gets annoying. Most datasets that aren’t awful to work with have been hdf5. There’ve been some database formats, but they’re all different. I worked with lib-something that was meant to be lightweight and you could go through entry by entry. In general one unified single format with standardised keys is better than a random xyz or text file. SQL is totally fine; should be fine to provide examples on how to extract data.
- DM: https://xkcd.com/927/
- CI: for a lot of things, it’s just that there’s no good examples currently on how to present data in a way that’s easy for people. People know how to use XYZ files so they use XYZ.
JN: A few things. From my perspective, sqlite is pretty common for databases to be distributed in. Right now the sqlite docs are under the caching section of the QCArchive docs, so that’s likely the intended use and the reason for the QCA objects vs general representation.
- https://docs.qcarchive.molssi.org/user_guide/datasets/caching.html
- JW: BP did make that clear to us, that reading it was dependent on the QCF version used to make the cache.
JN: Also, CSVs are maybe too simple to represent all the complex data available in the QCArchive with all the properties, etc.
PB: will this happen for all datasets on QCArchive?
- JCl: just what we use for released force fields. Currently the JSON representation we publish requires pulling records from the server to recapitulate the dataset.
- TG: is that a QCSubmit or QCArchive collection?
- JCl: QCArchive collection.
- JW: we did have a conversation with BP about dataset lifecycles where datasets would start on hot storage and maybe eventually move off the server and onto permanent end-of-life storage.
- PB: why not do it for all the datasets we submitted
- (General): discussion about releasing FB targets
JCl: my action items are:
- Docker image
- Talk to Ben about interpretability of BLOBs
- Follow up with NIST contact on storing these as SQLite
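
One way to approach the BLOB question above without installing QCFractal is to inspect the file with Python's built-in `sqlite3` module and list which columns are declared as BLOB. This is only a sketch: the in-memory database, table name and column names below are hypothetical stand-ins, not the layout of the QCArchive cache files.

```python
import sqlite3


def column_types(conn):
    """Map each table name to {column name: declared type}."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
        cols = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = {name: decl_type for _, name, decl_type, *_ in cols}
    return schema


def blob_columns(schema):
    """(table, column) pairs whose declared type is BLOB."""
    return sorted(
        (table, col)
        for table, cols in schema.items()
        for col, decl_type in cols.items()
        if decl_type.upper() == "BLOB"
    )


# Hypothetical stand-in for a downloaded archive file; for a real file,
# sqlite3.connect("path/to/file.sqlite") would be used instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, smiles TEXT, payload BLOB)"
)
conn.execute(
    "INSERT INTO records (smiles, payload) VALUES (?, ?)", ("CCO", b"\x00\x01")
)

schema = column_types(conn)
print(blob_columns(schema))
```

SQLite's declared column types are only hints, so a column declared as TEXT can still hold bytes. Checking stored values with SQLite's `typeof()` function would be a more thorough follow-up.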

Day 4 Notes
