2021-05-13 Force Field Release meeting notes

Date

May 10, 2021

Participants

  • @Pavan Behara

  • @David Mobley

  • @Christopher Bayly

  • @Simon Boothroyd

  • @Hyesu Jang

  • @Jessica Maat (Deactivated)

  • @Trevor Gokey

  • @Jeffrey Wagner

  • @Matt Thompson

  • @Daniel Cole

  • @Mateusz Bieniek

  • @Joshua Horton

  • @Lee-Ping Wang

Goals


Discussion topics


Internal coordinate hessian fitting update

Hyesu Jang

  • https://openforcefield.atlassian.net/wiki/spaces/~868974935/pages/1726513365

  • HJ: We are moving from vibrational frequency fitting to internal coordinate hessian (ICH) fitting: instead of a sum of squared differences between QM and MM frequencies, we will compare the QM and MM hessians directly (see the sketch below).
    Fit1 and fit2 compare (td+opt+vib freq) versus (td+opt+hessian) targets.
    In the last presentation I showed there is a difficulty with the rearrangement method.
    Here is a new alignment method which uses the internal coordinates; with it, the RMSE between QM and MM frequencies drops.
    The method is good, but the difference between the two fits is moderate.
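    A minimal sketch of the two objective types (illustrative only, not HJ's implementation; the naive frequency sort stands in for whatever mode assignment is actually used):

      # Illustrative sketch of the fit1-style vs fit2-style objectives.
      import numpy as np

      def vibfreq_objective(freqs_qm, freqs_mm):
          """fit1-style: sum of squared differences of assigned frequencies.
          Sorting is a crude stand-in for proper mode assignment."""
          return np.sum((np.sort(freqs_qm) - np.sort(freqs_mm)) ** 2)

      def ic_hessian_objective(hess_qm_ic, hess_mm_ic):
          """fit2-style: elementwise squared difference of the QM and MM
          hessians expressed in the same internal coordinates."""
          return np.sum((hess_qm_ic - hess_mm_ic) ** 2)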

  • CBy: What’s your conclusion?

  • HJ: Fit2 was slightly better

  • CBy: So, from the standpoint of the objective function, the improvement is marginal?

  • LPW: The second half of this material was mostly showing a validation test. But part B was very similar to fit1, where there was no preassignment of modes. My interpretation: I wasn’t expecting the internal coordinate hessian fitting to be so similar to vibrational frequency fitting.

  • CBy: I think that the results I’ve seen have shown a dramatic improvement in the quality of the fit. This is probably because I’m looking qualitatively and seeing fewer outliers.

    • HJ: The plot you’re looking at ISN’T comparing fit1 and fit2. Instead this is showing how much things improve when we do a better job of matching the modes.

    • CBy: Hm, my original point may still stand. It looks like fitting to the IC hessian is a total win.

    • SB: This seems like it has potential to be a huge win. In 95% of cases, things were fine to begin with. But the 5% that were way off should be greatly improved. So I’m not sure whether a global analysis will really show the results of doing better on this 5%. Maybe a deeper dive is needed.

    • DC: What sort of overall accuracy should we be aiming for? 100+ wavenumbers is a lot of error. In our fitting we generally get 50 wavenumbers.

    • LPW: I’d expect that the quality of the results will depend on the ratio of training parameters to observations. The work that our different groups do may be in different regimes in that space. When I was working on protein FFs, we’d normally see errors start around 400 wavenumbers and drop to 200.

    • CBy: In the cases where you get better agreement, is that in a bespoke case, or in a general case?

    • DC: Our experiences are more in the bespoke regime.
      https://pubs.acs.org/doi/10.1021/acs.jctc.7b00785

    • CBy: I was advocating for taking vibfreq out of our fits because it is so local, and I still prefer to use HJ’s method.

  • LPW: I think that there’s a difference in how we do validation that may be affecting how we see/measure these results.

    • CBy: Could that be affecting our entire approach to fitting FFs?

    • LPW: The problems that HJ showed me mostly had to do with assignment of modes (see the mode-matching sketch below).
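    A minimal sketch of one standard way to pre-assign modes (illustrative; not necessarily the method used in these fits): match MM normal modes to QM modes by maximizing eigenvector overlap with an optimal linear assignment.

      # Illustrative sketch: assign MM normal modes to QM modes by
      # maximizing eigenvector overlap (Hungarian algorithm).
      import numpy as np
      from scipy.optimize import linear_sum_assignment

      def match_modes(qm_modes, mm_modes):
          """qm_modes, mm_modes: (n_modes, 3N) arrays of normalized
          mass-weighted eigenvectors. Returns the MM mode index assigned
          to each QM mode."""
          overlap = np.abs(qm_modes @ mm_modes.T)     # |<qm_i|mm_j>|
          row, col = linear_sum_assignment(-overlap)  # maximize total overlap
          return col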

  • (HJ shows bottom of confluence with normal mode diagrams)

    • HJ: The N#N stretch may require a larger spring constant. This may be related to the C#C issue that was raised before.

    • CBy: What this slide tells me is that there’s a mis-assignment of modes. (connection issue during this comment)

    • LPW: I think what we’re seeing here ISN’T that the assignment of modes is causing the fundamental issue on the left. What HJ told me is that, when HJ manually reviewed the modes, none of them were just an N#N stretch; they were always coupled to other motions. So any attempt to fix this will need to address the normal mode detection.

    • CBy: Will IC hessian fitting fix this?

    • LPW: That was the hope, but what we’re seeing is that this hasn’t happened. One of our hypotheses right now is that the initial value of the bond k parameter was too low, so it can’t be increased as high as it should be.

    • HJ: I agree with LPW. If the N#N k value needs to be 3000 kcal/mol/A^2, then starting from the current initial value we will not be able to resolve this. I think we should try setting the initial k value higher.

      • Later in meeting, in chat: JH ran bespokefit on this molecule, got 1468 kcal/mol/A^2 (consistent with 3000/2)

    • LPW: Before we do that, I’m curious whether the SMIRNOFF format already has a factor of 1/2 for bond k values.

    • JW: The SMIRNOFF format does not take k/2, it takes k.

    • DLM: IIRC only AMBER drops the prefactor of 0.5 (see the convention note below)
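    For reference, the two conventions as I understand them from the public SMIRNOFF spec and standard AMBER usage:

      E_SMIRNOFF(r) = (k/2) * (r - r0)^2      (the stored k is the full force constant)
      E_AMBER(r)    = K * (r - r_eq)^2        (the 1/2 is folded in, so K = k/2)

    So a “true” k of ~3000 kcal/mol/A^2 corresponds to an AMBER-style K of ~1500, in line with the bespokefit number noted above.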

    • CBy: It’s possible that we may need to fit “true” k values for these sorts of triple-bonding situations, but then to reduce the value of the final k artificially to make it suitable for MM simulations.

    • LPW: IIRC, that problem wasn’t caused by the C#C bond itself, it was a neighboring single bond that did it. I’m optimistic that training to smaller fragments will give us the correct emergent properties.

    • CBy: Regarding the force constant, what one could do is have FB NOT fit the force constant, but rather fit sqrt(force_constant), which could keep things from getting out of hand (see the sketch after this exchange).

    • LPW: I think that’s a good idea

    • DM: Didn’t we do that before?

    • LPW: We did for impropers
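    A minimal sketch of CBy’s suggestion (illustrative; not ForceBalance’s actual mechanism): optimize s = sqrt(k), so that k = s^2 stays non-negative and large force constants grow more slowly per optimizer step.

      # Illustrative sketch of fitting sqrt(k) instead of k.
      import numpy as np
      from scipy.optimize import minimize

      def objective_in_s(s):
          k = s[0] ** 2                     # k = s^2 >= 0 by construction
          # stand-in for the real force-field objective as a function of k
          return (k - 3000.0) ** 2 / 1.0e6

      result = minimize(objective_in_s, x0=[np.sqrt(700.0)])
      print("fitted k =", result.x[0] ** 2)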

  • CBy: LPW had previously said that, when the vib freqs are so far off (due to misalignment), their information is so local that it can’t be used in a global fit. SB had mentioned this as well in a meeting last week. Is there some way to know when this is the case? I.e., is there some analysis that will tell us that training to vib freqs will work in a certain situation?

    • LPW: Would one solution be to fit a set for force constants to some mols/data, and then test whether it does a good job on other confs/mols?

    • CBy: (couldn’t follow)

    • LPW: I think having internal coordinates that touch a lot of parameters is an important part of doing successful FF training. And I think that if the starting values are off by a factor of 2 or 3, then we also may be unable to do a successful training. I wonder if we could start with one of DC’s bespoke fits as a starting point; that would make us less susceptible to bad initial values.

    • DM: This may also be informative to help us identify when a parameter lumps too many molecules together

    • DC: We were just talking about something like this today

  • DM: Do we have subsequent experiments that would help identify this? Like

    • Whether a larger starting N#N k value will fix this problem

    • (something else)?

    • HJ: Sounds good

  • LPW: What I’m wondering about is how many other manifestations of this problem are out there, and whether this will fix them/how we’ll find them. So a decent longer-term solution may be to come up with a different set of initial guesses.

  • DM: Another idea is to change the objective function to do a better job of identifying fit2 as an improvement – Maybe more heavily penalizing outliers.

    • LPW: It was nice to see the existing objective function reporting that the IC hessian fitting was getting a better score than the vibfreq fitting. This eliminates one of the big sources of error when we’re doing something really complicated. So while it has some immediate effect on the quality of the resulting FF, it also removes one more possibly super complex source of error, and makes the result more debuggable.

    • LPW: I was thinking that one benchmark we could do would be something like “running ab initio MD and MM MD and getting the power spectrum (like an IR spectrum but without the dipoles) from each” (see the sketch below). Ab initio may be a bad idea because it’s so expensive, but something like this would be good.

    • DM: Or it might be the case that we’d see improvement in something as simple as relative conformer energetics.
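    A minimal sketch of such a benchmark (illustrative assumption; array names are hypothetical): the power spectrum computed as the Fourier transform of the velocity autocorrelation function, which can be evaluated identically for ab initio and MM trajectories.

      # Illustrative sketch: power spectrum from MD velocities via the
      # velocity autocorrelation function (VACF).
      import numpy as np

      def power_spectrum(velocities, dt_fs):
          """velocities: (n_frames, n_atoms, 3) array; dt_fs: timestep in fs.
          Returns (frequencies in cm^-1, spectral intensities)."""
          v = velocities.reshape(velocities.shape[0], -1)
          n = v.shape[0]
          # autocorrelation of each velocity component, averaged
          vacf = np.mean(
              [np.correlate(v[:, i], v[:, i], mode="full")[n - 1:]
               for i in range(v.shape[1])], axis=0)
          intensities = np.abs(np.fft.rfft(vacf))
          freqs_hz = np.fft.rfftfreq(n, d=dt_fs * 1.0e-15)
          return freqs_hz / 2.99792458e10, intensities  # Hz -> cm^-1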

  • SB: Is this available in FB to play with, or is it in a development branch?

    • (General) – It’s in HJ’s local branch.

    • HJ – I’ll open a PR for this.


 

Automated fitting infrastructure

Simon

  • SB: About a year ago, I built a new package called “nonbonded”, which curates physical property datasets and sets up and runs simulations. Over the past month, I’ve done roughly the same thing for QM data and fitting. (Shows curate-training-set.py)

    • I built a similar thing for QM valence fits, where we build collections and apply filters: hydrogen bond filters, steric filters, etc.

    • Also, I am storing the record IDs so that we can easily pull the data down again (see the sketch below)
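    A minimal sketch of this kind of filter-based curation (hypothetical names, for illustration only; not the actual “nonbonded” package API):

      # Illustrative sketch of filter-based dataset curation.
      from dataclasses import dataclass, field
      from typing import Callable, List

      @dataclass
      class CuratedDataset:
          records: List[dict] = field(default_factory=list)

          def apply_filter(self, keep: Callable[[dict], bool]) -> "CuratedDataset":
              return CuratedDataset([r for r in self.records if keep(r)])

      # e.g. drop records flagged as having intramolecular hydrogen bonds
      def no_hbond(record: dict) -> bool:
          return not record.get("has_internal_hbond", False)

      raw = CuratedDataset([{"id": "qce-1"}, {"id": "qce-2", "has_internal_hbond": True}])
      curated = raw.apply_filter(no_hbond)
      record_ids = [r["id"] for r in curated.records]  # stored for later re-retrieval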

  • CBy: Re: Immutability of datasets – Could we make a version based on when it was pulled down?

    • SB: One of the things with versioning is that, when you submit a dataset, it “is” versioned, but pulling it down on different dates will give different results. Another issue is that if we make a version each time we do a potential fit, we’ll end up with a lot of versions and a lot of complexity.

  • SB: I’ve schema-ized everything that goes into an optimization. (shows setup-vdw-v1.py)

    • In my CLI we can directly create ForceBalance inputs for different test fits

    • analysis of the results too

    • openforcefield-forcebalance does this on the QC side; this would be an improvement over that, since a single pythonic command can create the FB inputs (see the schema sketch below)
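    A minimal sketch of what “schema-izing” an optimization might look like (hypothetical fields; not SB’s actual schemas), using pydantic-style models so the full fit specification is serializable:

      # Illustrative sketch of a schema-ized optimization input.
      from typing import List
      from pydantic import BaseModel

      class TargetSchema(BaseModel):
          name: str                  # e.g. "torsiondrive-gen3"
          dataset_record_ids: List[str]
          weight: float = 1.0

      class OptimizationSchema(BaseModel):
          initial_force_field: str   # e.g. "openff-1.3.0.offxml"
          targets: List[TargetSchema]
          max_iterations: int = 10

      opt = OptimizationSchema(
          initial_force_field="openff-1.3.0.offxml",
          targets=[TargetSchema(name="vdw-training", dataset_record_ids=["qce-1"])],
      )
      print(opt.json())  # a serializable record of exactly what went into the fit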


  • JW: Where do you have the datasets stored? Do we get to generate the tarballs?


    • SB: Right now it’s in a database on Heroku. When we do an important fit, we can pull it down locally and make a GitHub tarball like we have before.


 

SAGE current progress

Simon

  • Slides here

  • First round of LJ refits have been completed

  • JW: Are the training and test sets separated?

    • SB: These RMSEs are only on the training data

  • CBy: On the enthalpy of mixing slide – Alcohols are the only donors in these mixtures. How are we doing on other compound classes like amines and amides?

    • SB: We saw some improvements with amines and amides, but nothing amazing. There’s improvement, but still large absolute error.

    • CBy: Amines are problematic since they’re also hbond acceptors. And the exptl data is hard because of their basicity. Do we have amide-ester or amide-ketone or amide-pyridine mixtures?

    • SB: I don’t think we have enough info on those mixtures. We have big problems with data availability.

  • CBy: Also curious about the multiple-minima-in-parameter-space problem.

    • SB: I think this problem still affects us. One thing that makes me cautious is that our current vdW parameters look a lot like the final reduced set from the Gilson lab work, we just haven’t reduced the number of hydrogens. But I think we primarily need more experimental data before we can definitively try changes like that.

  • (on “tertiary amines got worse” slide)

    • JW: OE or AT partial charges?

    • SB: OE

  • SB: I see degraded performance when fitting to gen3 torsiondrive set

    • HJ: What’s the test set?

    • SB: The same as the Lim+Hahn paper

    • SB: I think there might just be some tricky molecules in the gen3 set that need to be filtered out

    • HJ: Another possibility is that the torsions in the molecules were refit using…?

    • DM: Some other possibilities are

      • It’s possible that the benchmark set from the Hahn+Lim paper isn’t great. A lot of the molecules in the dataset were inspired by the same method used for making the 1.2 training data

        • HJ+SB: Agree. But one would expect less significant degradation in that case

    • DM: SB used the term “Significant degradation” – Do we have error bars for this plot?

      • SB: I’m coding up error bars, but they’re not ready yet.

    • DM: What would happen if we took the 1.2/1.3 training data, and supplemented with some amount of the simple training set?

      • SB: That’s in progress.

    • CBy: I’m wondering if we’re seeing some of the consequences of apples-and-oranges fitting data. For vdW, we’re fitting the well depth and the “far away” interactions (the apples). Then in the torsiondrives we’re in the “oranges”, where we’re looking at performance on the short-range, “steep” part of the potential.

      • SB: Agree. The interplay between these two regions is hard. But I don’t think this is the case here. We designed the gen3 set to NOT have strong sterics or hbonds. I’d like to do some sort of analysis of which part of the LJ wells our datasets generally reside in (see the sketch below).
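      A minimal sketch of that kind of analysis (illustrative assumption): classify intermolecular contact distances by where they fall on the LJ curve relative to the minimum at r_min = 2^(1/6) * sigma.

        # Illustrative sketch: classify pair distances by LJ region.
        def lj_region(r, sigma):
            """Label a distance as repulsive wall, well, or long-range tail."""
            r_min = 2 ** (1 / 6) * sigma   # location of the LJ minimum
            if r < r_min:
                return "repulsive wall"
            elif r < 2.0 * sigma:          # arbitrary cutoff for the well region
                return "well"
            return "tail"

        for r in [3.0, 3.8, 7.5]:          # angstroms, hypothetical contacts
            print(r, lj_region(r, sigma=3.4))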

    • CBy: A few months ago, you suggested refitting 1-4 scaling factors.

      • SB: In the experimental fits where I tried varying those, we haven’t seen a major influence. I’m planning on continuing to test this.

  • DLM: So, you were initially worried about geometries after the LJ refits, but they don’t go bad, right?

    • SB: Exactly

  • CBy: We still are not using ICH yet, so I think we need to see how it changes these results

    • SB: Yeah, I am going to do that

  • JW: Two Qs:

    • Do these plots include vdW refit?

      • SB: yes

    • Would it be necessary to delay a “final” Sage release until ICH fits are incorporated, or could we make a 2.0.0 release with the green line on this slide, and then a 2.1.0 release a month later with the ICH? This may relieve stress/internal pressure to rush ICH fitting into a production FF release

      • DM + Cby: Agree

      • SB: Possibly. I’d like to do a closer study of whether we have significant degradation on any functional groups.

      • SB: I don’t think we should release the green FF immediately. There’s other data that shows that the performance gets worse in, for example, binding FE calcs.

      • DM: Could you present this to advisory board next week?

      • SB: Yes


  • DM: I have a lot of skepticism about whether we can really get useful info for a go/no-go decision using a single protein target.

    • CBy: One thing that’s special here is that this is the first time we’re mixing a vdW refit of our FF with a protein FF with a different vdW set. And especially, this metric (binding free energy) is very heavily dependent on the vdW terms. So one thing I’m thinking is that we shouldn’t release a vdW refit in Sage without being able to refit the protein FF.

    • SB: That’s a good point. I’ll share a bit more of my rationale for including this:

      • This isn’t a test, it’s a validation.

      • The real reason I wanted to include this is to ensure there’s not severe degradation. And we need to ensure we don’t overinterpret this info.

    • CBy: Agree that we may be overinterpreting. The general issue, though, is that when we do something that’s really significant (like reparameterize the vdW) only for the ligand FF, we would reasonably expect there to be issues.

      • CBy: If we see general degradation in protein-ligand binding, then the advisory board would be asking “when do we get a corresponding protein FF?”

    • DM: If we hold off on the LJ refit until we can refit the protein FF, then we have to answer why Schrodinger can refit OPLS without changing the protein FF, or why GAFF can be refit without changing the protein FF.

    • CBy: Yeah, but we should have a plan for when people ask what we’re going to do about parameters going out of sync. I don’t think this should hold back sage.

    • SB: Agree with basically everything above.

  • SB: One direction that we’ve considered is to do something similar to the water model. For that, we’ve forced TIP3P parameters into our mixture simulations, so we know it may be putting deficiencies into our FFs, but at least it’s consistent. So we’re thinking about doing some mixtures with amino acids parameterized using existing FFs, but there is a lack of data there.

  • JW: I think we need to move forward without complete info on the protein-ligand results.

  • SB: Agree. We can probably use Lilac to do more of the JACS set, but we don’t really have a route where we’re guaranteed to get a significant result.




Sulfonamide debug discussion

Pavan

  • https://openforcefield.atlassian.net/wiki/spaces/FF/pages/1715208426

  • PB will do major presentation of this tomorrow

  • SB: It looks like the ring geometries don’t affect the equilibrium and k

    • (General) – Agree

  • JW: I wonder if certain targets are leading to systematically high/low values for angles/ks because they’re resisting restraints

    • HJ: That’s possible

  • SB: Looking at the results of my fits, I see a few things:

    • Excluding vibfreq from the fits reduces k for a30; in fact, I see a decrease in k whether or not I include vibfreq.

  • TG: Was a penalty (regularization) used for this fit?

    • PB: Yes, L2.

    • TG: I think we’d expect a difference here based on the choice of initial values.

    • SB: So, if we took off the prior, and used the most permissive prior we could, then we’d see the effect more clearly?

    • (General) – It would be more informative to redo these fits, completely removing the priors.

      • TG: My trick for this is to set the L2 penalty to -1, which makes it cancel itself out.

      • SB: What if we set it to 1000?

      • TG: There’s still a penalty term that would be contributing.

      • HJ: I’ll need to think about the best way to support this.

    • SB: So, let’s try setting a penalty weight of -1 for now.

    • TG: You could also check the penalty value and see if it’s large relative to other factors in the fit (see the note on the penalty form below)

    • HJ: We do have a way to use the mvals instead of pvals – That should solve the problem

    • TG: One assumption I make is that we don’t reuse mvals from previous fits.

      • (General) Correct

    • HJ: I think we should reuse mvals in this case if we want to keep penalty size consistent.

    • TG will post the workaround as a comment in PB’s confluence page above.
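    For context on the penalty discussion above (based on the ForceBalance documentation as I understand it, so treat the exact form as an assumption): with an L2 penalty the objective is roughly

      L(m) = chi^2(m) + w * sum_i m_i^2,    with m_i = (p_i - p_i^0) / prior_i

    where the mvals m are the prior-scaled mathematical parameters, the pvals p are the physical ones, and w is the penalty weight. Starting each fit from fresh mvals (m = 0) keeps the penalty’s reference point at the initial parameter values, which is the assumption TG mentions above; reusing mvals from a previous fit would shift that reference point.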

  • This discussion will continue in tomorrow’s call.

Action items

Decisions