
Participants

Recording: https://us06web.zoom.us/rec/share/fSuEfOUIGy5pCoULGojII3t0Ax38wkVl9N0pY6TeB7KdavGZ3AWgzY__p5KaAlPE.83O6K6Xp_tBBgMYz

Passcode: 93JX^gnM

Discussion topics

Item

Presenter

Notes

Re-fitting with additional S data

AMI

Slides: ff_fitting_10232024.pptx
JR: I’ve heard people talking about needing extra sites to model the quadrupole of sulfonamides, so it may be an electrostatics issue
- AMI: Yes, P and S aren’t super well treated by AM1-BCC, so we’re definitely considering electrostatics involvement
JR: are the P molecules charged?
- AMI: some of them
- JR: I’m always somewhat dubious about dealing with charged molecules in gas phase
- AMI: QM structures look reasonable. I don’t think that’s the issue here. MM structures optimizing to crazy structures is more of the issue
- JR: so don’t see any H migration from another part of the molecule?
- AMI: no, we filter that out.
BS: surprised that adding more sulfur data degraded the P behaviour. Would have thought they were decoupled
- AMI: I think it’s mostly unrelated. Some of the S data also has P in it, we are expanding that dataset. If my hypothesis is correct and it’s due to the force constant being too low, the data may change what it’s optimizing to
LW – You’d mentioned that a lot of the phosphonate/sulfonate issues were mols with ?? - We have quite a lot of phosphorus training data, right?
- LM – It’s about 10% of our training data. One angle has 7k data points, another has 500.
- LW – So one mol could contribute 2 training data points for different params?
- LM – Yes
PB (“Sulfamides and sulfonamides” slide): It looks like (2.2.1 with addl data) is similar to 2.2.1?
- LM – Yes, I think the sulfur problem was fixed by adding the new data (“fixed” short of changing electrostatics)
- PB – so what’s the problem?
- LM – … for example 4-membered ring params optimize far from MSM estimates, not sure why. Maybe related to ring strain?
- PB – Still confused as to whether there’s a problem with restricting the prior to smaller values?
- LM – We’re basically just training the torsions at that point. In the context of training a larger FF we didn’t want to treat the MSM values as gospel, and previously we weren’t seeing values change much from MSM starting points.
- PB – Also, you’re introducing two changes here - changing priors AND adding new data. Could those be separated to see independent effects?
- LM – I could do that.
- LW – Would you expect to see much difference if the tighter priors of 2.2.1 didn’t allow it to change much from MSM values?
- LM – The MSM values might change, but I don’t anticipate that they’d change much from the new MSM value. Though we do have a lot more sulfur data so maybe the parameter’s starting point is changing. So maybe I should look at how much the starting points change.
PB – Did the new fit help with 4-membered rings?
- LM – The “Luckily, small rings…” slide shows either no change or slight improvement. The change may be due to heterocycle inclusion.
- JR – Heterocycles all 4-membered? or all 5- and 6?
- LM – 4-membered. Some predictable issues given that they don’t have specific 4-membered ring params, e.g. the CNC ring angle is also trained on 5- and 6-membered rings.
JR – Would any of these be affected by 1-4 or 1-5 LJ (do they have any influence whatsoever, good or bad?)
- LM – These are mostly confined to one angle, so I’m not sure if 1-4 interactions are coming into play.
- JR – The atoms will be closer together in a 4-membered ring, and the ring will be strained.
- JW – We always use the shortest path when applying 1-4 scaling - so in a 4-membered ring an atom is its own 0th neighbor, and it is excluded from nonbonded interactions.
- LW – Right, in a 3/4/5/6-membered ring we wouldn’t be computing 1-4 interactions.
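The shortest-path rule JW describes can be checked on bare rings with a small graph search. This is a minimal sketch, not the toolkit's implementation: `ring_bonds`, `bond_distances` and `count_14_pairs_in_ring` are hypothetical helper names, and the ring is modeled as ring atoms only, with no substituents.

```python
from collections import deque


def ring_bonds(n):
    """Bonds of a bare n-membered ring, atoms indexed 0..n-1 (toy model)."""
    return [(i, (i + 1) % n) for i in range(n)]


def bond_distances(n_atoms, bonds, start):
    """Shortest-path bond count from `start` to every atom (breadth-first search)."""
    neighbors = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for nxt in neighbors[atom]:
            if nxt not in dist:
                dist[nxt] = dist[atom] + 1
                queue.append(nxt)
    return dist


def count_14_pairs_in_ring(n):
    """Number of ring-atom pairs whose shortest path is exactly 3 bonds (1-4 pairs)."""
    bonds = ring_bonds(n)
    count = 0
    for i in range(n):
        dist = bond_distances(n, bonds, i)
        count += sum(1 for j in range(i + 1, n) if dist[j] == 3)
    return count


for size in (3, 4, 5, 6):
    print(size, count_14_pairs_in_ring(size))
```

In this toy model the 3-, 4- and 5-membered rings have no ring-atom pair at a shortest-path distance of 3 bonds, while the 6-membered ring has three (the para pairs). Pairs involving substituents are not covered by the sketch.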

Lipid force field refit

JH

Working on refitting Sage to better model lipids. Been getting some help from science team, but have some additional questions:
- With new torsiondrive datasets, we’re adding 78 new alkane and lipid-specific headgroups. How much QM data is enough to extend to new chemistries? Or is it an iterative approach, and if so how do we know when we have enough?
- LW – I don’t think we have a systematic answer. With our current valence FFs we try to ensure that each parameter has sufficient coverage and overlaps with real world mols that we care about. We’ve been working recently to expand diversity to fill gaps in our coverage. BW used fingerprinting to ensure diversity when making lipidMAPS dataset, so that’s an example. Are you looking for more of an answer in the form
- JH – Talking with BW, we’d thought about generating larger datasets for headgroups, e.g. ionizable and synthetic lipids. So wondering about being on the same page about process of making lipid FF.
- DM – Agree with LW that we don’t have a systematic answer. Usually we look at “in how many mols is this parameter used?”. For common chemistries like alkyl tails, unless you’re creating new parameters, you won’t need to add more params. But for headgroups you might need to add more parameters, and ensure that there are 5+ different mols with that chemistry.
- LW – 5+ is a good goal, or higher. Though in our current workflows we’re still trying to get each parameter benchmarked at all.
- DM – On the QM side, if you imagine that you might someday want to test on a certain set, then you should go ahead and submit it. Even if you just want to use it for benchmarking.
- PB (in chat): Some history on Gen2 sets, Design of Second Generation Torsion Dataset
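The "in how many mols is this parameter used?" check and the 5+ goal mentioned above could be sketched as below. This is only an illustration: the input is assumed to be a plain mapping from molecule name to the parameter ids it exercises (however that labeling was produced), and the molecule names, parameter ids, `parameter_coverage` and `undercovered` are all hypothetical.

```python
from collections import defaultdict

MIN_MOLECULES = 5  # the "5+ different mols" goal from the discussion


def parameter_coverage(labels):
    """Count distinct molecules per parameter.

    labels: dict mapping molecule name -> iterable of parameter ids it uses.
    """
    coverage = defaultdict(set)
    for mol, params in labels.items():
        for pid in params:
            coverage[pid].add(mol)
    return {pid: len(mols) for pid, mols in coverage.items()}


def undercovered(labels, minimum=MIN_MOLECULES):
    """Parameter ids used by fewer than `minimum` distinct molecules."""
    counts = parameter_coverage(labels)
    return sorted(pid for pid, n in counts.items() if n < minimum)


# Hypothetical labels: tail parameter "t_tail" is common, headgroup "t_head" is rare.
example = {f"lipid_{i}": ["t_tail"] for i in range(6)}
example["lipid_0"] = ["t_tail", "t_head"]
example["lipid_1"] = ["t_tail", "t_head"]

print(parameter_coverage(example))
print(undercovered(example))
```

Parameters returned by `undercovered` would be the candidates for extra QM data under this rule of thumb.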
JH – When you create separate sets for training/testing, how do you split things up?
- LW – Currently, when we make training/testing splits, we try to ensure coverage in each set. There’s also a question of dataset types, since we train on both opts and torsiondrives. Currently we’re only benchmarking on optimizations.
- DM – It’s hard to predict what the split will be ahead of time. But one good way to split is to keep smaller, less complex mols for training, and larger floppier mols for benchmarking.
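DM's suggestion (smaller, less complex mols for training; larger, floppier mols for benchmarking) might look like the following sketch. The record fields, the thresholds and `split_by_complexity` are assumptions made for illustration, not values or tools from the discussion.

```python
def split_by_complexity(records, max_heavy_atoms=12, max_rotatable_bonds=3):
    """Assign small, rigid molecules to training and the rest to benchmarking.

    records: list of dicts with hypothetical keys
             "name", "heavy_atoms" and "rotatable_bonds".
    The default thresholds are placeholders, not recommended values.
    """
    train, bench = [], []
    for rec in records:
        small = rec["heavy_atoms"] <= max_heavy_atoms
        rigid = rec["rotatable_bonds"] <= max_rotatable_bonds
        (train if small and rigid else bench).append(rec["name"])
    return train, bench


# Hypothetical molecules
mols = [
    {"name": "small_rigid", "heavy_atoms": 8, "rotatable_bonds": 1},
    {"name": "small_floppy", "heavy_atoms": 10, "rotatable_bonds": 6},
    {"name": "large_floppy", "heavy_atoms": 40, "rotatable_bonds": 20},
]
train, bench = split_by_complexity(mols)
print(train, bench)
```

A coverage check on each resulting set would still be needed, since LW notes that both sets should cover the parameters of interest.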
JH – For validating after I train - there aren’t many best practices for lipid validation… Some people benchmark on modular PARTS of lipids. Also any tips for experimental-ish benchmarks?
- DM – My initial answer is that we aren’t the right group, and my initial plan was to have someone start working in this space, and then convene the right group to determine the right answer. If you get some momentum in this space then David Lebard at OE (and others) would love to give advice. I just didn’t want to convene a group until we had someone looking to start on this.
- JH – That would be great in the future. MShirts was also looking to convene a group for this; I was looking for advice before I go to that.
- DM – I think the single best person with actionable advice on this would be David Lebard
- CC – My experience with protein FF has been:
  - Because we have experimental data on lipids, get that into benchmarking pipeline as soon as possible. In the protein FF project we’ve found that fitting well to QM often leads to WORSE agreement with expt
  - It’s a good idea to benchmark in tiers - to basically have cheap benchmarks go first before you kick off more expensive longer-scale benchmarks.
JH – When I’m benchmarking, one of the big problems with Sage lipids is that they have slow dynamics/kinetics and structural deficiencies. Would one be better to fix first?
- LW – I think this one is best delegated to the experts.
- PB – Followup Q to the group - Are we still aiming for a unified FF that does small mols and lipids, or are we OK with a lipid only FF?
- DM – I think unified but we’re willing to unify later. I use the term “lipid FF” but what I mean is an extended version of Sage to BETTER cover lipids.
- JH – Agree, same here.
JH – For determining SMIRKS patterns hierarchically - Any insight on when to split a parameter? Are there people who have been working on this?
- LW – BW has been splitting parameters based on multiplicities, possibly not as general as you like. TGokey’s BeSMARTS work is an automated way to do parameter splitting. LM has been doing manual parameter splits though.
- LM – I usually split params when I see them covering very different chemistries - either looking at mols, or seeing angles that have different values in real structures from their parameter’s equil angle. I have a tool to visualize which mols/angles they correspond to. E.g. there was a parameter that I split that applied the same torsion parameter to both planar fused conjugated ring systems and to spiro compounds. But generally I’m looking for big differences between parameter equil value and real/QM structures.
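The heuristic LM describes (look for big differences between a parameter's equil value and the real/QM geometry) could be sketched roughly as follows. This is not LM's visualization tool: `flag_split_candidates`, the 10 degree threshold, and the parameter ids and angles in the example are all hypothetical.

```python
def flag_split_candidates(equilibrium, observed, threshold_deg=10.0):
    """Flag parameters whose QM angles sit far from the parameter's equilibrium angle.

    equilibrium: dict parameter id -> equilibrium angle in degrees.
    observed:    dict parameter id -> list of (molecule name, QM angle in degrees).
    threshold_deg is an arbitrary placeholder cutoff.
    Returns dict parameter id -> list of (molecule, angle) outliers.
    """
    flagged = {}
    for pid, samples in observed.items():
        outliers = [
            (mol, angle)
            for mol, angle in samples
            if abs(angle - equilibrium[pid]) > threshold_deg
        ]
        if outliers:
            flagged[pid] = outliers
    return flagged


# Hypothetical data: "a_ring" is applied to both a 6- and a 4-membered ring.
equil = {"a_ring": 112.0, "a_chain": 110.0}
qm = {
    "a_ring": [("six_ring", 111.0), ("four_ring", 90.5)],
    "a_chain": [("butane", 112.5)],
}
print(flag_split_candidates(equil, qm))
```

The flagged molecules are then the ones to inspect by eye to decide whether they represent a distinct chemistry that deserves its own parameter.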
JH – I’m meeting with a group of lipid experts in 10ish days - I’ll use this to refine the questions I put to them. Thanks!
