Current takeaways from Finlay's work on smee + bespoke and Jen's work

- Training to SPICE with linearized bonds and angles, and Sage torsions starting from 0, gets performance close to Sage, with a set of validated hyperparameters (learning rate and so on), in 48 GPU hours, or 6 hours with mini-batching.
- There is much interdependence between linearized harmonics, scaling, learning rate, etc. A rigorous comparison has not been done.
- Automatically generating torsion SMARTS improves performance over Sage, although it results in an order of magnitude more parameters.
- Automatically generating torsion + angle SMARTS improves performance again, but adds substantially more parameters (depending on the specificity and features included).
- Bespoke work suggests the need for optimized geometries (which Jen's work is also currently backing up).

Finlay questions / Jen questions:

- DC: we saw a lot of issues with parameter drift over time, so angle and improper parameters would drift over time even though the energy loss was drifting.
- LW: were you regularising?
- FC: Not in general. I played around with regularising impropers. We saw that the impropers are very tightly coupled to the angles, especially the eq values drifting off.
- FC: also, the linearisation is supposed to balance between the force constant and equilibrium values. On the impropers, they are consistently improved no matter what we do, even if we train with Sage types.
- LW: some of the equilibrium values are going to 0 in Jen's fits, which is worrying.
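FC's point about the linearisation balancing the force constant against the equilibrium value can be illustrated with a toy version. The `(b1, b2)` reference lengths and the exact parameterisation below are illustrative assumptions for the sketch, not necessarily the form smee uses:

```python
import numpy as np

# Standard harmonic bond: U = 0.5 * k * (r - r0)**2. Gradients w.r.t. k and
# r0 have very different scales, which can destabilise Adam-style fits.
def harmonic(r, k, r0):
    return 0.5 * k * (r - r0) ** 2

# One possible linearisation (a sketch, not the confirmed smee form): pick
# fixed reference lengths b1 < b2 and train (k1, k2) instead, with
#   k  = k1 + k2
#   r0 = (k1 * b1 + k2 * b2) / (k1 + k2)
# The force dU/dr = (k1 + k2) * r - (k1 * b1 + k2 * b2) is then *linear* in
# the trainable parameters, so their gradients have comparable scales.
def linear_harmonic(r, k1, k2, b1=1.0, b2=2.0):
    k = k1 + k2
    r0 = (k1 * b1 + k2 * b2) / k
    return 0.5 * k * (r - r0) ** 2

# The two forms describe the same potential for matching parameters:
k, r0, b1, b2 = 500.0, 1.5, 1.0, 2.0
k2_ = k * (r0 - b1) / (b2 - b1)  # solve k1 + k2 = k, k1*b1 + k2*b2 = k*r0
k1_ = k - k2_
r = np.linspace(1.2, 1.8, 7)
assert np.allclose(harmonic(r, k, r0), linear_harmonic(r, k1_, k2_, b1, b2))
```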
Aligning goals
Open questions (not necessarily answered by this work)
- Optimized geometry data – where do we get it?
  - Ideally all our data is at the same level of theory (and ideally, for evaluating workflows using current OpenFF benchmarks, it's at the benchmark level of theory).
  - SPICE: 1M+ conformations will be expensive to recalculate.
  - Could we use the OMol25 dataset? They claim to have a) recomputed the entirety of each community source, which includes SPICE, and b) optimized the GEOM dataset. Does that give us enough coverage of the space? Not sure what elements are in GEOM.
  - Note – no Hessians in OMol25, though.
  - Experiment: combine the SPICE data in OMol25 + a selection of the optimized GEOM data and see where that gets us?
  - LW: had a look at the GEOM data in OMol25; it looks like it covers all the Sage elements.
  - DC: note dataset filtering is important to remove unconverged QM / very high-energy points.
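A minimal sketch of the kind of filtering DC mentions: drop conformers far above the minimum-energy conformer of the same molecule. The 30 kcal/mol cutoff and the function name are illustrative, not an OpenFF default:

```python
import numpy as np

HARTREE_TO_KCAL = 627.509  # standard conversion factor

def filter_conformers(energies_hartree, cutoff_kcal=30.0):
    """Return a boolean mask keeping conformers within `cutoff_kcal` of the
    lowest-energy conformer. The cutoff value is illustrative only; very
    high-energy (or unconverged) QM points can otherwise dominate the loss."""
    e = np.asarray(energies_hartree) * HARTREE_TO_KCAL
    return (e - e.min()) <= cutoff_kcal

# Three conformers of one molecule; the last sits ~50 kcal/mol high.
energies = [-154.100, -154.095, -154.020]
mask = filter_conformers(energies)
print(mask)  # third conformer is rejected
```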
- Can we extend the torsion specification approach to bonds and angles (and ideally cluster parameters afterwards)? (Jen's project)
  - Is it practical to try to cluster the torsions?
  - FC: It's easy to extend to bonds + angles (and shouldn't be too hard to extend to impropers). I've run some fits with fairly specific angles (and not very specific bonds) and torsions, which improved performance on the industry benchmark again compared to just specific torsions.
  - Should we extend to impropers? LW: IMO yes! Our impropers don't do well (see the Sage 2.0 paper; also someone gave a talk about this a couple of years ago…).
- How do we unify this approach with a workflow where we may want to keep some parameters frozen, i.e. particularly alkane and/or protein torsions?
  - a) Firstly, is this necessary to keep alkane torsion profiles correct? How does the existing smee-spice FF do?
  - b) Could we ensure we have fairly specific types for these cases, then just exclude them from the relevant parameter config during training? We avoid training linear torsions currently (FC).
  - JC: if you defined alkane torsions as just [CX4][CX4][CX4][CX4] that's quite generic, but Finlay's recursive SMIRKS also has specified hydrogens on them.
  - LW: are the Hs on the central bond?
  - FC: the least specific was just the central bond; the most specific had everything neighbouring all atoms. So the most specific would distinguish between CX4H2 on atom 1 and CX4H1 on atom 1. It does fine in the low-energy areas for the alkane data; the energy barriers are a bit off.
  - DC: ideally you wouldn't need a wizard to come in and define what things need to be kept constant between FFs. You could just have protein and lipid targets.
- Can we improve on the current hierarchical approach for generating specific SMARTS patterns?
  - Continuous typing is likely the future, though it seems sensible to have a shot with this approach given the relatively low effort with SMEE.
  - Other training schemes are possible, for example hierarchical training where we initially train on all data with the least specific SMARTS, then repeat training with more specific SMARTS added and some regularisation on the less specific SMARTS, etc. May be over-complicated.
  - LW: is this for regularisation, or to make SMIRKS that smartly become more specific?
  - FC: the issue we were having was less specific SMARTS that didn't have a lot of data. I wondered if I could initially train everything with non-specific types and regularise at each stage to the one before. This way we can avoid the SMARTS with very little coverage.
  - JC: I had the same thought that if you go too specific to start off with, that's pigeonholing yourself into what is represented in the dataset, and you'll do poorly on something that hasn't been seen before… On the "too many types" question, this may depend on specificity levels, so we have multiple specificity levels fit.
  - FC: we could try just having two levels, one very specific and one not, and see how many parameters we end up with there.
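FC's staged idea (regularise each level toward the level before) could look something like the following toy loss; `lam` and all names here are hypothetical, not from any existing fitting code:

```python
import numpy as np

def staged_loss(pred, target, theta_specific, theta_parent, lam=0.1):
    """Data loss plus an L2 penalty pulling each specific parameter toward
    its (frozen) parent from the previous, less specific training stage."""
    data_loss = np.mean((pred - target) ** 2)
    # Poorly-covered specific types see little data gradient, so the penalty
    # keeps them anchored near the parent value instead of drifting freely.
    reg = lam * np.sum((theta_specific - theta_parent) ** 2)
    return data_loss + reg

# With specific == parent the penalty vanishes and only the data loss remains.
loss = staged_loss(np.array([1.0, 2.0]), np.array([1.0, 2.5]),
                   theta_specific=np.array([100.0]),
                   theta_parent=np.array([100.0]))
assert abs(loss - 0.125) < 1e-12
```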
- How many types are too many types?
  - LW: good question… one specific concern with torsion types is ill-posing, e.g. torsion A and torsion B are always fit together during training so there is no unique solution, but they can arise independently in test/real data.
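LW's ill-posing concern can be seen in a toy linear fit (the numbers are purely illustrative):

```python
import numpy as np

# If torsion types A and B always appear together in the training molecules,
# the design matrix has duplicated columns and only the sum k_A + k_B is
# determined by the data.
X = np.array([[1.0, 1.0],   # every training row activates A and B together
              [2.0, 2.0],
              [3.0, 3.0]])
y = np.array([3.0, 6.0, 9.0])

print(np.linalg.matrix_rank(X))  # 1, not 2: no unique solution exists

# Any (k_A, k_B) with k_A + k_B = 3 reproduces the training data exactly:
for kA, kB in [(3.0, 0.0), (1.5, 1.5), (-10.0, 13.0)]:
    assert np.allclose(X @ np.array([kA, kB]), y)
# A test molecule containing A but not B then sees an arbitrary k_A.
```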
- Should we regularise?
- Is an altered functional form out of scope (I assume so)? Mainly thinking about improving impropers to something harmonic. Currently improper force constants are badly behaved and strongly coupled to angles.
  - LW: agree; Jessica Maat started looking at this before she left. Would say it's out of scope for Sage 2.4, but looking into this would be great.
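For context on the harmonic-improper idea, a toy comparison of a periodicity-2, phase-180° periodic improper term (the Sage-style form; the force constant below is illustrative) against a curvature-matched harmonic term:

```python
import numpy as np

# Out-of-plane angles near planarity, +/- 20 degrees.
phi = np.deg2rad(np.linspace(-20.0, 20.0, 41))

# Periodic improper: U = k * (1 + cos(2*phi - 180 deg)) = k * (1 - cos(2*phi)).
k_per = 1.1  # kcal/mol, illustrative value
u_periodic = k_per * (1.0 + np.cos(2.0 * phi - np.pi))

# Harmonic alternative with matched curvature at phi = 0: for small phi,
# k * (1 - cos(2*phi)) ~ 2 * k * phi**2, so set k_harm = 2 * k_per.
k_harm = 2.0 * k_per
u_harmonic = k_harm * phi ** 2

# Near planarity the two forms agree to second order; they diverge at larger
# out-of-plane angles, where the coupling to angles becomes the real issue.
assert np.allclose(u_periodic, u_harmonic, atol=0.02)
```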
- Experimentation with other parts of the process? Other optimizers? Simulated annealing / basin hopping?
  - FC: played a bit with the L-M optimizer; we got similarish results to Adam. The parameters look a bit different. This was with bespoke valence training.
- Training to dimer data?
Experiments

- Q1: does optimizing to optimized-geometry data and torsiondrive data improve fits? (LW has started this – taking over Jeff's BTS project.)
- Q2: does optimizing to the SPICE data in OMol25 + the GEOM data in OMol25 improve fits? – JC to start.
- Q3: restricting the specification approach to just bonds and angles – FC volunteered to look at this. What features do we use? Symmetry (e.g. carboxylates)? Hyperspecified types are more prone to that; we need to do some work on making sure we don't mess up symmetries.
  - JC: the FF I'm going to send you won't have this issue because bond types are generalized.
  - LW: NAGL solves this by averaging all resonance forms; we could maybe use this to validate or diagnose symmetry issues.
- Q4: using two layers of specificity – one general, one very specific – and seeing if that improves outliers.
- Q5: the regularisation. Should make sure everyone's using the same mini-batching + learning rates.
- Q7: comparison of linearised harmonics to non-linearised harmonics, nailing down some of the hyperparameters (FC).
  - FC (on the workflow): I think we have a good solution with this now for bespoke fitting.
  - DC: from memory it was a game changer bringing this in.
  - LW: agree; it would be useful to compare minibatch + normal vs minibatch + linearised and identify how much improvement came from there.
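For the Q5/Q7 comparisons, a bare-bones mini-batching loop on a toy least-squares problem (all numbers illustrative, not any of the real fitting code) shows the kind of controlled setup being proposed: same data and learning rate, only the batch size varies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy noiseless linear "force field": y = X @ true_w.
X = rng.normal(size=(512, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

def fit(batch_size, lr=0.1, epochs=50):
    """Plain mini-batch gradient descent on the mean-squared-error loss."""
    w = np.zeros(3)
    for _ in range(epochs):
        order = rng.permutation(len(X))  # reshuffle each epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w

# Mini-batching takes many more (cheaper) steps per epoch, which is where
# the 48 GPU hours vs 6 hours difference noted above comes from.
w_mini = fit(batch_size=32)
assert np.allclose(w_mini, true_w, atol=1e-3)
```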
Meetings – keep in Newcastle for now

Work