2025-07-01 FF Fitting Meeting

Participants

  • @Jennifer Clark

  • @David Mobley

  • Julia Rice

  • @Jeffrey Wagner

  • @Lily Wang

  • @Pavan Behara

  • Bill Swope

Recording

Discussion topics

Notes

  • Alberto talk

    • LW – July 29 work?

    • DM – Works for me, I’ll message him

    • LW – I’ll add him to agenda

AshSage updates

  • LW will upload slides here

  • JR (slide 9) – Can you put some bromoalkanes in the training set to fix issues?

    • LW – We can and did, but the other mixture components may not have been diverse enough. The training set had more mixtures of bromoalkanes with alcohols.

  • JR slide 10 – What is the green series above the line? And it would be helpful to show R².

    • LW – Can’t recall what the green series is. I do have R² numbers and can include them next time.

  • JW slide 12 – Could the nonzero values in the bottom two rows have been avoided?

    • LW – Possibly, but it would require changing how we do the training/test split. Some of the exclusions were by design. The main thing I’d do differently is to curate mixture components between groups, e.g. alkyl bromides with alcohols vs. with esters.

  • JC 19 – Is there a difference between ash-nagl charges for alcohols and the previous partial charges (not that partial charges would be worse, but less “enmeshed” with the non-bonded parameters)? Also wonder whether the differentiation between primary/secondary alcohols is sufficiently baked into NAGL models.

    • LW – I haven’t looked at charges for these mols specifically, but ash-nagl charges have performed well for alcohols in general; things like sulfurs are more likely to be where they do worse. I’d fully expect NAGL to learn the difference between primary/secondary/tertiary alcohols.

  • JR 22 – Plotting against RMSE - What is the actual value of the density (to understand the magnitude of the error)?

    • LW – Don’t recall off the top of my head. Would it be helpful to see MUE next time?

    • DM – Maybe % error or % RMSE?

  • JW 25 – These seem to strongly imply that we should pull more of the validation set into training set

    • LW – Yes, could make training much more expensive, but likely worth it. Also this is a little confounded by the breakdowns of components in mixtures.

  • JW 28 – Seems like there were some really large differences between training and testing - often going in the wrong direction altogether. Do you think that that indicates a larger training set is in order?

    • DM: The issue when looking at really generic functional groups like alkanes is that whatever is attached to them will dictate the amount of error, so it’s not that the alkanes themselves are off

    • LW: Like DM said,

    • DM – In doing similar analysis, looking at whether a compound was an alkane or aromatic was incidental since most of the error was contributed by other components/functional groups. So it may be useful to do an analysis to see whether ADDING a functional group ADDS error.

    • JC – In Genentech work, we found that train-test set contamination can be really deceptive, and you can come to a completely different conclusion from the same data depending on how it’s split.

    • LW – Since our vdW terms are so general, we’d hope that overtraining would be easier to keep at bay. So I still think it’s worth doing things like ensuring we have a mixture of alkyl bromides and alcohols in each set.

    • JR – NAGL charging only looks at one molecule, so it’s not aware of the other components of the mixture

      • LW – Right, but the vdW terms are trained in the presence of other components of the mixture.

    • PB (chat) – Regarding David's idea of plotting overrepresented parameters, for Sage we were looking at the ratio of (fraction of mols a parameter is matched to in the subset with large errors, say ddE > 5 kcal/mol or any other metric to pick the subset) to (fraction of mols the same parameter is matched to on the whole set).
      So, any parameter with a value greater than 1 is heavily represented in the large-discrepancy set, and the higher the value, the more the parameter is overrepresented.

      • DM (chat) – Ooh yes I’d forgotten about that. That kind of thing is valuable. And you can also do error bars on it in a straightforward manner.

  • JR 50 – More mols here than in enthalpies of mixing. Not surprising since phys prop data is harder to come by. Wonder whether training on this high a number of compounds is better.

    • LW: CB had a similar question last week; the short answer is that we haven’t performed experiments on dataset sizes.

  • FYI, writing group

    •  
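The overrepresentation ratio PB describes in chat can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the script used for Sage: the function names, the input layout (a dict of molecule id to matched parameter ids, and a dict of molecule id to error), the parameter ids `n1`/`n2`, and the error values are all hypothetical. The percentile bootstrap is only one simple way to get the error bars DM mentions.

```python
import random


def ratio_for(param, mols, matches, errors, threshold):
    """(fraction of high-error mols matching param) / (fraction of all mols matching param).

    Returns None when the ratio is undefined for this set of molecules.
    """
    high = [m for m in mols if errors[m] > threshold]
    n_all = sum(param in matches[m] for m in mols)
    if not high or n_all == 0:
        return None
    n_high = sum(param in matches[m] for m in high)
    return (n_high / len(high)) / (n_all / len(mols))


def enrichment_ratios(matches, errors, threshold):
    """Ratio for every parameter that is matched at least once."""
    mols = list(matches)
    params = set().union(*matches.values())
    return {p: ratio_for(p, mols, matches, errors, threshold) for p in params}


def bootstrap_interval(param, matches, errors, threshold, n_boot=2000, seed=0):
    """Approximate 95% percentile interval from resampling molecules with replacement."""
    rng = random.Random(seed)
    mols = list(matches)
    samples = []
    for _ in range(n_boot):
        draw = rng.choices(mols, k=len(mols))
        r = ratio_for(param, draw, matches, errors, threshold)
        if r is not None:  # skip resamples where the ratio is undefined
            samples.append(r)
    if not samples:
        raise ValueError("ratio undefined in every bootstrap resample")
    samples.sort()
    lo = samples[int(0.025 * (len(samples) - 1))]
    hi = samples[int(0.975 * (len(samples) - 1))]
    return lo, hi


# Hypothetical toy data: molecule id -> matched parameter ids, and error per molecule.
matches = {"m1": {"n1", "n2"}, "m2": {"n1"}, "m3": {"n2"}, "m4": {"n1"}}
errors = {"m1": 8.0, "m2": 1.0, "m3": 6.0, "m4": 0.5}

ratios = enrichment_ratios(matches, errors, threshold=5.0)  # e.g. ddE > 5 kcal/mol
for param in sorted(ratios):
    print(param, ratios[param], bootstrap_interval(param, matches, errors, 5.0))
```

In this toy set `n2` is matched by both high-error molecules but only half of all molecules, so its ratio is above 1, while `n1` comes out below 1. With so few molecules the bootstrap interval is very wide; it only becomes informative on a realistically sized set.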

Action items

Decisions