Experimenting with selecting different concentrations of mixtures for enthalpies of mixing, and weighting
Date: Oct 9, 2025
Most recent FF release: 2.3.0rc2
Background reading:
Problem
At present, we fit vdW terms to a combination of densities and enthalpies of mixing, and as expected we generally see improvement across both properties. However, the distributions of the results look very different: whereas densities are relatively well behaved, enthalpies of mixing are spread fairly widely, with large errors.
We know that many of the high-error compounds are nitrogen-containing compounds with water, and we are experimenting with splitting the N vdW term to increase the complexity. However, there are a couple of other directions we could consider.
Firstly, the rescaling factors of 0.05 g/mL for densities and 1.6 kJ/mol for enthalpies of mixing may weight gradient contributions more heavily towards densities. Below we investigate the per-property gradient contributions to the epsilon parameter of the sp2 carbon vdW term, for the Sage 2.0 fit. The contributions from densities are noticeably higher than those from enthalpies of mixing.
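To make the weighting concrete, here is a back-of-the-envelope sketch of how denominators enter a ForceBalance-style least-squares objective. The residual and sensitivity numbers below are invented for illustration, not taken from an actual fit:

```python
# Each property contributes ((calc - ref) / d)^2 to the objective, so the
# gradient with respect to a parameter theta scales as 1/d^2:
#   d(chi^2)/d(theta) = 2 * (calc - ref) / d^2 * d(calc)/d(theta)
def grad_contribution(residual, denom, sensitivity):
    return 2.0 * residual / denom**2 * sensitivity

# Hypothetical residuals with unit parameter sensitivity (made-up numbers):
g_density = grad_contribution(residual=0.01, denom=0.05, sensitivity=1.0)  # density, g/mL
g_hmix = grad_contribution(residual=0.3, denom=1.6, sensitivity=1.0)       # Hmix, kJ/mol
print(g_density, g_hmix)
```

With these (plausible-magnitude) residuals the density contribution comes out roughly 30x larger, purely because the density denominator is small relative to typical residuals.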
The other issue may be in how we currently select data points. We currently try to select binary enthalpies of mixing at regular concentration intervals: 25/75, 50/50, 75/25. However, enthalpy of mixing curves are rarely so symmetric, and usually the curve is skewed in a particular direction. One hypothesis is that this can confuse the optimizer; for example, in the scenario where the experimental peak is around 60/40 (e.g. o-toluidine/*-xylene below), the experimental values at 50/50 and 75/25 can look very similar.
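The 50/50 vs 75/25 ambiguity can be illustrated with a skewed two-parameter Redlich-Kister expansion. The coefficients below are invented for illustration, not fit to the o-toluidine/xylene data:

```python
import numpy as np

# Hmix for a binary mixture from a two-parameter Redlich-Kister expansion:
#   Hmix(x) = x * (1 - x) * (A0 + A1 * (2x - 1))
# A nonzero A1 skews the curve away from a symmetric 50/50 peak.
A0, A1 = 1000.0, 800.0  # J/mol, invented coefficients


def hmix(x):
    return x * (1.0 - x) * (A0 + A1 * (2.0 * x - 1.0))


xs = np.linspace(0.01, 0.99, 981)
x_peak = xs[np.argmax(hmix(xs))]
# The peak sits near x ~ 0.65, while Hmix(0.5) and Hmix(0.75) differ by
# only a few percent, so fixed 50/50 and 75/25 targets look very similar.
print(x_peak, hmix(0.5), hmix(0.75))
```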
Possible research questions:
Does selecting data points to include the peak of the enthalpy of mixing curve improve performance on enthalpy of mixing data?
Does increasing weights improve enthalpy of mixing data?
General workflow
Reference repo:
The general workflow of an FF fit here is to:
Download and curate data from the NIST ThermoML database.
Run a vdW re-fit to mixtures of densities and enthalpies of mixing.
Typically, we would do a valence re-fit afterwards; but for an experimental project like this one, we can stop and examine the improvement on training data to get a gauge for how promising the approach is for now.
I’d recommend starting from Sage 2.2.1, just because that’s our most recent full release, and replacing the charge model with AshGC. This requires removing the ToolkitAM1BCC handler and adding a NAGLCharges handler, which will need the most recent version of the OpenFF force fields package. I’ve shown an example of programmatic editing below; alternatively, you could just modify the OFFXML directly.
>>> from openff.toolkit import ForceField
>>> ff = ForceField("openff-2.2.1.offxml")
>>> ff.deregister_parameter_handler("ToolkitAM1BCC")
>>> # get_parameter_handler creates the handler with the given kwargs; note that
>>> # calling add_parameter_handler("NAGLCharges") first would cause the kwargs
>>> # below to be silently ignored, leaving the handler unconfigured.
>>> ff.get_parameter_handler("NAGLCharges", handler_kwargs=dict(model_file="openff-gnn-am1bcc-1.0.0.pt", model_file_hash="7981e7f5b0b1e424c9e10a40d9e7606d96dcd3dd2b095cb4eeff6829f92238ee", version="0.3"))
>>> ff.to_file("test.offxml")
Environment
I suggest creating a new environment with the most up-to-date versions of the OpenFF infrastructure stack:
OpenFF Evaluator (0.5.x)
OpenFF Toolkit
OpenFF Interchange
loguru (for nice logging) and tqdm (progress bars)
OpenEye toolkits (OpenFF has a license); these parse molecules much faster than RDKit
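A conda/micromamba environment file along these lines should be a reasonable starting point. This is a sketch: package names are from conda-forge (plus the openeye channel, which requires the license to be configured), and pins should be adjusted to the current stack:

```yaml
# Hypothetical environment sketch; adjust pins as needed.
name: hmix-refit
channels:
  - conda-forge
  - openeye
dependencies:
  - python =3.11
  - openff-evaluator =0.5.*
  - openff-toolkit
  - openff-interchange
  - openff-forcefields
  - forcebalance
  - loguru
  - tqdm
  - openeye-toolkits
```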
Downloading and curating data
Start from the cleaned data set processed here to avoid having to re-download the entire ThermoML NIST database. I’ve linked the README that describes how the data was cleaned/fixed.
I also suggest starting from the version that has high-viscosity components filtered out, as we’ve had trouble equilibrating those properties. I’ll attach both CSVs here: thermoml-cleaned is the cleaned dataset, and clean-filtered-viscosities is the same dataset with high-viscosity components removed.
Modify this script for curating an intermediate dataset of interest from the input dataset above. Specifically, we’ll need to remove the last line that only selects target states around 25/75, 50/50, 75/25% mixtures:
We may also want to reconsider the property filter on line 72 that only selects properties where both densities and enthalpies of mixing are present. What this effectively does is equalize the proportions of densities and enthalpies of mixing; otherwise there are far more densities available. However, it does exclude some chemical diversity, and there’s almost certainly a better way to equalize the proportions.
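One simple alternative, sketched below, is to keep all enthalpies of mixing and randomly downsample densities to match. The "Type" column name and property labels are assumptions about the cleaned CSV, not its actual schema:

```python
import pandas as pd

# Hypothetical sketch: equalize density / enthalpy-of-mixing proportions by
# downsampling densities, rather than requiring both properties per system.
def balance_properties(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    hmix = df[df["Type"] == "EnthalpyOfMixing"]
    dens = df[df["Type"] == "Density"]
    # Downsample densities to roughly the number of Hmix points.
    n = min(len(dens), len(hmix))
    dens = dens.sample(n=n, random_state=seed)
    return pd.concat([hmix, dens]).reset_index(drop=True)
```

This keeps the full chemical diversity of the Hmix data while still controlling the density/Hmix ratio.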
Modify this script for curating an actual training set from the intermediate set above. A couple of notes: lines 131-140 with the SelectNumRepresentation filter should be removed, since they require a special branch of OpenFF Evaluator and, moreover, have the undesired effect of filtering out pure densities. Here as well, the SelectDataPointsSchema with the target states will need to be removed to avoid the 25/75, 50/50, 75/25 selections.
Here is where we’ll need to figure out how to select new data points for the enthalpy of mixing to capture the shape of the curve. If we want to use the same filter/schema framework as the other filters, we’ll unfortunately need to fork Evaluator and work off a branch, similar to the changes I made in the
branch. Pydantic, which underpins the Evaluator filter classes, doesn’t play nicely with custom filters defined in the script. If we just want to filter inside the script, that’s probably easiest.
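As a possible starting point for the selection itself, here is a sketch that keeps, for each binary system, the measured point nearest the peak of the curve plus its neighbors. The column names are invented, and real data would also need grouping by temperature/pressure and deduplication:

```python
import pandas as pd

# Hypothetical sketch: select Hmix points bracketing the peak of each
# binary's mixing curve, instead of fixed 25/75, 50/50, 75/25 states.
def select_around_peak(df: pd.DataFrame, n_flank: int = 1) -> pd.DataFrame:
    selected = []
    for _, group in df.groupby("Substance"):
        group = group.sort_values("MoleFraction1").reset_index(drop=True)
        # Index of the largest-magnitude Hmix measurement (the "peak").
        peak = group["EnthalpyOfMixing"].abs().idxmax()
        lo = max(peak - n_flank, 0)
        hi = min(peak + n_flank, len(group) - 1)
        selected.append(group.loc[lo:hi])  # .loc slicing is inclusive
    return pd.concat(selected).reset_index(drop=True)
```

For the skewed o-toluidine/xylene-style curve discussed above, this would pick up the ~60/40 peak region rather than treating 50/50 and 75/25 as equally informative.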
Aiming for ~1.0-1.3k training properties is consistent with the size of datasets we’ve used before; as discussed, we can expand with more GPU time.
One step I haven’t included here, but which is helpful, is to exclude any properties without uncertainty/error values; missing uncertainties cause Evaluator to bug out.
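That filter is a one-liner; the "Uncertainty" column name below is an assumption about the curated CSV:

```python
import pandas as pd

# Sketch: drop properties with missing (or zero) uncertainty values before
# building the training set, to avoid the Evaluator errors mentioned above.
def drop_missing_uncertainty(df: pd.DataFrame) -> pd.DataFrame:
    mask = df["Uncertainty"].notna() & (df["Uncertainty"] > 0)
    return df[mask].reset_index(drop=True)
```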
Fitting
Fitting happens in a two-stage workflow where we split the equilibration and pre-equilibrated simulation steps, to speed up equilibration and the production of uncorrelated samples for calculating physical properties.
Scripts for equilibrating are in the link below. Lily can copy over an existing stored_data directory with existing equilibrated boxes to avoid repeating work, if provided a destination path on UCI or Iris (it’s multiple GB, so too big to attach here). Coming up with a better way to do this is on the to-do list as well. Basically, here we ask Evaluator to initialize coordinates with Packmol, minimize, then equilibrate for up to 200 ns or until some number of uncorrelated samples of densities and potential energies have been drawn.
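The "uncorrelated samples" criterion can be illustrated with a simplified statistical-inefficiency estimate. This is a sketch of the idea only; Evaluator (via pymbar) uses a more careful estimator:

```python
import numpy as np

# Estimate the statistical inefficiency g of a time series; the effective
# number of uncorrelated samples is then N / g.
def statistical_inefficiency(series) -> float:
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    n = len(x)
    var = x.var()
    if var == 0.0:
        return 1.0
    g = 1.0
    for t in range(1, n):
        # Normalized autocorrelation at lag t.
        c = np.dot(x[:-t], x[t:]) / ((n - t) * var)
        if c <= 0.0:  # truncate the sum at the first non-positive estimate
            break
        g += 2.0 * c * (1.0 - t / n)
    return max(g, 1.0)

rng = np.random.default_rng(0)
white = rng.normal(size=2000)  # uncorrelated data: g ~ 1, n_eff ~ N
n_eff = len(white) / statistical_inefficiency(white)
```

In the equilibration workflow, production would continue until n_eff for the density and potential energy passes the requested threshold.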
Note: this workflow can have hidden bugs, the most annoying of which is still open. If this bug pops up, you won’t see an error in the equilibration step; it only appears when you try to re-fit from the equilibrated box:
Scripts for re-fitting are in the link below. Lily did this on Iris so happily there are also Iris run scripts available.
One open issue is the prior width/rescaling factor to set for
constrained_epsilon. If left unspecified, as in the optimize.in file there, ForceBalance uses a value of 1.297 (not sure where that comes from). This results in the gradient contributions to epsilon being weighted higher than those to rmin_half, which may be why epsilon changed so much in the Sage 2.3.0rc2 fit. I’d suggest experimenting with setting the constrained_epsilon prior to 0.5.
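If we do set it explicitly, the prior goes in the priors section of the $options block in optimize.in. Something along these lines; treat this as a sketch, since the exact key depends on how the constrained_epsilon attribute is exposed to ForceBalance:

```
# Inside the $options block of optimize.in (key name uncertain):
priors
   vdW/Atom/constrained_epsilon : 0.5
/priors
```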
Scripts for looking at various aspects of the refit are in