Binary Mixture Data Feasibility Study
Team | @Simon Boothroyd @Owen Madin |
---|---|
Status | in progress |
GitHub | https://github.com/openforcefield/nistdataselection/projects/1 |
Scripts / Data | |
Overview
In this study we aim to understand more rigorously whether it is more beneficial to optimize the non-bonded interaction parameters of a force field against pure data only, binary mixture data only, or a combination of both. The emphasis here is on density data (including density and excess molar volume) and enthalpy data (including enthalpy of vaporization and enthalpy of mixing).
We anticipate that training a force field on mixture data will improve its performance at reproducing mixture properties while slightly degrading its performance on pure properties. Vice versa, we would expect that training on pure properties would improve its performance on pure properties while slightly degrading its performance on mixture properties. Here we aim to identify how much mixture properties improve relative to the degradation of pure properties when training on mixtures, compared to how much pure properties improve relative to the degradation of mixture properties when training on pure properties.
In an attempt to make this study as systematic as possible we have chosen to design it as if we had all of the data of interest available to us, rather than just including all data we have available. This will involve sourcing data from outside of ThermoML (mainly from the sources reviewed in [1]). While ThermoML includes a significant amount of information, it does not cover a particularly diverse region of chemical space, nor does it contain large quantities of certain properties which have historically been important in force field optimization (namely enthalpies of vaporization). This should enable the results and understanding generated here to facilitate more long term planning of what data we should focus on collecting (either from the literature, our industry partners, or in house via directed, automated experiments).
The intention is to keep the scope of the study as tight as possible so as to more readily enable the de-convolution of the effects of including different data types. As such, this study will be initially limited to systems composed of only alcohols and esters (the two species for which we have the most density and enthalpic mixture data available), and to data points measured at close to ambient (, ) conditions. This will then be expanded to other mixtures to ensure that the initial results generalise to other types of systems.
In general the studies proposed will proceed by:
Sourcing a training set of molecules and selecting particular data points for each system of interest.
Optimizing against the training set using ForceBalance in combination with the OpenFF Evaluator, starting from the openff-1.0.0 force field parameters.
Benchmarking the optimized force field against a test set of , , , and data points measured for an array of alcohols, esters and acids (here is used to indicate a property as a function of the composition of a binary mixture). This will be extended to other functionalities as the study starts to generalise, such as amines.
Note: The scripts created to facilitate this study can be found in the nistdataselection repository in the studies folder.
Compound Selection
We enforce the criterion that all compounds selected for the studies must have available all of , , and (possibly obtained through the conversion of and where is not directly available) data.
While it has been hypothesized that we should only be interested in choosing molecules so as to achieve a uniform coverage of SMIRKS patterns, this may be problematic. The same SMIRKS pattern (especially the ones without specificity) can match multiple different chemical environments within molecules, and may require different data to be more effectively constrained. This would likely not be an issue in the regime of an excess of data, but given the scarcity of data (especially diverse data) it would be good to remove the possibility of this having an effect.
Alcohol-Ester (+Acid)
The initial data set was built by selecting a set of six common alcohols (methanol, ethanol, propanol, isopropanol, isobutanol and tert-butanol), and for each alcohol choosing approximately two esters, one larger and one smaller, for which both and data are obtainable. The components selected according to these criteria are then intended to be used as the components in the pure data only study.
Extension to Extra Functionalities
To ensure that the initial results generalise to other classes of systems, we aim to extend the study from alcohols and esters (two polar classes of molecules, which are h-bond donor / acceptors and h-bond acceptors respectively) to a broader spectrum of functionalities.
In choosing the extra data, we again enforce the restriction that all molecules chosen must have available all of , , . Given difficulties in optimising against as highlighted by the alcohol-ester studies, we choose to remove this property from the optimisations and only consider it as part of benchmarking.
The number of systems which meet these criteria is limited, and will restrict which extra functionalities may be included. Ideally, at least three extra classes of mixtures will be considered:
Mixtures of a polar and a (much) less polar component. This will complement the polar-polar alcohol and ester mixtures and ensure that the results are not strictly limited to more miscible systems (this could possibly include systems where each component is an H-bond acceptor).
Mixtures of two relatively non-polar liquids. Such systems should be relatively miscible but in a different way to polar mixtures (mainly van der Waals interactions as opposed to strong H-bonds).
Table 1. The number of unique systems for which the different types of data (and combinations of such) are available. The table includes counts for all combinations of mixtures containing a spectrum of different functionalities (ester, ketone, etc.). Only mixture types with at least five Hmix(x) + rho(x) entries are shown.
Environment 1 | Environment 2 | Hmix(x) | rho(x) | Hmix(x) + rho(x) |
---|---|---|---|---|
ester | halogenated | 123 | 184 | 123 |
ester | alkane | 104 | 92 | 73 |
ether | aromatic | 46 | 143 | 30 |
alcohol | ester | 48 | 134 | 28 |
alcohol | aromatic | 49 | 212 | 25 |
alcohol | heterocycle | 28 | 146 | 24 |
alcohol | ether | 36 | 136 | 23 |
aromatic | heterocycle | 30 | 100 | 23 |
alcohol | alkane | 38 | 105 | 21 |
aromatic | aromatic | 49 | 121 | 21 |
halogenated | amide | 24 | 37 | 20 |
aromatic | alkane | 35 | 130 | 16 |
ketone | heterocycle | 17 | 28 | 15 |
alcohol | alcohol | 44 | 193 | 15 |
ether | halogenated | 17 | 66 | 13 |
amine | aromatic | 14 | 35 | 12 |
halogenated | heterocycle | 21 | 53 | 11 |
ketone | amine | 20 | 13 | 10 |
ether | alkane | 34 | 51 | 9 |
ether | ketone | 9 | 14 | 9 |
amine | heterocycle | 11 | 45 | 8 |
amine | alkane | 16 | 39 | 8 |
heterocycle | heterocycle | 11 | 16 | 7 |
ester | ester | 10 | 25 | 6 |
amide | aromatic | 13 | 62 | 6 |
alcohol | amine | 10 | 125 | 6 |
alcohol | halogenated | 7 | 92 | 6 |
ester | aromatic | 7 | 99 | 6 |
heterocycle | alkane | 22 | 40 | 5 |
alcohol | ketone | 38 | 29 | 5 |
As is shown in Table 1, there is only a limited selection of mixture types for which the requisite data is available. In particular, there is only a limited selection of mixture types for which there are 10 or more unique substances (the number used for the alcohol-ester study). The aim is to include at least 10 (or as close to this as possible) unique substances per set of interaction types in an attempt to ensure the results are significant.
As such, the most promising functionalities to include which meet the above three classes of mixtures would be:
mixtures of ethers and ketones (acceptors only) and mixtures of alkanes and alcohols (less polar and polar).
mixtures of alkanes and ethers (apolar and only marginally polar).
The available molecules to select from are listed in the following attachments:
Parameters to Optimize
Alcohol-Ester (+Acid)
Data sets containing only alcohols, esters and acids (containing only C, H and O) will exercise a total of 18 SMIRNOFF parameters (9 different SMIRKS patterns). For these studies we will keep fixed any overly generic hydrogen parameters, as well as the [#1:1]-[#8] parameter, which should stay fixed at epsilon=0.0, allowing a total of 12 parameters (for 6 different SMIRKS patterns) to be optimized.
Exercised but Won’t be Refit:
[#1:1]-[#6X4]-[#7,#8,#9,#16,#17,#35]
[#1:1]-[#6X3](~[#7,#8,#9,#16,#17,#35])~[#7,#8,#9,#16,#17,#35]
[#1:1]-[#8]
Will be Refit:
[#1:1]-[#6X4]
[#6:1]
[#6X4:1]
[#8:1]
[#8X2H0+0:1]
[#8X2H1+0:1]
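The parameter counts quoted above can be checked with a short bookkeeping sketch. It assumes that each vdW SMIRKS pattern carries two fittable attributes, named here epsilon and rmin_half; the attribute names and the helper functions are illustrative and are not taken from the study scripts.

```python
# SMIRKS patterns copied from the two lists above.
FIXED_PATTERNS = [
    "[#1:1]-[#6X4]-[#7,#8,#9,#16,#17,#35]",
    "[#1:1]-[#6X3](~[#7,#8,#9,#16,#17,#35])~[#7,#8,#9,#16,#17,#35]",
    "[#1:1]-[#8]",
]
REFIT_PATTERNS = [
    "[#1:1]-[#6X4]",
    "[#6:1]",
    "[#6X4:1]",
    "[#8:1]",
    "[#8X2H0+0:1]",
    "[#8X2H1+0:1]",
]

# Assumption: two fittable attributes per vdW pattern.
ATTRIBUTES = ("epsilon", "rmin_half")


def count_parameters(patterns, attributes=ATTRIBUTES):
    """Number of individual parameters exercised by a list of patterns."""
    return len(patterns) * len(attributes)


def refit_targets(patterns=REFIT_PATTERNS, attributes=ATTRIBUTES):
    """Enumerate the (SMIRKS, attribute) pairs that would be optimized."""
    return [(smirks, attribute) for smirks in patterns for attribute in attributes]


exercised = count_parameters(FIXED_PATTERNS + REFIT_PATTERNS)
refit = count_parameters(REFIT_PATTERNS)
print(exercised, refit)  # 18 12
```

Listing the (SMIRKS, attribute) pairs explicitly makes it easy to confirm that the fixed hydrogen patterns are excluded before an optimization is set up.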
Only Pure Data
We will perform a set of optimizations including only pure densities and enthalpies of vaporization as our baseline study, given that this is what has been historically used in force field development.
We will train against the same set of compounds as we use in the mixture study, using a single pure density and a single enthalpy of vaporization data point for each compound, measured as close to ambient conditions as possible in an attempt to keep variability as low as possible. This decision was made in order to keep the data set consistent, and to keep the focus on the predictive power of the respective data types rather than on which molecules were included.
While density data for these compounds will be sourced from ThermoML, no corresponding enthalpy of vaporization data is available within that data collection. As such, we intend to source the data externally, using DIPPR as a guide to which data to select, but ultimately retrieving data directly from the original publication so as to avoid infringement issues.
While arguably this training set will be much smaller than the mixture data set, and hence may lead to overfitting, it is challenging to expand this set much further while maintaining the desired systematic nature of the study. Further, any extra data to include in this study would need to be manually sourced from the literature given the unavailability of data in ThermoML. In principle, however, this baseline study can be expanded to include compounds outside of the mixture data training sets if it is found that the pure training set is too small to constitute a fair study.
Chosen Alcohol-Ester Data Set
Only Mixture Data
We will conduct three sets of independent optimizations on:
a training set which includes only binary + data points
a training set which includes only binary + and data points
a training set which includes only binary + data points
In principle and + should provide the same information content (given that they are linear combinations of each other), however they will differ slightly in their contributions to a least squares objective function (and gradient thereof), whereby the information content of + may be higher as it explicitly includes contributions of pure densities, while these are implicit in the case of . By doing both optimizations we aim to determine whether there is any practical difference between the two.
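As a rough illustration of why pure densities are implicit in an excess molar volume, the sketch below computes an excess molar volume from a binary mixture density and the two pure densities using the standard relation V_E = (x_1 M_1 + x_2 M_2) / rho_mix - x_1 M_1 / rho_1 - x_2 M_2 / rho_2. It assumes that the quantities being compared in the paragraph above are the excess molar volume on one side and the mixture plus pure densities on the other; the function names and the numbers used are illustrative placeholders, not measured data.

```python
def excess_molar_volume(x_1, molar_mass_1, molar_mass_2,
                        density_mix, density_1, density_2):
    """Excess molar volume in cm^3/mol.

    Molar masses are in g/mol and densities in g/cm^3; x_1 is the mole
    fraction of component 1 in the binary mixture.
    """
    x_2 = 1.0 - x_1
    mixture_volume = (x_1 * molar_mass_1 + x_2 * molar_mass_2) / density_mix
    ideal_volume = (x_1 * molar_mass_1 / density_1
                    + x_2 * molar_mass_2 / density_2)
    return mixture_volume - ideal_volume


def ideal_mixture_density(x_1, molar_mass_1, molar_mass_2, density_1, density_2):
    """Density the mixture would have if the molar volumes were additive."""
    x_2 = 1.0 - x_1
    mass = x_1 * molar_mass_1 + x_2 * molar_mass_2
    volume = x_1 * molar_mass_1 / density_1 + x_2 * molar_mass_2 / density_2
    return mass / volume


# Placeholder inputs for illustration only (not experimental values).
rho_ideal = ideal_mixture_density(0.5, 32.04, 88.11, 0.79, 0.90)
v_excess_ideal = excess_molar_volume(0.5, 32.04, 88.11, rho_ideal, 0.79, 0.90)
v_excess_dense = excess_molar_volume(0.5, 32.04, 88.11, 1.01 * rho_ideal, 0.79, 0.90)
```

A mixture that is denser than the ideal-mixing estimate gives a negative excess molar volume, and both pure densities enter the result even though only one number is reported.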
The + without study will enable us to explore whether the pure densities contribute significantly to constraining the optimization, or whether including by itself would be sufficient. If this is the case:
optimizations would require less experimental data as we avoid needing pure densities.
we would be able to use in place of for which there is less data available.
optimizations would require fewer simulations.
Initially we plan to include 10 pairs of molecules in the training set (~2 different esters for each alcohol included), and 3 different compositions (25%, 50% and 75%) per pair of molecules (~60 data points in total). This may be expanded if it is found that this set is too small to offer any significant insight.
Chosen Alcohol-Ester Data Set
Pure + Mixture Data
We will conduct three sets of independent optimizations on:
a training set which includes only binary + and + data points
a training set which includes only binary + and + data points
a training set which includes only binary + and data points
Note: We may not end up conducting a number of these studies depending on the results of the Only Mixture Data studies.
We aim to use these sub-studies to explore whether pure data is needed to sufficiently constrain the optimization when mixture data is included, with a focus on whether it is important to include to constrain cohesive energies, or whether is sufficient.
The training set for these studies would be the union of the training sets used in the Only Pure Data and Only Mixture Data studies.
Chosen Alcohol-Ester Data Set
Initial Benchmark Set Selection
The initial benchmark set is expected to be modest in size (~50 pure data points, ~120 mixture data points) so as to be able to rapidly assess the performance of optimisations, but will be complemented by further sets depending on the outcomes of the initial set.
We will be less systematic in the selection of the systems to include in the benchmark set, opting instead to curate a set which covers a diverse set of molecules with pure density and enthalpy of vaporization data points, and binary enthalpy of mixing, excess molar volume, and mass density data points, without enforcing that substances must have all such data available to be included (as was the case for the training sets).
In order to test how well each of the different produced force fields generalise, we initially aim to include binary mixtures of alcohols and alcohols, alcohols and esters (/ acids), and esters (/acids) and esters (/acids).
This will likely be expanded to ethers and other such additional moieties, however this will be done after this initial set has been benchmarked against.
In an attempt to ensure that we are testing the performance of the refit parameters, rather than the full Parsley 1.0.0 force field, we will exclude any
aromatic compounds
compounds containing 3-4 membered rings.
compounds containing alkane chains greater than 6 atoms in length.
Again, this will likely be relaxed in future benchmark sets.
This set will only contain mixtures whereby neither of the components appears in the training set. Future data sets may then be complemented with mixtures which do partially contain training data to further explore interesting results highlighted by this initial set.
Chosen Alcohol-Ester Study Benchmark Set
The results shown in the page were generated against a data set which contained
~110 pure data points (~60:40 split of and , ambient conditions) where none of the substances appeared in the training set.
~320 mixture data points (with roughly equal numbers of , and data points, ambient conditions, three compositions (~25%, ~50%, ~75%) per pair). This included mixtures where neither component appeared in the training set, and alcohol-alcohol and ester-ester mixtures where both components appeared in the training set (the training set only included alcohol-ester mixtures).
Extended Benchmark Set
Similar to the selection of the benchmark set chosen for the initial alcohol and ester study, when selecting the extended benchmark set we opted to be less systematic in choosing the individual substances (namely, we did not require that each substance had available all of the properties of interest) and instead, focused on trying to choose as diverse a set as was possible (given a limited diversity of the available data) which maximally exercised the refit parameters.
We endeavoured to select a set which:
did not contain exactly the substances which were trained on, but did allow substances where individual components appeared in the training set (e.g. if the training set included a mixture of pentanol and hexane, the test set would be allowed to contain pentanol and heptane, and hexanol and hexane, but not pentanol and hexane).
contained substances as distinct as possible from the training set, and from other molecules in the test set.
The test set is to contain , , , and data points for substances (both pure and binary) composed only of alkanes, alcohols, ethers, esters and ketones.
The selection of the benchmark set proceeds as follows:
1. Filter out all of the data points which were measured for:
   - substances that were included in the training set.
   - molecules not composed of C, O and H.
   - molecules with undefined stereochemistry.
   - long chain ethers or alkanes (these are difficult to pack into a simulation box and in general take longer to simulate).
   - molecules which contain 1, 3 di-carbonyl functionality where at least one of the carbonyl groups was a ketone. Substances containing such will likely contain mixtures of keto-enol tautomers, but the ratio in which they will be present isn’t recorded by ThermoML.
2. Cluster all of the available , , data points based on the chemical environments present in the substances that the data was measured for (e.g. cluster all ether-ester data, ketone-alcohol data etc.)
3. For each type of property (e.g. ) and each pair of environments (e.g. alcohol-ester):
   a. Select the substance which is ‘most distinct’ from the currently selected training set and the molecules selected for the test set of this property and environment.
   b. Repeat a. until either 10 substances have been selected, or there are no other substances to select from.
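The clustering by property type and pair of chemical environments described above could be sketched as follows. The record layout (dictionaries with property, substance and environments keys) and the example entries are hypothetical and only serve to show the grouping; they are not the data structures used by the study scripts.

```python
from collections import defaultdict


def cluster_by_environments(data_points):
    """Group data points by (property type, unordered environment pair)."""
    clusters = defaultdict(list)
    for point in data_points:
        # Sorting makes alcohol-ester and ester-alcohol land in one cluster.
        key = (point["property"], tuple(sorted(point["environments"])))
        clusters[key].append(point)
    return dict(clusters)


# Hypothetical records, for illustration only.
records = [
    {"property": "Hmix(x)", "substance": ("pentanol", "methyl acetate"),
     "environments": ("alcohol", "ester")},
    {"property": "Hmix(x)", "substance": ("ethyl acetate", "butanol"),
     "environments": ("ester", "alcohol")},
    {"property": "rho(x)", "substance": ("hexane", "diethyl ether"),
     "environments": ("alkane", "ether")},
]

clusters = cluster_by_environments(records)
```

Each cluster can then be passed independently to the per-property, per-environment selection loop.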
Defining ‘most distinct’
To determine how similar a substance is to another set of substances, we defined a distance metric based on a substance finger print.
For any binary substance composed of components a and b (represented as [a, b]), the substance’s fingerprint is defined as [f(a), f(b)], where f(x) is a function which computes the OpenEye Tree fingerprint of a molecule x.
The distance between any two substances ([a_1, b_1], [a_2, b_2]) is then defined as
min(d(f(a_1), f(a_2)) + d(f(b_1), f(b_2)),
    d(f(a_1), f(b_2)) + d(f(b_1), f(a_2)))
where d is the OETanimoto distance between two fingerprints.
The distance between a given substance mixture_a and a set of mixtures mixture_set is then computed by:
1. Computing the distance between mixture_a and each mixture in mixture_set.
2. Removing the mixture from mixture_set which is closest to mixture_a and adding the distance between the two to the total distance.
3. Repeating step 2 until all mixtures have been removed from mixture_set.
The most distinct molecule from the unselected set is thus chosen by:
1. Selecting the molecule which is 'furthest' away from both the training set and the currently selected test set (which starts off empty), where the distance is defined as:
sqrt(compute_distance_with_set(unselected_substance, training_set) ** 2 +
     compute_distance_with_set(unselected_substance, test_set) ** 2)
2. Moving the selected molecule from the unselected set into the test set.
3. Repeating steps 1 and 2 until either the target number of molecules has been selected, or there are no more unselected molecules to choose from.
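The selection procedure above can be sketched in plain Python as below. This is a minimal illustration under several assumptions rather than the code used in the study: the fingerprint is a stand-in built from SMILES character pairs instead of an OpenEye Tree fingerprint, the distance is taken as one minus the Tanimoto similarity, the second term of the substance distance pairs the components crosswise (a_1 with b_2 and b_1 with a_2) so that component order does not matter, and select_most_distinct is a hypothetical helper name.

```python
import math


def fingerprint(smiles):
    """Stand-in fingerprint: the set of adjacent character pairs in a SMILES."""
    pairs = frozenset(smiles[i:i + 2] for i in range(len(smiles) - 1))
    return pairs or frozenset(smiles)


def tanimoto_distance(fp_1, fp_2):
    """One minus the Tanimoto (Jaccard) similarity of two fingerprint sets."""
    union = fp_1 | fp_2
    if not union:
        return 0.0
    return 1.0 - len(fp_1 & fp_2) / len(union)


def substance_distance(substance_1, substance_2):
    """Distance between two binary substances, ignoring component order."""
    (a_1, b_1), (a_2, b_2) = substance_1, substance_2
    f, d = fingerprint, tanimoto_distance
    return min(d(f(a_1), f(a_2)) + d(f(b_1), f(b_2)),
               d(f(a_1), f(b_2)) + d(f(b_1), f(a_2)))


def compute_distance_with_set(substance, substance_set):
    """Accumulate distances by repeatedly removing the closest set member."""
    remaining = list(substance_set)
    total = 0.0
    while remaining:
        closest = min(remaining, key=lambda other: substance_distance(substance, other))
        total += substance_distance(substance, closest)
        remaining.remove(closest)
    return total


def select_most_distinct(unselected, training_set, target_count):
    """Greedily move the furthest substance into the test set."""
    unselected = list(unselected)
    test_set = []
    while unselected and len(test_set) < target_count:
        best = max(unselected, key=lambda s: math.hypot(
            compute_distance_with_set(s, training_set),
            compute_distance_with_set(s, test_set)))
        unselected.remove(best)
        test_set.append(best)
    return test_set
```

For example, with substances written as pairs of SMILES strings, select_most_distinct([("CCO", "CCCCCC"), ("CO", "CCOCC"), ("CCCO", "CCCCCCC")], [("CCCCO", "CCCCCC")], 2) returns two of the three candidates. Because the set distance is a sum over all members, it grows with the size of the set, so the training and test set terms are not on a common scale unless they are normalised separately.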
The pure substances to include for , , were then mainly chosen as the components selected as part of the mixture properties where available, and components close to those where this was not possible.
Chosen Set
The final set contains ~48 pure data points and ~ 900 mixture data points.
Questions we want to answer via benchmarking/what benchmark sets can we use to achieve this?
In general, to what extent are LJ parameters trained on mixture data transferable to other sets of mixtures?
The first mixture benchmark set (MB1) consists of heat of vaporization, mixture density, and excess molar volume data for alcohol/ester, alcohol/alcohol, alcohol/acid, ester/acid, acid/acid and alcohol/ether mixtures that do not have any commonality with the mixtures that we trained on.
If we train LJ parameters on one type of mixture, how well do those parameters transfer to other types of mixture?
Since we included only alcohol/ester mixtures in our training data, MB1 will allow us to look at the transferability of LJ parameters to other types of mixtures.
Alcohol/alcohol mixtures: We train on mixtures that contain alcohols, but only mixtures of alcohols with esters. To what extent do the alcohol LJ parameters transfer to mixtures that only include alcohols?
Alcohol/ether mixtures: To what extent do the LJ parameters for alcohols, trained on ester mixtures, transfer to mixtures of alcohols and ethers (which have not been trained on at all)? Ethers should be similar to esters, so if they do quite poorly, this will be an issue.
How do benchmarked mixture properties vary as a function of composition?
Through benchmarking, can we identify the extent that transferability affects mixture properties as a function of composition? For example, if we test against alcohol/ether mixtures and we see worse performance as the ether concentration increases, then maybe the alcohol parameters are good, but the ester parameters don’t transfer well to ethers.
MB1 will allow us to explore this, since we should have good coverage in a range of xA=0.2-0.8. We should also consider adding some points in the 0.9-0.95 mole fraction region, to check on that behavior. This could either be as a separate set, or just something we break out in MB1.
Can we identify a “spectrum of transferability” for parameters in mixtures?
For example, within a benchmark set composed of mixtures that have at least one component in the training set (MB2), there are a large number of mixtures with tert-butanol. Assuming that tert-butanol is parameterized reasonably well, by examining the mixture properties and chemical similarity of the other moieties in these tert-butanol sets, can we identify how different a mixture can be before the transferability starts to degrade?
To what extent are mixture properties transferable from training on pure properties only? To what extent are pure properties transferable from training on mixture properties only?
Can we get a sense of the correlation between performance on mixture properties and pure properties (does low error in pure density imply low error in mixture density)? By benchmarking sets parameterized with only pure and only mixture data on MB1 and PB1 (the basic pure data benchmark set), we can analyze this.
How do parameters trained on mixture densities perform on excess molar volumes?
By looking at the subset of MB1 that includes excess molar volumes, can we accurately reproduce these by training on mixture densities? If we can’t, that may point to excess molar volumes not being very useful for us.
References
[1] Majer, V., Svoboda, V., Enthalpies of Vaporization of Organic Compounds (IUPAC Chemical Data).