Binary Mixture Data Feasibility Study

Team

@Simon Boothroyd @Owen Madin

Status

in progress

GitHub

https://github.com/openforcefield/nistdataselection/projects/1

Scripts / Data

Overview

In this study we aim to understand more rigorously whether it is more beneficial to optimize the non-bonded interaction parameters of a force field against pure data only, binary mixture data only, or a combination of both. The emphasis here is on density data (including density and excess molar volume) and enthalpy data (including enthalpy of vaporization and enthalpy of mixing).

We anticipate that training a force field on mixture data will improve its performance at reproducing mixture properties while slightly degrading its performance on pure properties. Vice versa, we would expect that training on pure properties would improve its performance on pure properties while slightly degrading its performance on mixture properties. Here we aim to identify how much mixture properties improve relative to the degradation of pure properties when training on mixtures, compared to how much pure properties improve relative to the degradation of mixture properties when training on pure properties.

In an attempt to make this study as systematic as possible, we have chosen to design it as if all of the data of interest were available to us, rather than simply including all of the data we currently have. This will involve sourcing data from outside of ThermoML (mainly from the sources reviewed in [1]). While ThermoML includes a significant amount of information, it does not cover a particularly diverse region of chemical space, nor does it contain large quantities of certain properties which have historically been important in force field optimization (namely enthalpies of vaporization). This should enable the results and understanding generated here to facilitate more long term planning of what data we should focus on collecting (either from the literature, our industry partners, or in house via directed, automated experiments).

The intention is to keep the scope of the study as tight as possible so as to more readily enable the deconvolution of the effects of including different data types. As such, this study will initially be limited to systems composed of only alcohols and esters (the two species for which we have the most density and enthalpic mixture data available), and to data points measured at close to ambient conditions. It will then be expanded to other mixtures to ensure that the initial results generalise to other types of systems.

In general the studies proposed will proceed by:

  • Sourcing a training set of molecules and selecting particular data points for each system of interest.

  • Optimizing against the training set using ForceBalance in combination with the OpenFF Evaluator, starting from the openff-1.0.0 force field parameters.

  • Benchmarking the optimized force field against a test set of , , , and data points measured for an array of alcohols, esters and acids (here (x) is used to indicate a property measured as a function of the composition of a binary mixture). This will be extended to other functionalities, such as amines, as the study starts to generalise.

Note: The scripts created to facilitate this study can be found in the nistdataselection repository in the studies folder.

Compound Selection

We enforce the criterion that all compounds selected for the studies must have available all of , , and (possibly obtained through the conversion of and where is not directly available) data.

While it has been hypothesized that we should only be interested in choosing molecules so as to achieve a uniform coverage of SMIRKS patterns, this may be problematic. The same SMIRKS pattern (especially the ones without specificity) can match multiple different chemical environments within molecules, and may require different data to be more effectively constrained. This would likely not be an issue in the regime of an excess of data, but given the scarcity of data (especially diverse data) it would be good to remove the possibility of this having an effect.

Alcohol-Ester (+Acid)

The initial data set selection involved selecting a set of six common alcohols (methanol, ethanol, propanol, isopropanol, isobutanol and tert-butanol), and for each alcohol selecting approximately two esters, one larger and one smaller, for which both and data are obtainable. The components selected according to these criteria are then intended to be used as the components in the pure data only study.

Extension to Extra Functionalities

To ensure that the initial results generalise to other classes of systems, we aim to extend the study from alcohols and esters (two polar classes of molecules, which are H-bond donors / acceptors and H-bond acceptors respectively) to a broader spectrum of functionalities.

In choosing the extra data, we again enforce the restriction that all molecules chosen must have available all of , , . Given difficulties in optimising against as highlighted by the alcohol-ester studies, we choose to remove this property from the optimisations and only consider it as part of benchmarking.

The number of systems which meet these criteria is limited, and will restrict which extra functionalities may be included. Ideally, at least three extra classes of mixtures will be considered:

  • Mixtures of a polar and a (much) less polar component. This will complement the polar-polar alcohol and ester mixtures and ensure that the results are not strictly limited to more miscible systems (this could possibly include systems where each component is an H-bond acceptor).

  • Mixtures of two relatively non-polar liquids. Such systems should be relatively miscible but in a different way to polar mixtures (mainly VdW interactions as opposed to strong H-bonds).

Table 1. The number of unique systems for which the different types of data (and combinations of such) are available. The table includes counts for all combinations of mixtures containing a spectrum of different functionalities (ester, ketone, etc.). Only mixture types with at least five systems that have both Hmix(x) and rho(x) data are shown.

Environment 1   Environment 2   Hmix(x)   rho(x)   Hmix(x) + rho(x)
ester           halogenated     123       184      123
ester           alkane          104       92       73
ether           aromatic        46        143      30
alcohol         ester           48        134      28
alcohol         aromatic        49        212      25
alcohol         heterocycle     28        146      24
alcohol         ether           36        136      23
aromatic        heterocycle     30        100      23
alcohol         alkane          38        105      21
aromatic        aromatic        49        121      21
halogenated     amide           24        37       20
aromatic        alkane          35        130      16
ketone          heterocycle     17        28       15
alcohol         alcohol         44        193      15
ether           halogenated     17        66       13
amine           aromatic        14        35       12
halogenated     heterocycle     21        53       11
ketone          amine           20        13       10
ether           alkane          34        51       9
ether           ketone          9         14       9
amine           heterocycle     11        45       8
amine           alkane          16        39       8
heterocycle     heterocycle     11        16       7
ester           ester           10        25       6
amide           aromatic        13        62       6
alcohol         amine           10        125      6
alcohol         halogenated     7         92       6
ester           aromatic        7         99       6
heterocycle     alkane          22        40       5
alcohol         ketone          38        29       5

As shown in Table 1, there is only a limited selection of mixture types for which the requisite data is available. In particular, there is only a limited selection of mixtures for which there are 10 or more unique substances (the number used for the alcohol-ester study). The aim is to include at least 10 (or as close to this as possible) unique substances per set of interaction types in an attempt to ensure the results are significant.

As such, the most promising functionalities to include which meet the above three classes of mixtures would be:

  • mixtures of ethers and ketones (acceptors only).

  • mixtures of alkanes and alcohols (less polar and polar).

  • mixtures of alkanes and ethers (apolar and only marginally polar).

The available molecules to select from are listed in the following attachments:

Parameters to Optimize

Alcohol-Ester (+Acid)

Data sets containing only alcohols, esters and acids (containing only C, H and O) will exercise a total of 18 SMIRNOFF parameters (9 different SMIRKS patterns). For these studies we will keep fixed any overly generic hydrogen parameters, as well as the [#1:1]-[#8] parameter, which should stay fixed at epsilon=0.0, allowing a total of 12 parameters (for 6 different SMIRKS patterns) to be optimized.

Exercised but Won’t be Refit:

  • [#1:1]-[#6X4]-[#7,#8,#9,#16,#17,#35]

  • [#1:1]-[#6X3](~[#7,#8,#9,#16,#17,#35])~[#7,#8,#9,#16,#17,#35]

  • [#1:1]-[#8]

Will be Refit:

  • [#1:1]-[#6X4]

  • [#6:1]

  • [#6X4:1]

  • [#8:1]

  • [#8X2H0+0:1]

  • [#8X2H1+0:1]
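The parameter counts quoted above can be checked with a short sketch. It assumes that each vdW SMIRKS pattern contributes two trainable attributes, here called epsilon and rmin_half; the attribute names and the helper function are illustrative and are not taken from the study scripts.

```python
# SMIRKS patterns copied from the two lists above.
FIXED_SMIRKS = [
    "[#1:1]-[#6X4]-[#7,#8,#9,#16,#17,#35]",
    "[#1:1]-[#6X3](~[#7,#8,#9,#16,#17,#35])~[#7,#8,#9,#16,#17,#35]",
    "[#1:1]-[#8]",
]
REFIT_SMIRKS = [
    "[#1:1]-[#6X4]",
    "[#6:1]",
    "[#6X4:1]",
    "[#8:1]",
    "[#8X2H0+0:1]",
    "[#8X2H1+0:1]",
]

# Assumption: two trainable attributes per vdW pattern.
ATTRIBUTES = ("epsilon", "rmin_half")


def trainable_parameters(smirks_patterns):
    """Return every (smirks, attribute) pair for the given patterns."""
    return [(smirks, attribute)
            for smirks in smirks_patterns
            for attribute in ATTRIBUTES]


exercised = trainable_parameters(FIXED_SMIRKS + REFIT_SMIRKS)
refit = trainable_parameters(REFIT_SMIRKS)

print(len(exercised), "parameters exercised,", len(refit), "to be refit")
```

Under this assumption the 9 exercised patterns give 18 parameters, of which the 6 refit patterns account for 12.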

Only Pure Data

We will perform a set of optimizations including only pure densities and enthalpies of vaporization as our baseline study, given that this is what has been historically used in force field development.

We will train against the same set of compounds as we use in the mixture study, using a single pure density and a single enthalpy of vaporization data point for each compound, measured at as close to ambient conditions as possible in an attempt to keep variability as low as possible. This decision was made in order to keep the data set consistent, and to keep the focus on the predictive power of the respective data types rather than on which molecules were included.

While density data for these compounds will be sourced from ThermoML, no corresponding enthalpy of vaporization data is available within that data collection. As such, we intend to source the data externally, using DIPPR as a guide to which data to select, but ultimately retrieving data directly from the original publication so as to avoid infringement issues.

While this training set will arguably be much smaller than the mixture data set, and hence may lead to overfitting, it is challenging to expand this set much further while maintaining the desired systematic nature of the study. Further, any extra data to include in this study would need to be manually sourced from the literature given the unavailability of data in ThermoML. In principle, however, this baseline study can be expanded to include compounds outside of the mixture data training sets if it is found that the pure training set is too small to constitute a fair study.

Chosen Alcohol-Ester Data Set

 

Only Mixture Data

We will conduct three sets of independent optimizations on:

  1. a training set which includes only binary + data points

  2. a training set which includes only binary + and data points

  3. a training set which includes only binary + data points

In principle and + should provide the same information content (given that they are linear combinations of each other), however they will differ slightly in their contributions to a least squares objective function (and gradient thereof), whereby the information content of + may be higher as it explicitly includes contributions of pure densities, while these are implicit in the case of . By doing both optimizations we aim to determine whether there is any practical difference between the two.
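To make the relationship between these quantities concrete, the sketch below computes an excess molar volume from a mixture density and the two pure component densities using the standard thermodynamic definition. It assumes that the quantities being compared here are the excess molar volume and the combination of mixture and pure densities; the function name and the numbers in the usage example are illustrative only and are not experimental data.

```python
def excess_molar_volume(x_1, molar_mass_1, molar_mass_2,
                        density_mixture, density_1, density_2):
    """Excess molar volume of a binary mixture.

    x_1 is the mole fraction of component 1, molar masses are in g / mol
    and densities in g / cm^3, so the result is in cm^3 / mol.
    """
    x_2 = 1.0 - x_1

    # Molar volume of the real mixture.
    mixture_volume = (x_1 * molar_mass_1 + x_2 * molar_mass_2) / density_mixture

    # Molar volume of the ideal mixture built from the pure components.
    ideal_volume = (x_1 * molar_mass_1 / density_1
                    + x_2 * molar_mass_2 / density_2)

    return mixture_volume - ideal_volume


# Illustrative (made up) values for an equimolar mixture.
value = excess_molar_volume(
    x_1=0.5, molar_mass_1=32.04, molar_mass_2=88.11,
    density_mixture=0.875, density_1=0.79, density_2=0.90,
)
print(f"V_excess = {value:.3f} cm^3 / mol")
```

The pure densities enter the excess molar volume only through the ideal term, which is why they are implicit when training on the excess property and explicit when the densities are included as separate data points.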

The + without study will enable us to explore whether the pure densities contribute significantly to constraining the optimization, or whether including by itself would be sufficient. If this is the case:

  • optimizations would require less experimental data as we avoid needing pure densities.

  • we would be able to use in place of for which there is less data available.

  • optimizations would require fewer simulations.

Initially we plan to include 10 pairs of molecules in the training set (~2 different esters for each alcohol), and 3 different compositions (25%, 50% and 75%) per pair of molecules (~60 data points in total). This may be expanded if it is found that this set is too small to offer any significant insight.

Chosen Alcohol-Ester Data Set

Pure + Mixture Data

We will conduct three sets of independent optimizations on:

  1. a training set which includes only binary + and + data points

  2. a training set which includes only binary + and + data points

  3. a training set which includes only binary + and data points

Note: We may not end up conducting a number of these studies depending on the results of the Only Mixture Data studies.

We aim to use these sub-studies to explore whether pure data is needed to sufficiently constrain the optimization when mixture data is included, with a focus on whether it is important to include to constrain cohesive energies, or whether is sufficient.

The training set for these studies would be the union of the training sets used in the Only Pure Data and Only Mixture Data studies.

Chosen Alcohol-Ester Data Set

Initial Benchmark Set Selection

The initial benchmark set is expected to be modest in size (~50 pure data points, ~120 mixture data points) so as to be able to rapidly assess the performance of optimisations, but will be complemented by further sets depending on the outcomes of the initial set.

  • We will be less systematic in the selection of the systems to include in the benchmark set, opting instead to curate a diverse set of molecules with pure density and enthalpy of vaporization data points, and binary enthalpy of mixing, excess molar volume and mass density data points, without enforcing that substances must have all such data available to be included (as was the case for the training sets).

  • In order to test how well each of the different produced force fields generalise, we initially aim to include binary mixtures of alcohols and alcohols, alcohols and esters (/ acids), and esters (/acids) and esters (/acids).

    • This will likely be expanded to ethers and other such additional moieties, however this will be done after this initial set has been benchmarked against.

  • In an attempt to ensure that we are testing the performance of the refit parameters, rather than the full Parsley 1.0.0 force field, we will exclude any

    • aromatic compounds

    • compounds containing 3-4 membered rings.

    • compounds containing alkane chains greater than 6 atoms in length.

    again, this will likely be relaxed in future benchmark sets.

  • This set will only contain mixtures whereby neither of the components appears in the training set. Future data sets may then be complemented with mixtures which do partially contain training data, to further explore interesting results highlighted by this initial set.

Chosen Alcohol-Ester Study Benchmark Set

The results shown on this page were generated against a data set which contained

  • ~110 pure data points (~60:40 split of and , ambient conditions) where none of the substances appeared in the training set.

  • ~320 mixture data points (with roughly equal numbers of , and data points, ambient conditions, three compositions (~25%, ~50%, ~75%) per pair). This included mixtures where neither component appeared in the training set, and alcohol-alcohol and ester-ester mixtures where both components appeared in the training set (the training set only included alcohol-ester mixtures).

Extended Benchmark Set

Similar to the selection of the benchmark set chosen for the initial alcohol and ester study, when selecting the extended benchmark set we opted to be less systematic in choosing the individual substances (namely, we did not require that each substance had available all of the properties of interest) and instead, focused on trying to choose as diverse a set as was possible (given a limited diversity of the available data) which maximally exercised the refit parameters.

We endeavoured to select a set which:

  • did not contain exactly the same substances as were trained on, but did allow substances whose individual components appeared in the training set (e.g. if the training set included a mixture of pentanol and hexane, the test set would be allowed to contain pentanol and heptane, or hexanol and hexane, but not pentanol and hexane).

  • contained substances as distinct as possible from the training set, and from other molecules in the test set.

The test set is to contain , , , and data points for substances (both pure and binary) composed only of alkanes, alcohols, ethers, esters and ketones.

The selection of the benchmark set proceeds as follows:

  1. Filter out all of the data points which were measured for:

    1. substances that were included in the training set.

    2. molecules not composed of C, O and H.

    3. molecules with undefined stereochemistry.

    4. long chain ethers or alkanes (these are difficult to pack into a simulation box and in general take longer to simulate).

    5. molecules which contain 1,3-dicarbonyl functionality where at least one of the carbonyl groups is a ketone. Such substances will likely exist as mixtures of keto-enol tautomers, but the ratio in which they are present isn’t recorded by ThermoML.

  2. Cluster all of the available , , data points based on the chemical environments present in the substances that the data was measured for (e.g. cluster all ether-ester data, ketone-alcohol data etc.)

  3. For each type of property (e.g. ) and each pair of environments (e.g. alcohol-ester):

    1. Select the substance which is ‘most distinct’ from the currently selected training set and the molecules selected for the test set of this property and environment.

    2. Repeat the previous step until either 10 substances have been selected, or there are no other substances to select from.

Defining ‘most distinct’

To determine how similar a substance is to another set of substances, we defined a distance metric based on a substance fingerprint.

For any binary substance composed of components a and b (represented as [a, b]), the substance’s fingerprint is defined as [f(a), f(b)] where f(x) is a function which computes the OpenEye Tree fingerprint of a molecule x.

The distance between any two substances ([a_1, b_1], [a_2, b_2]) is then defined as

min(d(f(a_1), f(a_2)) + d(f(b_1), f(b_2)),
    d(f(a_1), f(b_2)) + d(f(b_1), f(a_2)))

where d is the OETanimoto distance between two fingerprints.
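A minimal sketch of such a pairwise distance is given below. It does not use the OpenEye toolkit: fingerprints are represented as plain Python sets of "on" bits, and the Tanimoto distance is taken as one minus the Tanimoto similarity. The sketch assumes the minimum is taken over the two possible ways of pairing up the components, so that the result does not depend on the order in which components are listed. All function names are hypothetical.

```python
def tanimoto_distance(fingerprint_a, fingerprint_b):
    """1 - Tanimoto similarity for fingerprints stored as sets of on-bits."""
    union = fingerprint_a | fingerprint_b

    if not union:
        # Two empty fingerprints are treated as identical.
        return 0.0

    return 1.0 - len(fingerprint_a & fingerprint_b) / len(union)


def substance_distance(substance_1, substance_2):
    """Distance between two binary substances.

    Each substance is a pair of component fingerprints. Both ways of
    matching the components are tried and the smaller total is returned.
    """
    a_1, b_1 = substance_1
    a_2, b_2 = substance_2

    direct = tanimoto_distance(a_1, a_2) + tanimoto_distance(b_1, b_2)
    swapped = tanimoto_distance(a_1, b_2) + tanimoto_distance(b_1, a_2)

    return min(direct, swapped)


# Toy fingerprints (sets of bit indices), purely for illustration.
mixture_1 = (frozenset({1, 2, 3}), frozenset({7, 8}))
mixture_2 = (frozenset({7, 8}), frozenset({1, 2, 3}))  # same, components swapped
mixture_3 = (frozenset({1, 2}), frozenset({9}))

print(substance_distance(mixture_1, mixture_2))
print(substance_distance(mixture_1, mixture_3))
```

With this definition the same mixture listed in a different component order has a distance of zero, and the largest possible distance between two binary substances is 2.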

The distance between a given substance mixture_a and a set of mixtures mixture_set is then computed by:

  1. Computing the distance between mixture_a and each mixture in mixture_set.

  2. Removing the mixture from mixture_set which is closest to mixture_a and adding the distance between the two to the total distance.

  3. Repeating step 2 until all mixtures have been removed from mixture_set.

The most distinct molecule from the unselected set is thus chosen by:

  1. Selecting the molecule which is 'furthest' away from both the training set and the currently selected test set (which starts off empty), where the distance is defined as:

    sqrt(compute_distance_with_set(unselected_substance, training_set) ** 2 +
    compute_distance_with_set(unselected_substance, test_set) ** 2)

  2. Moving the selected molecule from the unselected set into the test set.

  3. Repeating steps 1 and 2 until either the target number of molecules has been selected, or there are no more unselected molecules to choose from.
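The selection loop above can be sketched as follows. The sketch takes the pairwise distance as an argument rather than computing fingerprints itself, and it treats the distance to a set as the plain sum of the pairwise distances, since the remove-the-closest procedure visits every member of the set exactly once. The function names mirror those used in the text but the implementation is an illustrative assumption, not the code from the study scripts.

```python
import math


def compute_distance_with_set(substance, substance_set, distance):
    """Total distance between a substance and every member of a set."""
    return sum(distance(substance, other) for other in substance_set)


def select_most_distinct(candidates, training_set, n_select, distance):
    """Greedily pick the candidates furthest from the training and test sets."""
    unselected = list(candidates)
    test_set = []  # starts off empty

    while unselected and len(test_set) < n_select:

        def score(substance):
            # Combine the distance to the training set and to the test set.
            return math.hypot(
                compute_distance_with_set(substance, training_set, distance),
                compute_distance_with_set(substance, test_set, distance),
            )

        chosen = max(unselected, key=score)

        # Move the chosen substance from the unselected set to the test set.
        unselected.remove(chosen)
        test_set.append(chosen)

    return test_set


# Toy example: "substances" are points on a line and the distance is the
# absolute difference between them.
selected = select_most_distinct(
    candidates=[1.0, 5.0, 9.0],
    training_set=[0.0],
    n_select=2,
    distance=lambda a, b: abs(a - b),
)
print(selected)
```

In the toy example the point furthest from the training set is picked first, after which the second pick is pushed away from both the training set and the point already selected.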

The pure substances to include for , , were then mainly chosen as the components selected as part of the mixture properties where available, and components close to those where not possible.

Chosen Set

The final set contains ~48 pure data points and ~900 mixture data points.

Questions we want to answer via benchmarking/what benchmark sets can we use to achieve this?

  • In general, to what extent are LJ parameters trained on mixture data transferable to other sets of mixture?

    • This can be explored using the first mixture benchmark set (MB1), which consists of heat of vaporization, mixture density, and excess molar volume data for alcohol/ester, alcohol/alcohol, alcohol/acid, ester/acid, acid/acid and alcohol/ether mixtures that have nothing in common with the mixtures that we trained on.

  • If we train LJ parameters on one type of mixture, how well do those parameters transfer to other types of mixture?

    • Since we included only alcohol/ester mixtures in our training data, MB1 will allow us to look at the transferability of LJ parameters to other types of mixtures.

      • Alcohol/alcohol mixtures: We train on mixtures that contain alcohols, but only on mixtures of alcohols with esters. To what extent do the alcohol LJ parameters transfer to mixtures that only include alcohols?

      • Alcohol/ether mixtures: To what extent do the LJ parameters for alcohols, trained on ester mixtures, transfer to mixtures of alcohols and ethers (which have not been trained on at all)? Ethers should be similar to esters, so if they do quite poorly, this will be an issue.

  • How do benchmarked mixture properties vary as a function of composition?

    • Through benchmarking, can we identify the extent that transferability affects mixture properties as a function of composition? For example, if we test against alcohol/ether mixtures and we see worse performance as the ether concentration increases, then maybe the alcohol parameters are good, but the ester parameters don’t transfer well to ethers.

    • MB1 will allow us to explore this, since we should have good coverage in the range xA=0.2-0.8. We should also consider adding some points in the 0.9-0.95 mole fraction region to check that behavior. This could either be a separate set, or just something we break out in MB1.

  • Can we identify a “spectrum of transferability” for parameters in mixtures?

    • For example, within a benchmark set composed of mixtures that have at least one component in the training set (MB2), there are a large number of mixtures with tert-butanol. Assuming that tert-butanol is parameterized reasonably well, by examining the mixture properties and chemical similarity of the other moieties in these tert-butanol sets, can we identify how different a mixture can be before the transferability starts to degrade?

  • To what extent are mixture properties transferable from training on pure properties only? To what extent are pure properties transferable from training on mixture properties only?

    • Can we get a sense of the correlation between performance on mixture properties and pure properties (does low error in pure density imply low error in mixture density)? By benchmarking sets parameterized with only pure and only mixture data on MB1 and PB1 (the basic pure data benchmark set), we can analyze this.

  • How do parameters trained on mixture densities perform on excess molar volumes?

    • By looking at the subset of MB1 that includes excess molar volumes, can we accurately reproduce these by training on mixture densities? If we can’t, that may point to excess molar volumes not being very useful for us.

References

[1] Majer, V., Svoboda, V., Enthalpies of Vaporization of Organic Compounds (IUPAC Chemical Data)