Team

Simon Boothroyd, Owen Madin

Status

GitHub

https://github.com/openforcefield/nistdataselection/projects/1

Scripts / Data

Overview

In this study we aim to understand more rigorously whether it is more beneficial to optimize the non-bonded interaction parameters of a force field against solely pure data, solely binary mixture data, or a combination of both, with an emphasis here on density (including density and excess molar volume) and enthalpy (including enthalpy of vaporization and enthalpy of mixing) data.

We anticipate that training a force field on mixture data will improve its performance at reproducing mixture properties while slightly degrading its performance on pure properties. Vice versa, we would expect that training on pure properties would improve its performance on pure properties while slightly degrading its performance on mixture properties. Here we aim to identify how much mixture properties improve relative to the degradation of pure properties when training on mixtures, compared to how much pure properties improve relative to the degradation of mixture properties when training on pure properties.

In an attempt to make this study as systematic as possible we have chosen to design it as if we had all of the data of interest available to us, rather than just including all of the data we currently have available. This will involve sourcing data from outside of ThermoML (mainly from the sources reviewed in [1]). While ThermoML includes a significant amount of information, it does not cover a particularly diverse region of chemical space, nor does it contain large quantities of certain properties which have historically been important in force field optimization (namely enthalpies of vaporization). This should enable the results and understanding generated here to inform longer-term planning of what data we should focus on collecting (either from the literature, our industry partners, or in house via directed, automated experiments).

The intention is to keep the scope of the study as tight as possible so as to more readily enable the deconvolution of the effects of including different data types. As such, this study will initially be limited to systems composed of only alcohols and esters (the two species for which we have the most density and enthalpic mixture data available), and to data points measured at close to ambient conditions. This will then be expanded to other mixtures to ensure that the initial results generalise to other types of systems.

In general the studies proposed will proceed by:

Note: The scripts created to facilitate this study can be found in the nistdataselection repository in the studies folder.

Compound Selection

We enforce the criteria that all compounds selected for the studies must have available all of , , and (possibly obtained through the conversion of and where is not directly available) data.

While it has been hypothesized that we should only be interested in choosing molecules so as to achieve a uniform coverage of SMIRKS patterns, this may be problematic. The same SMIRKS pattern (especially the ones without specificity) can match multiple different chemical environments within molecules, and may require different data to be more effectively constrained. This would likely not be an issue in the regime of an excess of data, but given the scarcity of data (especially diverse data) it would be good to remove the possibility of this having an effect.

Alcohol-Ester (+Acid)

The initial data set selection involved selecting a set of six common alcohols (methanol, ethanol, propanol, isopropanol, isobutanol and tert-butanol), and for each alcohol selecting approximately two esters, one larger and one smaller, for which both and data is obtainable. The components selected according to this criteria are then intended to be used as the components in the pure data only study.

Extension to Extra Functionalities

To ensure that the initial results generalise to other classes of systems, we aim to extend the study from alcohols and esters (two polar classes of molecules, which are hydrogen-bond donors/acceptors and hydrogen-bond acceptors respectively) to a broader spectrum of functionalities.

In choosing the extra data, we again enforce the restriction that all molecules chosen must have available all of , , . Given difficulties in optimising against as highlighted by the alcohol-ester studies, we choose to remove this property from the optimisations and only consider it as part of benchmarking.

The number of systems which meet these criteria is limited, and this will restrict which extra functionalities may be included. Ideally, at least three extra classes of mixtures will be considered:

Table 1. The number of unique systems for which the different types of data (and combinations of such) are available. The table includes counts for all combinations of mixtures containing a spectrum of different functionalities (ester, ketone, etc.). Only mixture types with at least five systems for which both Hmix(x) and rho(x) data are available are shown.

Environment 1    Environment 2    Hmix(x)    rho(x)    Hmix(x) + rho(x)
ester            halogenated      123        184       123
ester            alkane           104        92        73
ether            aromatic         46         143       30
alcohol          ester            48         134       28
alcohol          aromatic         49         212       25
alcohol          heterocycle      28         146       24
alcohol          ether            36         136       23
aromatic         heterocycle      30         100       23
alcohol          alkane           38         105       21
aromatic         aromatic         49         121       21
halogenated      amide            24         37        20
aromatic         alkane           35         130       16
ketone           heterocycle      17         28        15
alcohol          alcohol          44         193       15
ether            halogenated      17         66        13
amine            aromatic         14         35        12
halogenated      heterocycle      21         53        11
ketone           amine            20         13        10
ether            alkane           34         51        9
ether            ketone           9          14        9
amine            heterocycle      11         45        8
amine            alkane           16         39        8
heterocycle      heterocycle      11         16        7
ester            ester            10         25        6
amide            aromatic         13         62        6
alcohol          amine            10         125       6
alcohol          halogenated      7          92        6
ester            aromatic         7          99        6
heterocycle      alkane           22         40        5
alcohol          ketone           38         29        5

As shown in Table 1, only a limited selection of mixture types has the requisite data available. In particular, few mixture types offer 10 or more unique substances (the number used for the alcohol-ester study). The aim is to include at least 10 unique substances (or as close to this as possible) per set of interaction types in an attempt to ensure the results are significant.

As such, the most promising functionalities to include, which meet the above three classes of mixtures, would be:

The available molecules to select from are listed in the following attachments:

Parameters to Optimize

Alcohol-Ester (+Acid)

Data sets containing only alcohols, esters and acids (containing only C, H and O) will exercise a total of 18 SMIRNOFF parameters (9 different SMIRKS patterns). For these studies we will keep fixed any overly generic hydrogen parameters, as well as the [#1:1]-[#8] parameter, which should stay fixed at epsilon=0.0, leaving a total of 12 parameters (for 6 different SMIRKS patterns) to be optimized.

Exercised but Won’t be Refit:

Will be Refit:

Only Pure Data

We will perform a set of optimizations including only pure densities and enthalpies of vaporization as our baseline study, given that this is what has been historically used in force field development.

We will train against the same set of compounds as we use in the mixture study, using a single density and a single enthalpy of vaporization data point for each compound, measured as close to ambient conditions as possible in an attempt to keep variability as low as possible. This decision was made in order to keep the data set consistent, and to keep the focus on the predictive power of the respective data types rather than on which molecules were included.

While density data for these compounds will be sourced from ThermoML, no corresponding enthalpy of vaporization data is available within that data collection. As such, we intend to source the data externally, using DIPPR as a guide to which data to select, but ultimately retrieving the data directly from the original publication so as to avoid infringement issues.

Arguably this training set will be much smaller than the mixture data set, and hence may lead to overfitting, but it is challenging to expand this set much further while maintaining the desired systematic nature of the study. Further, any extra data to include in this study would need to be manually sourced from the literature given the unavailability of data in ThermoML. In principle, however, this baseline study can be expanded to include compounds outside of the mixture data training sets if it is found that the pure training set is too small to constitute a fair study.

Chosen Alcohol-Ester Data Set

Only Mixture Data

We will conduct three sets of independent optimizations on:

  1. a training set which includes only binary + data points

  2. a training set which includes only binary + and data points

  3. a training set which includes only binary + data points

In principle and + should provide the same information content (given that they are linear combinations of each other), however they will differ slightly in their contributions to a least squares objective function (and gradient thereof), whereby the information content of + may be higher as it explicitly includes contributions of pure densities, while these are implicit in the case of . By doing both optimizations we aim to determine whether there is any practical difference between the two.
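The relationship between the two kinds of data can be illustrated with a short sketch. It assumes that the quantities being compared are the excess molar volume of a binary mixture on the one hand, and the mixture density together with the two pure-component densities on the other. The function name, argument names and units are illustrative and are not taken from the study's scripts. Note that the excess volume is linear in the molar volumes (i.e. in the reciprocal densities) rather than in the densities themselves.

```python
def excess_molar_volume(x_1, molar_mass_1, molar_mass_2, rho_mix, rho_1, rho_2):
    """Return the excess molar volume (cm^3 / mol) of a binary mixture.

    x_1            mole fraction of component 1 (component 2 is 1 - x_1)
    molar_mass_*   molar masses of the pure components (g / mol)
    rho_mix        density of the mixture at composition x_1 (g / cm^3)
    rho_1, rho_2   densities of the pure components (g / cm^3)
    """
    x_2 = 1.0 - x_1

    # Molar volume of the real mixture.
    v_mixture = (x_1 * molar_mass_1 + x_2 * molar_mass_2) / rho_mix
    # Molar volume of the ideal mixture, built from the pure molar volumes.
    v_ideal = x_1 * molar_mass_1 / rho_1 + x_2 * molar_mass_2 / rho_2

    return v_mixture - v_ideal


# Placeholder numbers chosen only to exercise the function; they are not
# experimental values.
print(excess_molar_volume(0.5, 50.0, 50.0, rho_mix=1.25, rho_1=1.0, rho_2=1.0))
```

In this form the pure densities enter the excess volume only through the ideal term, which is one way to read the statement that they are implicit in the excess property but explicit when the densities are included directly.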

The + without study will enable us to explore whether the pure densities contribute significantly to constraining the optimization, or whether including by itself would be sufficient. If this is the case:

Initially we plan to include 10 pairs of molecules in the training set (~2 different esters for each alcohol included), and 3 different compositions (25%, 50% and 75%) per pair of molecules (~60 data points in total). This may be expanded if it is found that this set is too small to offer any significant insight.

Chosen Alcohol-Ester Data Set

Pure + Mixture Data

We will conduct three sets of independent optimizations on:

  1. a training set which includes only binary + and + data points

  2. a training set which includes only binary + and + data points

  3. a training set which includes only binary + and data points

Note: We may not end up conducting a number of these studies depending on the results of the Only Mixture Data studies.

We aim to use these sub-studies to explore whether pure data is needed to sufficiently constrain the optimization when mixture data is included, with a focus on whether it is important to include to constrain cohesive energies, or whether is sufficient.

The training set for these studies would be the union of the training sets used in the Only Pure Data and Only Mixture Data studies.

Chosen Alcohol-Ester Data Set

Initial Benchmark Set Selection

The initial benchmark set is expected to be modest in size (~50 pure data points, ~120 mixture data points) so as to be able to rapidly assess the performance of optimisations, but will be complemented by further sets depending on the outcomes of the initial set.

Chosen Alcohol-Ester Study Benchmark Set

The results shown on this page were generated against a data set which contained

Extended Benchmark Set

As with the benchmark set chosen for the initial alcohol and ester study, when selecting the extended benchmark set we opted to be less systematic in choosing the individual substances (namely, we did not require that each substance had all of the properties of interest available). Instead, we focused on choosing as diverse a set as possible (given the limited diversity of the available data) which maximally exercised the refit parameters.

We endeavoured to select a set which:

The test set is to contain , , , and data points for substances (both pure and binary) composed only of alkanes, alcohols, ethers, esters and ketones.

The selection of the benchmark set proceeds as follows:

  1. Filter out all of the data points which were measured for:

    1. substances that were included in the training set.

    2. molecules not composed of C, O and H.

    3. molecules with undefined stereochemistry.

    4. long chain ethers or alkanes (these are difficult to pack into a simulation box and in general take longer to simulate).

    5. molecules which contain 1,3-dicarbonyl functionality where at least one of the carbonyl groups is a ketone. Such substances will likely exist as mixtures of keto-enol tautomers, but the ratio in which these are present isn’t recorded by ThermoML.

  2. Cluster all of the available , , data points based on the chemical environments present in the substances that the data was measured for (e.g. cluster all ether-ester data, ketone-alcohol data etc.)

  3. For each type of property and each pair of environments (e.g. alcohol-ester):

    1. Select the substance which is ‘most distinct’ from the currently selected training set and the molecules selected for the test set of this property and environment.

    2. Repeat step 1 until either 10 substances have been selected, or there are no other substances to select from.
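The clustering step of this procedure can be sketched as a simple grouping of data points by property type and by the unordered pair of chemical environments. The record layout used here (dictionaries with "property", "environments" and "substance" keys) is a hypothetical stand-in for whatever structure the study's scripts use, and the substance labels are placeholders.

```python
from collections import defaultdict


def cluster_by_environment(data_points):
    """Group data points by (property type, unordered environment pair)."""
    clusters = defaultdict(list)

    for point in data_points:
        # Sort the environments so that (ester, alcohol) and (alcohol, ester)
        # end up in the same cluster.
        key = (point["property"], tuple(sorted(point["environments"])))
        clusters[key].append(point)

    return dict(clusters)


# Hypothetical records used purely for illustration.
data_points = [
    {"property": "Hmix(x)", "environments": ("alcohol", "ester"), "substance": "substance_1"},
    {"property": "Hmix(x)", "environments": ("ester", "alcohol"), "substance": "substance_2"},
    {"property": "rho(x)", "environments": ("ether", "ketone"), "substance": "substance_3"},
]

for key, points in cluster_by_environment(data_points).items():
    print(key, [point["substance"] for point in points])
```

Each cluster would then be passed to the 'most distinct' selection described in the next section.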

Defining ‘most distinct’

To determine how similar a substance is to another set of substances, we defined a distance metric based on a substance fingerprint.

For any binary substance composed of components a and b (represented as [a, b]), the substance’s fingerprint is defined as [f(a), f(b)] where f(x) is a function which computes the OpenEye tree fingerprint of a molecule x.

The distance between any two substances ([a_1, b_1], [a_2, b_2]) is then defined as the smaller of the two possible ways of pairing their components:

min(d(f(a_1), f(a_2)) + d(f(b_1), f(b_2)),
    d(f(a_1), f(b_2)) + d(f(b_1), f(a_2)))

where d is the OETanimoto distance between two fingerprints.
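A minimal sketch of this substance distance is given below. To stay self-contained it does not call the OpenEye toolkit: each fingerprint is represented as a set of integer feature identifiers, and the distance d is taken to be one minus the Tanimoto similarity of two such sets. Both choices are assumptions made for illustration. The second term of the minimum pairs the components in the opposite order, which is assumed here to be the intent of the definition, since it makes the result independent of the order in which the components of a substance are listed.

```python
def tanimoto_distance(fingerprint_1, fingerprint_2):
    """One minus the Tanimoto similarity of two sets of feature ids."""
    union = len(fingerprint_1 | fingerprint_2)
    if union == 0:
        return 0.0
    return 1.0 - len(fingerprint_1 & fingerprint_2) / union


def substance_distance(substance_1, substance_2):
    """Distance between two binary substances, each given as a pair of
    component fingerprints (f(a), f(b))."""
    a_1, b_1 = substance_1
    a_2, b_2 = substance_2

    # Compare the components in the order given...
    direct = tanimoto_distance(a_1, a_2) + tanimoto_distance(b_1, b_2)
    # ...and with the components of the second substance swapped.
    swapped = tanimoto_distance(a_1, b_2) + tanimoto_distance(b_1, a_2)

    return min(direct, swapped)


# Toy fingerprints: the second substance lists the same components in the
# opposite order, so the distance is zero.
substance_x = (frozenset({1, 2}), frozenset({3, 4}))
substance_y = (frozenset({3, 4}), frozenset({1, 2}))
print(substance_distance(substance_x, substance_y))
```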

The distance between a given substance mixture_a and a set of mixtures mixture_set is then computed by:

  1. Computing the distance between mixture_a and each mixture in mixture_set

  2. Removing the mixture from mixture_set which is closest to mixture_a and adding the distance between the two to the total distance.

  3. Repeating step 2 until all mixtures have been removed from the mixture_set.
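The steps above can be sketched as follows. The function name matches the one used in the expression later in this section, but the extra distance argument (any callable returning the distance between two mixtures) is an assumption added to keep the example self-contained. Read literally, the procedure visits every mixture in the set exactly once, so the total equals the sum of the individual distances.

```python
def compute_distance_with_set(mixture_a, mixture_set, distance):
    """Accumulate the distance between mixture_a and every mixture in
    mixture_set, removing the closest remaining mixture at each step."""
    remaining = list(mixture_set)  # work on a copy; the input is left intact
    total_distance = 0.0

    while remaining:
        # Find the remaining mixture which is closest to mixture_a.
        closest = min(remaining, key=lambda mixture: distance(mixture_a, mixture))
        total_distance += distance(mixture_a, closest)
        remaining.remove(closest)

    return total_distance


# Toy example in which 'mixtures' are plain numbers and the distance is
# their absolute difference.
def toy_distance(x, y):
    return abs(x - y)


print(compute_distance_with_set(0, [6, 1, 3], toy_distance))
```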

The most distinct molecule from the unselected set is thus chosen by:

  1. Selecting the molecule which is 'furthest' away from both the training set and the currently selected test set (which starts off empty), where the distance is defined as:

    sqrt(compute_distance_with_set(unselected_substance, training_set) ** 2 +
    compute_distance_with_set(unselected_substance, test_set) ** 2)

  2. Moving the selected molecule from the unselected set into the test set.

  3. Repeating steps 1 and 2 until either the target number of molecules has
    been selected, or there are no more unselected molecules to choose from.
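This greedy selection can be sketched as below. The function name select_most_distinct and the distance_with_set argument (a callable taking a substance and a set of substances) are hypothetical, introduced so that the example runs without the study's own distance code.

```python
import math


def select_most_distinct(unselected, training_set, target_count, distance_with_set):
    """Greedily build a test set from the substances which are furthest
    from both the training set and the test set selected so far."""
    unselected = list(unselected)  # work on a copy
    test_set = []

    while unselected and len(test_set) < target_count:

        def combined_distance(substance):
            # sqrt(d_training ** 2 + d_test ** 2)
            return math.hypot(
                distance_with_set(substance, training_set),
                distance_with_set(substance, test_set),
            )

        chosen = max(unselected, key=combined_distance)

        unselected.remove(chosen)
        test_set.append(chosen)

    return test_set


# Toy example: substances are numbers, and the distance to a set is the
# sum of absolute differences to its members.
def toy_distance_with_set(substance, substance_set):
    return float(sum(abs(substance - other) for other in substance_set))


print(select_most_distinct([1, 5, 9], [0], 2, toy_distance_with_set))
```

In the toy example the first pick is the substance furthest from the training set, because the test set starts off empty and contributes nothing to the combined distance.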

The pure substances to include for the pure properties were then mainly chosen as the components selected for the mixture properties where data was available, and as components close to those where it was not.

Chosen Set

The final set contains ~48 pure data points and ~900 mixture data points.

Questions we want to answer via benchmarking/what benchmark sets can we use to achieve this?

References

[1] Majer, V., Svoboda, V., Enthalpies of Vaporization of Organic Compounds (IUPAC Chemical Data)