...

  • We will be less systematic in selecting the systems to include in the benchmark set. Instead, we aim to curate a diverse set of molecules with pure density and enthalpy of vaporization data points, and with binary enthalpy of mixing, excess molar volume, and binary mass density data points, without requiring that all of these properties be available for a substance to be included (as was the case for the training sets).

  • To test how well each of the produced force fields generalises, we initially aim to include binary mixtures of alcohols and alcohols, alcohols and esters (/acids), and esters (/acids) and esters (/acids).

    • This will likely be expanded to ethers and other additional moieties, but only after this initial set has been benchmarked against.

  • In an attempt to ensure that we are testing the performance of the refit parameters, rather than the full Parsley 1.0.0 force field, we will exclude any

    • aromatic compounds.

    • compounds containing 3-4 membered rings.

    • compounds containing alkane chains longer than 6 atoms.

    Again, this will likely be relaxed in future benchmark sets.

  • This set will only contain mixtures in which neither of the components appears in the training set. Future data sets may then be complemented with mixtures which do partially contain training data, to further explore interesting results highlighted by this initial set.

Chosen Alcohol-Ester

...

Study Benchmark Set

The results shown on this page were generated against a data set which contained

  • ~110 pure data points (~60:40 split of $\rho(pure)$ and $H_{vap}$, ambient conditions) where none of the substances appeared in the training set.

  • ~320 mixture data points (with roughly equal numbers of $H_{mix}$, $\rho(x)$ and $V_{excess}(x)$ data points, ambient conditions, three compositions (~25%, ~50%, ~75%) per pair). This included mixtures where neither component appeared in the training set, and alcohol-alcohol and ester-ester mixtures where both components appeared in the training set (the training set only included alcohol-ester mixtures).

View file: pure_components.pdf
View file: mixture_components.pdf
View file: full_set.csv

Extended Benchmark Set

As with the benchmark set chosen for the initial alcohol and ester study, when selecting the extended benchmark set we opted to be less systematic in choosing the individual substances (namely, we did not require that every property of interest be available for each substance). Instead, we focused on choosing as diverse a set as possible (given the limited diversity of the available data) which maximally exercised the refit parameters.

We endeavoured to select a set which:

  • did not contain any substance exactly as it was trained on, but did allow substances whose individual components appeared in the training set (e.g. if the training set included a mixture of pentanol and hexane, the test set would be allowed to contain pentanol and heptane, or hexanol and hexane, but not pentanol and hexane).

  • contained substances as distinct as possible from the training set, and from other molecules in the test set.
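The exclusion rule in the first bullet can be illustrated with a minimal sketch. It assumes a substance is represented as a collection of component names whose order does not matter; the function name `is_allowed_in_test_set` is hypothetical and not part of any existing tooling.

```python
def is_allowed_in_test_set(substance, training_substances):
    """Return True unless this exact substance was trained on.

    A substance is treated as an unordered collection of component names,
    so ("hexane", "pentanol") and ("pentanol", "hexane") are the same substance.
    Sharing individual components with the training set is allowed.
    """
    trained = {frozenset(trained_substance) for trained_substance in training_substances}
    return frozenset(substance) not in trained


# Hypothetical training set containing a single binary mixture.
training = [("pentanol", "hexane")]

allowed_shared_alcohol = is_allowed_in_test_set(("pentanol", "heptane"), training)
allowed_shared_alkane = is_allowed_in_test_set(("hexanol", "hexane"), training)
allowed_exact_match = is_allowed_in_test_set(("hexane", "pentanol"), training)
```

Using `frozenset` makes the comparison insensitive to the order in which the components are listed.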

The test set is to contain $\rho(x)$, $\rho_{pure}$, $H_{vap}$, $H_{mix}$ and $V_{excess}$ data points for substances (both pure and binary) composed only of alkanes, alcohols, ethers, esters and ketones.

The selection of the benchmark set proceeds as follows:

  1. Filter out all of the data points which were measured for:

    1. substances that were included in the training set.

    2. molecules not composed of C, O and H.

    3. molecules with undefined stereochemistry.

    4. long-chain ethers or alkanes (these are difficult to pack into a simulation box and in general take longer to simulate).

    5. molecules which contain 1,3-dicarbonyl functionality where at least one of the carbonyl groups is a ketone. Substances containing such molecules will likely be mixtures of keto-enol tautomers, but the ratio in which the tautomers are present isn’t recorded by ThermoML.

  2. Cluster all of the available $\rho(x)$, $H_{mix}$ and $V_{excess}$ data points based on the chemical environments present in the substances that the data was measured for (e.g. cluster all ether-ester data, ketone-alcohol data, etc.).

  3. For each type of property (e.g. $\rho(x)$) and each pair of environments (e.g. alcohol-ester):

    1. Select the substance which is ‘most distinct’ from the currently selected training set and from the molecules already selected for the test set of this property and environment.

    2. Repeat step 1 until either 10 substances have been selected, or there are no other substances to select from.
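The clustering in step 2 can be sketched as a simple grouping by property type and environment pair. This is only an illustration: the record layout (the `property`, `environments` and `substance` keys) and the example data are assumptions, and assigning chemical environments to a molecule is taken as already done.

```python
from collections import defaultdict


def cluster_by_environments(data_points):
    """Group substances by (property type, unordered pair of environments)."""
    clusters = defaultdict(set)
    for point in data_points:
        # Sort so that alcohol-ester and ester-alcohol land in the same cluster.
        key = (point["property"], tuple(sorted(point["environments"])))
        clusters[key].add(tuple(sorted(point["substance"])))
    return dict(clusters)


# Hypothetical data points, for illustration only.
data_points = [
    {"property": "rho(x)", "environments": ("ester", "alcohol"),
     "substance": ("methyl acetate", "ethanol")},
    {"property": "rho(x)", "environments": ("alcohol", "ester"),
     "substance": ("ethanol", "methyl acetate")},
    {"property": "H_mix", "environments": ("ketone", "alcohol"),
     "substance": ("acetone", "ethanol")},
]

clusters = cluster_by_environments(data_points)
```

Each resulting cluster is then the pool from which up to 10 substances are selected in step 3.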

Defining ‘most distinct’

To determine how similar a substance is to another set of substances, we defined a distance metric based on a substance fingerprint.

For any binary substance composed of components a and b (represented as [a, b]), the substance’s fingerprint is defined as [f(a), f(b)], where f(x) is a function which computes the OpenEye Tree fingerprint of a molecule x.

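The distance between any two substances ([a_1, b_1], [a_2, b_2]) is then defined as

min(d(f(a_1), f(a_2)) + d(f(b_1), f(b_2)),
    d(f(a_1), f(b_2)) + d(f(b_1), f(a_2)))

where d is the OETanimoto distance between two fingerprints. Taking the minimum over both pairings makes the distance independent of the order in which the components are listed.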
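A minimal sketch of this substance distance is given below. It makes two assumptions. First, fingerprints are stood in for by plain Python sets of feature labels, with d taken as one minus the Tanimoto similarity, rather than by OpenEye Tree fingerprints. Second, the second pairing is taken to be the crossed one (a_1 with b_2, b_1 with a_2), so that the result does not depend on the order of the components. The function names are hypothetical.

```python
def tanimoto_distance(fingerprint_1, fingerprint_2):
    """One minus the Tanimoto similarity of two sets of features."""
    union = fingerprint_1 | fingerprint_2
    if not union:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(fingerprint_1 & fingerprint_2) / len(union)


def substance_distance(substance_1, substance_2):
    """Distance between two binary substances given as fingerprint pairs."""
    (a_1, b_1), (a_2, b_2) = substance_1, substance_2
    straight = tanimoto_distance(a_1, a_2) + tanimoto_distance(b_1, b_2)
    crossed = tanimoto_distance(a_1, b_2) + tanimoto_distance(b_1, a_2)
    return min(straight, crossed)


# Toy feature sets standing in for real fingerprints (assumption).
alcohol = {"C", "O", "OH"}
ester = {"C", "O", "C=O"}

same_mixture_reordered = substance_distance((alcohol, ester), (ester, alcohol))
alcohol_ester_vs_alcohol_alcohol = substance_distance((alcohol, ester), (alcohol, alcohol))
```

With real molecules, the sets would be replaced by the fingerprint objects and `tanimoto_distance` by a distance derived from the toolkit's own similarity function.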

The distance between a given substance mixture_a and a set of mixtures mixture_set is then computed by:

  1. Computing the distance between mixture_a and each mixture in mixture_set

  2. Removing the mixture which is closest to mixture_a from mixture_set, and adding the distance between the two to the total distance.

  3. Repeating step 2 until all mixtures have been removed from the mixture_set.
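The three steps above can be sketched as follows. The pairwise `distance` function is passed in as a parameter (an assumption made to keep the sketch self-contained), and the toy usage treats substances as plain numbers.

```python
def compute_distance_with_set(mixture_a, mixture_set, distance):
    """Accumulate the distance between mixture_a and every mixture in a set.

    Follows the steps literally: repeatedly find the closest remaining
    mixture, add its distance to the running total, and remove it.
    """
    remaining = list(mixture_set)
    total_distance = 0.0
    while remaining:
        distances = [distance(mixture_a, mixture) for mixture in remaining]
        closest_index = min(range(len(remaining)), key=distances.__getitem__)
        total_distance += distances[closest_index]
        remaining.pop(closest_index)
    return total_distance


# Toy usage: "substances" are numbers and the distance is their difference.
toy_total = compute_distance_with_set(0, [3, 1, 2], lambda a, b: abs(a - b))
toy_empty = compute_distance_with_set(0, [], lambda a, b: abs(a - b))
```

Under this literal reading the result equals the sum of the distances from mixture_a to every member of mixture_set, since mixture_a itself never changes between iterations. An empty set gives a distance of zero.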

The most distinct molecule from the unselected set is thus chosen by:

  1. Selecting the molecule which is 'furthest' away from both the training set and the currently selected test set (which starts off empty), where the distance is defined as:

    sqrt(compute_distance_with_set(unselected_substance, training_set) ** 2 +
    compute_distance_with_set(unselected_substance, test_set) ** 2)

  2. Moving the selected molecule from the unselected set into the test set.

  3. Repeating steps 1 and 2 until either the target number of molecules has
    been selected, or there are no more unselected molecules to choose from.
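The greedy selection loop can be sketched as below. The name `select_most_distinct` and the choice to pass the set distance in as a `distance_with_set` parameter are assumptions made for illustration; the toy usage again treats substances as numbers.

```python
import math


def select_most_distinct(unselected, training_set, distance_with_set, target=10):
    """Greedily move the most distinct substances into a new test set."""
    unselected = list(unselected)
    test_set = []
    while unselected and len(test_set) < target:

        def score(substance):
            # Combine the distances to the training set and to the current test set.
            return math.hypot(
                distance_with_set(substance, training_set),
                distance_with_set(substance, test_set),
            )

        most_distinct = max(unselected, key=score)
        unselected.remove(most_distinct)
        test_set.append(most_distinct)
    return test_set


def toy_distance_with_set(substance, substances):
    """Toy set distance: sum of absolute differences between numbers."""
    return sum(abs(substance - other) for other in substances)


selected = select_most_distinct([1, 5, 9], [0], toy_distance_with_set, target=2)
```

In the toy example the first pick is driven only by the training set, because the test set starts off empty and contributes a distance of zero. Later picks are also pushed away from what has already been selected.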

The pure substances to include for $\rho_{pure}$ and $H_{vap}$ were then mainly chosen as the components of the substances selected for the mixture properties where data was available, and as components close to those where it was not.

Chosen Set

The final set contains ~48 pure data points and ~900 mixture data points.

View file: pure_set_componentsegfsefgse.pdf
View file: mixture_componentsexcess_molar_volume_binary_egesges.pdf
View file: full_set.csv

Extended Benchmark Set

...

enthalpy_of_mixing_binary_fefgsef.pdf
View file: density_binary_qwffqewfe.pdf

Questions we want to answer via benchmarking/what benchmark sets can we use to achieve this?

  • In general, to what extent are LJ parameters trained on mixture data transferable to other sets of mixtures?

    • The first mixture benchmark set (MB1) consists of heat of vaporization, mixture density, and excess molar volume data for alcohol/ester, alcohol/alcohol, alcohol/acid, ester/acid, acid/acid and alcohol/ether mixtures that have nothing in common with the mixtures we trained on.

  • If we train LJ parameters on one type of mixture, how well do those parameters transfer to other types of mixture?

    • Since we included only alcohol/ester mixtures in our training data, MB1 will allow us to look at the transferability of LJ parameters to other types of mixtures.

      • Alcohol/alcohol mixtures: We train on mixtures that contain alcohols, but only on mixtures of alcohols with esters. To what extent do the alcohol LJ parameters transfer to mixtures that only include alcohols?

      • Alcohol/ether mixtures: To what extent do the LJ parameters for alcohols, trained on ester mixtures, transfer to mixtures of alcohols and ethers (which have not been trained on at all)? Ethers should be similar to esters, so if they do quite poorly, this will be an issue.

  • How do benchmarked mixture properties vary as a function of composition?

    • Through benchmarking, can we identify the extent to which transferability affects mixture properties as a function of composition? For example, if we test against alcohol/ether mixtures and see worse performance as the ether concentration increases, then the alcohol parameters may be good while the ester parameters don’t transfer well to ethers.

    • MB1 will allow us to explore this, since we should have good coverage in the range x_A = 0.2-0.8. We should also consider adding some points in the 0.9-0.95 mole fraction region to check the behavior there. This could either be a separate set, or just something we break out in MB1.

  • Can we identify a “spectrum of transferability” for parameters in mixtures?

    • For example, within a benchmark set composed of mixtures that have at least one component in the training set (MB2), there are a large number of mixtures with tert-butanol. Assuming that tert-butanol is parameterized reasonably well, by examining the mixture properties and chemical similarity of the other moieties in these tert-butanol sets, can we identify how different a mixture can be before the transferability starts to degrade?

  • To what extent are mixture properties transferable from training on pure properties only? To what extent are pure properties transferable from training on mixture properties only?

    • Can we get a sense of the correlation between performance on mixture properties and pure properties (does low error in pure density imply low error in mixture density)? By benchmarking sets parameterized with only pure and only mixture data on MB1 and PB1 (the basic pure data benchmark set), we can analyze this.

  • How do parameters trained on mixture densities perform on excess molar volumes?

    • By looking at the subset of MB1 that includes excess molar volumes, can we accurately reproduce these by training on mixture densities? If we can’t, that may point to excess molar volumes not being very useful for us.
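For the last question, it may help to recall how a binary excess molar volume relates to densities: it is the molar volume of the mixture minus the mole-fraction-weighted molar volumes of the pure components. The sketch below assumes molar masses in g/mol and densities in g/cm^3 (giving cm^3/mol); the function name and the example values are hypothetical and are not taken from the benchmark set.

```python
def excess_molar_volume(x_1, molar_mass_1, molar_mass_2, rho_mix, rho_1, rho_2):
    """Excess molar volume of a binary mixture from mixture and pure densities."""
    x_2 = 1.0 - x_1
    mixture_volume = (x_1 * molar_mass_1 + x_2 * molar_mass_2) / rho_mix
    ideal_volume = x_1 * molar_mass_1 / rho_1 + x_2 * molar_mass_2 / rho_2
    return mixture_volume - ideal_volume


# Hypothetical component properties, for illustration only.
x_1, molar_mass_1, molar_mass_2 = 0.5, 60.0, 100.0
rho_1, rho_2 = 0.80, 0.90

# Density the mixture would have if the component volumes were purely additive.
ideal_rho_mix = (x_1 * molar_mass_1 + (1.0 - x_1) * molar_mass_2) / (
    x_1 * molar_mass_1 / rho_1 + (1.0 - x_1) * molar_mass_2 / rho_2
)

v_excess_ideal = excess_molar_volume(
    x_1, molar_mass_1, molar_mass_2, ideal_rho_mix, rho_1, rho_2
)
v_excess_denser = excess_molar_volume(
    x_1, molar_mass_1, molar_mass_2, 1.01 * ideal_rho_mix, rho_1, rho_2
)
```

Because the excess volume is a small difference between large molar volumes, a modest error in either the mixture density or the pure densities can dominate it.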

...