Attachments:

  • pure_components.pdf
  • mixture_components.pdf
  • full_set.csv

Extended Benchmark Set

As with the benchmark set chosen for the initial alcohol and ester study, when selecting the extended benchmark set we opted to be less systematic in choosing the individual substances (namely, we did not require that every property of interest be available for each substance). Instead, we focused on choosing as diverse a set as possible, given the limited diversity of the available data, which maximally exercised the refit parameters.

We endeavoured to select a set which:

  • did not contain any substance that appeared in the training set, but did allow substances whose individual components appeared in the training set (e.g. if the training set included a mixture of pentanol and hexane, the test set could contain pentanol and heptane, or hexanol and hexane, but not pentanol and hexane).

  • contained substances as distinct as possible from the training set, and from other molecules in the test set.
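
The first criterion can be illustrated by treating each substance as an unordered set of its components. The following is a minimal sketch only: the `is_allowed` helper and the use of plain component names (rather than, say, SMILES patterns) are assumptions made for illustration.

```python
def is_allowed(candidate, training_set):
    """Return True if the exact substance does not appear in the training set.

    Substances are compared as unordered sets of components, so
    (pentanol, hexane) and (hexane, pentanol) count as the same substance.
    """
    trained = {frozenset(substance) for substance in training_set}
    return frozenset(candidate) not in trained


# Hypothetical training set containing a single binary mixture.
training_set = [("pentanol", "hexane")]

print(is_allowed(("pentanol", "heptane"), training_set))  # shares only pentanol
print(is_allowed(("hexanol", "hexane"), training_set))    # shares only hexane
print(is_allowed(("hexane", "pentanol"), training_set))   # same substance, reordered
```

The first two candidates are accepted because only one component overlaps with the training mixture; the third is rejected because it is the training mixture itself.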

The test set is to contain $\rho(x)$, $\rho_{pure}$, $H_{vap}$, $H_{mix}$ and $V_{excess}$ data points for substances (both pure and binary) composed only of alkanes, alcohols, ethers, esters and ketones.

The selection of the benchmark set proceeds as follows:

  1. Filter out all of the data points which were measured for:

    1. substances that were included in the training set.

    2. molecules not composed of C, O and H.

    3. molecules with undefined stereochemistry.

    4. long-chain ethers or alkanes (these are difficult to pack into a simulation box and in general take longer to simulate).

    5. molecules which contain 1,3-dicarbonyl functionality where at least one of the carbonyl groups is a ketone. Substances containing such molecules will likely be mixtures of keto-enol tautomers, but the ratio in which the tautomers are present isn’t recorded by ThermoML.

  2. Cluster all of the available $\rho(x)$, $H_{mix}$ and $V_{excess}$ data points based on the chemical environments present in the substances that the data was measured for (e.g. cluster all ether-ester data, ketone-alcohol data etc.)

  3. For each type of property (e.g. $\rho(x)$) and each pair of environments (e.g. alcohol-ester):

    1. Select the substance which is ‘most distinct’ from the currently selected training set and the molecules selected for the test set of this property and environment.

    2. Repeat step 1 until either 10 substances have been selected, or there are no other substances to select from.
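
The clustering in step 2 can be sketched as a grouping by property type and unordered pair of environments. This is an illustrative sketch rather than the code used for the study: the record layout (`property`, `environments`, `substance` keys) and the example records are hypothetical.

```python
from collections import defaultdict


def cluster_by_environment(data_points):
    """Group substances by (property type, unordered pair of environments)."""
    clusters = defaultdict(set)
    for point in data_points:
        # Sort the environments so that ester-alcohol and alcohol-ester
        # data end up in the same cluster.
        key = (point["property"], tuple(sorted(point["environments"])))
        clusters[key].add(point["substance"])
    return dict(clusters)


# Hypothetical records; no measured values are needed for the clustering.
data_points = [
    {"property": "H_mix", "environments": ("ester", "alcohol"),
     "substance": ("methyl acetate", "ethanol")},
    {"property": "H_mix", "environments": ("alcohol", "ester"),
     "substance": ("propan-1-ol", "ethyl acetate")},
    {"property": "V_excess", "environments": ("ketone", "alkane"),
     "substance": ("acetone", "hexane")},
]

for key, substances in cluster_by_environment(data_points).items():
    print(key, sorted(substances))
```

Each resulting cluster is then the pool from which up to 10 substances are drawn in step 3.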

Defining ‘most distinct’

To determine how similar a substance is to another set of substances, we defined a distance metric based on a substance fingerprint.

For any binary substance composed of components a and b (represented as [a, b]), the substance’s fingerprint is defined as [f(a), f(b)], where f(x) is a function which computes the OpenEye Tree fingerprint of a molecule x.

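The distance between any two substances ([a_1, b_1], [a_2, b_2]) is then defined as

    min(d(f(a_1), f(a_2)) + d(f(b_1), f(b_2)),
        d(f(a_1), f(b_2)) + d(f(b_1), f(a_2)))

where d is the OETanimoto distance between two fingerprints. Taking the minimum over the two possible pairings of components makes the distance independent of the order in which the components are listed.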
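
The pairwise distance can be sketched in a few lines. This sketch makes two assumptions that are not taken from the study itself: fingerprints are represented as plain Python sets of "on" features (a stand-in for OpenEye tree fingerprints, which require the OpenEye toolkits), and the minimum is taken over the two possible pairings of components (a_1 with a_2 and b_1 with b_2, or a_1 with b_2 and b_1 with a_2). The function names are hypothetical.

```python
def tanimoto_distance(fp_1, fp_2):
    """1 - Tanimoto similarity of two fingerprints stored as sets of features."""
    union = fp_1 | fp_2
    if not union:
        # Two empty fingerprints are treated as identical.
        return 0.0
    return 1.0 - len(fp_1 & fp_2) / len(union)


def substance_distance(substance_1, substance_2):
    """Distance between two binary substances, each a pair of fingerprints."""
    (a_1, b_1), (a_2, b_2) = substance_1, substance_2
    return min(
        # Components paired in the order given.
        tanimoto_distance(a_1, a_2) + tanimoto_distance(b_1, b_2),
        # Components paired in the swapped order.
        tanimoto_distance(a_1, b_2) + tanimoto_distance(b_1, a_2),
    )


# Toy fingerprints (assumed, not real OpenEye output).
x = {1, 2, 3}
y = {3, 4}

print(substance_distance((x, y), (y, x)))  # 0.0: same substance, reordered
print(substance_distance((x, y), (x, x)))  # 0.75: only the second component differs
```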

The distance between a given substance mixture_a and a set of mixtures mixture_set is then computed by:

  1. Computing the distance between mixture_a and each mixture in mixture_set

  2. Removing the mixture from mixture_set which is closest to mixture_a, and adding the distance between the two to the total distance.

  3. Repeating step 2 until all mixtures have been removed from the mixture_set.
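
The three steps above can be sketched as follows. The function name `compute_distance_with_set` is taken from the expression used later on this page; passing the pairwise distance in as a `distance` callable is an assumption made so that the sketch stays self-contained.

```python
def compute_distance_with_set(mixture_a, mixture_set, distance):
    """Accumulate the distance between mixture_a and every mixture in mixture_set."""
    remaining = list(mixture_set)  # work on a copy; the caller's set is untouched
    total = 0.0
    while remaining:
        # Step 1: distance from mixture_a to each remaining mixture.
        distances = [distance(mixture_a, mixture_b) for mixture_b in remaining]
        # Step 2: remove the closest mixture and add its distance to the total.
        closest = min(range(len(remaining)), key=distances.__getitem__)
        total += distances[closest]
        remaining.pop(closest)
    # Step 3 is the loop itself: repeat until nothing remains.
    return total


# Toy usage with numbers standing in for mixtures (an assumption for brevity).
print(compute_distance_with_set(0, [1, 3, 6], lambda p, q: abs(p - q)))  # 10.0
```

As described, every mixture in mixture_set is visited exactly once, so the total equals the sum of the individual distances, and an empty set gives a distance of zero.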

The most distinct molecule from the unselected set is thus chosen by:

  1. Selecting the molecule which is 'furthest' away from both the training set and the currently selected test set (which starts off empty), where the distance is defined as:

    sqrt(compute_distance_with_set(unselected_substance, training_set) ** 2 +
    compute_distance_with_set(unselected_substance, test_set) ** 2)

  2. Moving the selected molecule from the unselected set into the test set.

  3. Repeating steps 1 and 2 until either the target number of molecules has been selected, or there are no more unselected molecules to choose from.
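
A minimal sketch of this greedy loop is given below. The function name `select_most_distinct` and its signature are hypothetical; the set distance is passed in as a callable, and the toy usage replaces substances with numbers purely to keep the example short.

```python
import math


def select_most_distinct(unselected, training_set, n_target, distance_with_set):
    """Greedily pick up to n_target substances that are furthest from both sets."""
    unselected = list(unselected)  # copy so the caller's list is not modified
    test_set = []                  # the test set starts off empty

    while unselected and len(test_set) < n_target:
        # Step 1: combine the distances to the training and test sets in quadrature.
        best = max(
            unselected,
            key=lambda substance: math.hypot(
                distance_with_set(substance, training_set),
                distance_with_set(substance, test_set),
            ),
        )
        # Step 2: move the chosen substance into the test set.
        unselected.remove(best)
        test_set.append(best)

    return test_set


def toy_distance_with_set(value, values):
    """Toy set distance (an assumption): sum of absolute differences."""
    return sum(abs(value - other) for other in values)


print(select_most_distinct([1, 5, 9], [0], 2, toy_distance_with_set))  # [9, 1]
```

In the toy run, 9 is chosen first because it is furthest from the training value 0; 1 is chosen next because its distance from the newly selected 9 outweighs its proximity to the training set.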

The pure substances to include for $\rho_{pure}$ and $H_{vap}$ were then mainly chosen to be the components already selected as part of the mixture properties where data was available, and components close to those where it was not.

The final set contains ~48 pure data points and ~900 mixture data points.

Questions we want to answer via benchmarking/what benchmark sets can we use to achieve this?

...