...
Extended Benchmark Set
As with the benchmark set chosen for the initial alcohol and ester study, we opted to be less systematic when choosing the individual substances for the extended benchmark set (namely, we did not require that each substance had data available for all of the properties of interest). Instead, we focused on choosing as diverse a set as possible, given the limited diversity of the available data, which maximally exercised the refit parameters.
We endeavoured to select a set which:
- did not contain exactly the substances which were trained on, but did allow substances whose individual components appear in the training set (e.g. if the training set included a mixture of pentanol and hexane, the test set would be allowed to contain pentanol and heptane, and hexanol and hexane, but not pentanol and hexane).
- contained substances as distinct as possible from the training set, and from the other molecules in the test set.
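The first criterion above can be sketched as a membership check on unordered component sets. This is a minimal illustration only: the function name `is_allowed_in_test_set` and the representation of a substance as a tuple of component names are assumptions made for the example, not part of the original workflow.

```python
def is_allowed_in_test_set(substance, training_substances):
    """Return True unless this exact combination of components was trained on.

    A substance is a tuple of component names (hypothetical representation).
    Component order is ignored, so (a, b) and (b, a) are the same substance.
    """
    trained = {frozenset(s) for s in training_substances}
    return frozenset(substance) not in trained


# Example from the text: the training set contains pentanol + hexane.
training = [("pentanol", "hexane")]

for candidate in [("pentanol", "heptane"), ("hexanol", "hexane"), ("hexane", "pentanol")]:
    print(candidate, is_allowed_in_test_set(candidate, training))
```

Only the last candidate is rejected, because it is the same pair of components as the training mixture written in the opposite order.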
The test set is to contain
Mathinline | ||||
---|---|---|---|---|
|
Mathinline | ||||
---|---|---|---|---|
|
Mathinline | ||||
---|---|---|---|---|
|
Mathinline | ||||
---|---|---|---|---|
|
Mathinline | ||||
---|---|---|---|---|
|
The selection of the benchmark set proceeds as follows:
1. Filter out all of the data points which were measured for:
   - substances that were included in the training set.
   - molecules not composed of C, O and H.
   - molecules with undefined stereochemistry.
   - long chain ethers or alkanes (these are difficult to pack into a simulation box and in general take longer to simulate).
   - molecules which contain 1,3-dicarbonyl functionality where at least one of the carbonyl groups is a ketone. Such substances will likely exist as mixtures of keto-enol tautomers, but the ratio in which the tautomers are present isn't recorded by ThermoML.
2. Cluster all of the available $\rho(x)$, $H_{mix}$ and $V_{excess}$ data points based on the chemical environments present in the substances that the data was measured for (e.g. cluster all ether-ester data, ketone-alcohol data etc.).
3. For each type of property (e.g. $\rho(x)$) and each pair of environments (e.g. alcohol-ester):
   a. Select the substance which is 'most distinct' from the currently selected training set and the molecules selected for the test set of this property and environment.
   b. Repeat a. until either 10 substances have been selected, or there are no other substances to select from.
Defining ‘most distinct’
To determine how similar a substance is to another set of substances, we defined a distance metric based on a substance fingerprint.
For any binary substance composed of components a and b (represented as [a, b]), the substance's fingerprint is defined as [f(a), f(b)], where f(x) is a function which computes the OpenEye Tree fingerprint of a molecule x.
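The distance between any two substances ([a_1, b_1], [a_2, b_2]) is then defined as the smaller of the two possible ways of pairing up their components:

    min(d(f(a_1), f(a_2)) + d(f(b_1), f(b_2)),
        d(f(a_1), f(b_2)) + d(f(b_1), f(a_2)))

where d is the OETanimoto distance between two fingerprints. Taking the minimum over both pairings makes the distance independent of the order in which the components of each substance are listed.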
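A minimal sketch of this substance distance is given below. To keep it runnable without an OpenEye licence, the fingerprint function is a hypothetical stand-in (`toy_fingerprint`, the set of character bigrams in a SMILES string) and `tanimoto_distance` is a plain set-based Tanimoto distance; neither is the OpenEye tree fingerprint or the OETanimoto call itself. The sketch also assumes the second term of the minimum pairs the components in swapped order, so that the result does not depend on component ordering.

```python
def toy_fingerprint(smiles):
    """Hypothetical stand-in for f(x): character bigrams of a SMILES string."""
    bigrams = frozenset(smiles[i:i + 2] for i in range(len(smiles) - 1))
    # Fall back to single characters for strings too short to have bigrams.
    return bigrams or frozenset(smiles)


def tanimoto_distance(fp_a, fp_b):
    """1 - |A & B| / |A | B| for two set-based fingerprints."""
    union = fp_a | fp_b
    if not union:
        return 0.0
    return 1.0 - len(fp_a & fp_b) / len(union)


def substance_distance(substance_1, substance_2, f=toy_fingerprint, d=tanimoto_distance):
    """Distance between two binary substances [a_1, b_1] and [a_2, b_2]."""
    a_1, b_1 = substance_1
    a_2, b_2 = substance_2

    # Pair the components in the order given ...
    direct = d(f(a_1), f(a_2)) + d(f(b_1), f(b_2))
    # ... and in swapped order (assumed intent), then keep the smaller total.
    swapped = d(f(a_1), f(b_2)) + d(f(b_1), f(a_2))

    return min(direct, swapped)


ethanol_hexane = ("CCO", "CCCCCC")
hexane_ethanol = ("CCCCCC", "CCO")
print(substance_distance(ethanol_hexane, hexane_ethanol))
```

With a real toolkit, `f` and `d` would be swapped for the tree fingerprint and a distance derived from the Tanimoto similarity; the pairing logic would stay the same.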
The distance between a given substance mixture_a and a set of mixtures mixture_set is then computed by:
1. Computing the distance between mixture_a and each mixture in mixture_set.
2. Removing the mixture from mixture_set which is closest to mixture_a and adding the distance between the two to the total distance.
3. Repeating step 2 until all mixtures have been removed from mixture_set.
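The steps above can be sketched as follows. The function name `compute_distance_with_set` is taken from the expression used later in this section, but its signature here (in particular the `distance` callable passed as a third argument) is an assumption made so the example is self-contained.

```python
def compute_distance_with_set(mixture_a, mixture_set, distance):
    """Total distance between one mixture and a set of mixtures.

    `distance` is any callable returning the distance between two mixtures
    (assumed interface; the real pairwise metric is described in the text).
    """
    remaining = list(mixture_set)  # work on a copy so the input is untouched
    total = 0.0

    while remaining:
        # Step 1: distance from mixture_a to every remaining mixture.
        distances = [distance(mixture_a, other) for other in remaining]
        # Step 2: remove the closest mixture and accumulate its distance.
        closest = min(range(len(remaining)), key=distances.__getitem__)
        total += distances[closest]
        remaining.pop(closest)
        # Step 3: loop until nothing is left.

    return total


# Toy usage with numbers standing in for mixtures.
print(compute_distance_with_set(0, [3, 1, 2], lambda a, b: abs(a - b)))
```

Because mixture_a stays fixed while mixtures are removed, the total is the same as a plain sum of the pairwise distances; the loop simply mirrors the procedure as written.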
The most distinct molecule from the unselected set is thus chosen by:
1. Selecting the molecule which is 'furthest' away from both the training set and the currently selected test set (which starts off empty), where the distance is defined as:

       sqrt(compute_distance_with_set(unselected_substance, training_set) ** 2 +
            compute_distance_with_set(unselected_substance, test_set) ** 2)

2. Moving the selected molecule from the unselected set into the test set.
3. Repeating steps 1 and 2 until either the target number of molecules has been selected, or there are no more unselected molecules to choose from.
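A minimal sketch of this greedy selection loop is shown below. The function name `select_most_distinct`, its arguments, and the toy distance used in the usage example are all hypothetical; only the scoring expression follows the text.

```python
import math


def select_most_distinct(unselected, training_set, target_count, distance_with_set):
    """Greedily build a test set of the substances furthest from everything chosen so far.

    `distance_with_set(substance, substances)` is an assumed callable returning
    the distance between one substance and a collection of substances.
    """
    unselected = list(unselected)  # copy so the caller's list is not modified
    test_set = []                  # starts off empty, as described in the text

    while unselected and len(test_set) < target_count:

        def score(substance):
            to_training = distance_with_set(substance, training_set)
            to_test = distance_with_set(substance, test_set)
            return math.sqrt(to_training ** 2 + to_test ** 2)

        # Step 1: pick the substance furthest from both sets.
        best = max(unselected, key=score)
        # Step 2: move it from the unselected pool into the test set.
        unselected.remove(best)
        test_set.append(best)

    return test_set


# Toy usage: integers stand in for substances, summed absolute difference
# stands in for the set distance.
def toy_distance_with_set(x, others):
    return float(sum(abs(x - other) for other in others))


print(select_most_distinct([1, 5, 10], [0], 2, toy_distance_with_set))
```

Scores are recomputed on every pass because the test set grows with each selection, which is what pushes later picks away from earlier ones.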
The pure substances to include for
Mathinline | ||||
---|---|---|---|---|
|
Mathinline | ||||
---|---|---|---|---|
|
The final set contains ~48 pure data points and ~900 mixture data points.
Questions we want to answer via benchmarking/what benchmark sets can we use to achieve this?
...