Filtering conformers for training

Background

One of the targets we use to fit valence parameters in a force field is the “optgeo” target, or the optimised geometries of molecule conformers. These conformers are generally generated to be diverse ( ), i.e. using Molecule.generate_conformers(max_confs=800, rms_threshold=1.0 * unit.angstrom)( ).

While this may cover a broad range of geometries, general diversity might not be the best solution. We may want to remove molecules with strong steric interactions, as well as molecules with intra-molecular H-bonds that result in very stable QM energies. If this helps, we would likely see an improvement on our ddE target (relative energy differences between conformers) and potentially the geometry RMSD targets.

Goal

Run experiments comparing the performance of force fields fit the current way, to force fields fit to filtered data.

To do

Task

Status

Notes

Task

Status

Notes

Filter the training and test datasets for undesired molecules

IN PROGRESS

Implemented 3 filters, may want to investigate the cutoff more

Run a force field fit

IN PROGRESS

Ran FF fit on datasets filtered with those 3 initial filters

Benchmark the results

IN PROGRESS

Benchmarked 3 initial filters

Report

Hydrogen bond filtering

To eliminate molecules with internal hydrogen bonds, I applied the HydrogenBondFilter using the Baker-Hubbard method to the geometry optimization dataset. This filter already exists and is typically applied to torsion drive datasets. I applied it to all of the conformers separately (as they are by default in the dataset), so the filter eliminates only individual conformers with internal H bonds, not entire molecules where one conformer has an internal H bond.

Sterics filtering

There are many potential approaches to filtering out conformers with steric clashes. The most straightforward way is to use the difference in Lennard-Jones energy from the lowest energy conformer as a proxy for steric clashes, and choose a cutoff that allows clashes deemed “reasonable” while eliminating those deemed to be too high-energy. There are two choices to be made, then: 1) lowest energy conformer and 2) the the energy cutoff.

1. Reference conformer

There are several possible ways to choose a “lowest energy conformer” as our reference point. We could choose the conformer with the lowest LJ energy, the lowest MM energy when all the structures are optimized with QM (as they are on QCArchive), or the lowest MM energy when all structures are optimized with a force field. In principle, it seems like using the lowest MM energy after all structures are optimized with MM would make the most sense, as this is how the force field fitting pipeline sees the structures, but we consider all three options here. One could also use the lowest QM energy, but we have decided not to pursue this for now. Once we have the lowest energy reference conformer, we calculate the difference in LJ energy from that conformer for use in the filter:

LJRef:

MMRef:

MMRef Opt (or MM opt):

2. Energy cutoff

The second choice to make is the cutoff for the LJ energy differences shown above, e.g. how high-energy of a clash do we want to keep? To answer this, we investigated different types of clashes to see how these would be treated by the respective filters.

Example conformers

First we examine some common conformations that will be instructive. These conformers have not been optimized at any level of theory, except for the columns marked “MM-opt” which have been optimized with MM (Sage 2.1.0).

For dichloroethane and butane, we would expect to want to keep all conformers, as these are pretty common motifs. Without optimization, the dichloroethane structures are nearly the same energy, however the butane structures differ by about 2.4 kcal/mol (~4 kT) from the lowest energy conformer (using either the LJ or MM energy as a reference). After optimization, however, the structures do not change substantially (evidenced by both inspection and the RMSD), yet the steric energy differences nearly vanish. This suggests that for our unoptimized energy filters we will need a cutoff of at least 2.4 kcal/mol to capture rotations about a CH2-CH2 alkane bond, while the optimization takes care of this interaction in the optimized energy filter.

Here, for the unoptimized energies and structures, the energy differences between conformers without any noticeable clashes (e.g. conformers 0, 1, 3) are all very small (< kT). For conformers with larger clashes (e.g. conformers 5 and 9), the energies are relatively large (~3.9 kcal/mol or >6 kT for 5, 6, and 8, and 7.9 kcal/mol or > 8 kT for 9). These seem like structures we could safely eliminate, given this high strain level. After optimization, all of the conformers except the most strained (9) are below 1 kcal/mol (~0.75 kT), and 9 is still well below the 2.4 kcal/mol (4 kT) cutoff that seemed appropriate for butane given the unoptimized energies. However, some of the conformers have undergone substantial relaxation, particularly conformers 3 and 4 which are qualitatively different than the unoptimized conformer. This does not affect the energy cutoff so much, although it means it is harder to assess the energy cutoff that would allow these conformations, but is something to keep in mind.

Training data set

We then investigated types of conformers in the training set that would be eliminated by different cutoffs.

Above are two examples of clashes with energies greater than 7 kT (~4.1 kcal/mol) for all reference conformers. Both of these show clashes that seem to be genuine steric clashes we’d like to avoid.

While using the optimized MM conformers to determine the lowest energy reference conformer typically lowers the energy compared to the unoptimized structures, we still may want to choose a relatively high cutoff here too. For example, the above image shows that simple planar conformations would be eliminated using a cutoff less than 5 kT (2.96 kcal/mol).

Based on this study, we decided to start with a cutoff of 6 kT (3.556 kcal/mol) for all filters, to allow for simple conformation changes but eliminate those seen in conformers 5 and 9 for hex-3-ene. However, this cutoff should be revisited, especially for the optimized filter.

Fitting the force fields

 

The training dataset was filtered in preparation for force field fitting. Note that the filter order is important as certain filters can affect which conformer might be considered the reference conformer. First, the following filters were applied to the dataset:

  • RecordStatusFilter(status=RecordStatusEnum.complete)

  • ConnectivityFilter(tolerance=1.2)

  • UnperceivableStereoFilter()

  • ElementFilter(): filtering for elements H, C, N, O, S, P, F, Cl, Br and optionally I.

      Some iodine records in older datasets cannot be used, so these are filtered out.

  • ChargeCheckFilter(): checking that am1bccelf10 charges can be assigned

Then, the following conformer-specific filters were applied, in the following order and on one processor as the ConformerRMSDFilter() and sterics filters will not perform reliably on multiple cores:

  • HydrogenBondFilter()

  • ConformerRMSDFilter()

  • The sterics filter (if any)

Sage 2.1.0 was then re-fit to these filtered datasets, starting from MSM bonds/angles and Sage 2.1.0 torsions.

Results

 

 

The overall DDE benchmarks have improved with both the MM Ref Opt filter and the LJ Ref filter, stayed roughly equivalent to Sage with the MM Ref filter, and gotten worse with just the HBond filter. This suggests that just filtering out molecules that are energy-stabilized (by H-bonds) does not work, unless you also filter out molecules that are energy-destabilized (by sterics). Interestingly, if the results are broken down into benchmark molecules where the conformer would be eliminated by a given filter (e.g. molecules in the benchmark set with internal H-bonds or steric clashes), only the HBond filter performs remotely well, whereas all of the sterics filters perform substantially worse than Sage, especially the MMRef Opt.

The structural benchmarks, RMSD and TFD, are not strongly influenced by the filters. However, when examining performance on those molecules that would be filtered out of the dataset, we again observe substantially worse performance for the force fields fit to the sterics-filtered data.

Re-benchmarking with new and improved YAMMBS

After discovering some bugs in the old YAMMBS used to perform these benchmarks, as well as determining that KDE plots were not the best way to present the data, I re-did the benchmarks. Overall, the results largely stand--the HBond + LJRef filter improves energies a bit compared to Sage 2.1.0, but all others are either equivalent or worse. I did not find much difference when filtering for molecules that would be included/excluded by the various filters; the performance was generally the same across all sets.

Those results can be found here

Discussion and next steps

These mixed results suggest that while in principle interactions like internal H-bonds and steric clashes should be treated by non-valence parameters such as van der Waals parameters, the parameters currently in Sage are not describing them well enough to eliminate their influence on the valence parameters. This isn’t so surprising, as our van der Waals parameters were trained in conjunction with existing valence parameters that did not have these effects filtered out. As a next step, we are investigating re-fitting the van der Waals parameters with the LJRef and MMOpt force fields.