Currently, our Sage force fields do not incorporate energetics into the fitting targets, except in the form of torsion drives. We wanted to explore the effect of fitting primarily to ab initio energy targets.
Dataset preparation
To start with, I converted the Sage optimized geometry training set to ab initio targets.
Atom ordering
One challenge is that the optimized geometry training set is assembled from several smaller datasets, and atom ordering for a given molecule is not necessarily consistent across them. As a result, the order of the gradients and coordinates may differ between conformers of the same molecule if those conformers come from different datasets. To get around this, I checked each conformer of a molecule against a reference conformer (the first one encountered during processing) to ensure that its atom ordering and connectivity were either identical to those of the reference conformer, or isomorphic and able to be remapped to match it. For conformers that were isomorphic with the reference conformer, the geometries and gradients were remapped to ensure consistent atom ordering across all conformers in the training set. The 95 conformers (out of 5043) that could not be remapped to the reference conformer were eliminated from the training set.
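The remapping step can be illustrated with a minimal sketch. It assumes an atom map from the reference conformer to the conformer being processed is already available (for example from a graph isomorphism check); the function name remap_per_atom and the toy data are hypothetical and are not taken from the actual processing scripts.

```python
def remap_per_atom(values, ref_to_conf):
    """Reorder per-atom rows (coordinates or gradients) into reference order.

    ref_to_conf[i] is the index, in the conformer's own ordering, of the atom
    that corresponds to atom i of the reference conformer (assumed convention).
    """
    n_atoms = len(values)
    expected = list(range(n_atoms))
    # A usable map must be a permutation of 0..n_atoms-1; otherwise the
    # conformer cannot be remapped and should be dropped by the caller.
    if sorted(ref_to_conf.keys()) != expected:
        raise ValueError("atom map does not cover every reference atom")
    if sorted(ref_to_conf.values()) != expected:
        raise ValueError("atom map is not a one-to-one mapping")
    return [values[ref_to_conf[i]] for i in range(n_atoms)]


# Toy three-atom conformer whose atoms are stored in a different order
# from the reference conformer.
coords = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
grads = [[0.1, 0.0, 0.0], [0.0, 0.2, 0.0], [0.0, 0.0, 0.3]]
atom_map = {0: 2, 1: 0, 2: 1}  # reference atom index -> conformer atom index

# The same map is applied to both arrays so they stay consistent.
remapped_coords = remap_per_atom(coords, atom_map)
remapped_grads = remap_per_atom(grads, atom_map)
```

Applying one map to both the geometry and the gradient keeps each gradient row attached to the atom it was computed for.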
Single conformer molecules and divide-by-zero errors
The ab initio target fits the relative energy between a given conformer and the minimum QM energy conformer, so there must be at least two conformers to create an ab initio target. Molecules with only one conformer can either be filtered out completely (which is the approach taken here), or set up as force-only targets, as the forces are not fit as relative quantities. Applying only this single-conformer filter eliminated 509 molecules, and the resulting dataset is the one used for testing here.
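As a rough sketch of the single-conformer filter and of the relative energies described above: the molecule labels, energies, and function names below are made up for illustration and do not come from the real dataset or from ForceBalance.

```python
def drop_single_conformer_molecules(energies_by_molecule):
    """Keep only molecules with at least two conformers."""
    return {
        mol: energies
        for mol, energies in energies_by_molecule.items()
        if len(energies) >= 2
    }


def relative_energies(energies):
    """Energy of each conformer relative to the lowest-energy conformer."""
    e_min = min(energies)
    return [e - e_min for e in energies]


# Hypothetical QM energies (arbitrary units) keyed by a molecule label.
qm_energies = {
    "mol_a": [-10.0, -9.5, -8.0],
    "mol_b": [-3.2],  # single conformer: no relative energy can be defined
}

kept = drop_single_conformer_molecules(qm_energies)
rel = {mol: relative_energies(e) for mol, e in kept.items()}
```

A molecule like mol_b could instead be routed to a force-only target rather than being discarded.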
By default, ForceBalance weights the energy targets by the energy difference from the QM minimum energy conformer. This can be overridden, but if the default behavior is desired, degenerate conformers must be removed from the dataset to avoid errors resulting from dividing by zero. If degenerate conformers are filtered out, check whether that filtering leaves any molecules with only one conformer, and handle those in the same way as other single-conformer molecules (filter them out or set them up as force-only targets). Using our dataset, this led to 98 additional molecules being eliminated (though results for that dataset are not reported here).
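A minimal sketch of this two-stage filter follows. It assumes that "degenerate" means a conformer whose QM energy matches the minimum to within a tolerance; that interpretation, the tolerance value, and the function names are assumptions made for illustration only.

```python
def keep_non_degenerate(energies, tol=1e-6):
    """Indices to keep: one minimum-energy conformer plus every conformer
    whose energy is more than `tol` above the minimum (assumed criterion)."""
    i_min = min(range(len(energies)), key=energies.__getitem__)
    e_min = energies[i_min]
    return [i for i, e in enumerate(energies) if i == i_min or e - e_min > tol]


# Hypothetical QM energies (arbitrary units).
qm_energies = {
    "mol_a": [-10.0, -10.0, -9.0],  # one duplicate of the minimum
    "mol_b": [-5.0, -5.0],          # both conformers are degenerate
}

kept_indices = {mol: keep_non_degenerate(e) for mol, e in qm_energies.items()}

# Second pass: removing degenerate conformers can leave a single conformer.
now_single = [mol for mol, idx in kept_indices.items() if len(idx) < 2]
usable = {mol: idx for mol, idx in kept_indices.items() if len(idx) >= 2}
```

Keeping indices rather than energies makes it straightforward to drop the matching geometries and gradients as well.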
Additionally, ForceBalance filters out conformers that are too high in energy during the fitting process. If any molecule has only one conformer below this cutoff and you are using the default ForceBalance weighting, that molecule needs to be treated as a single-conformer molecule and either filtered out or set to fit forces only. Using our dataset, this led to 2 additional molecules being filtered out (though results are not reported).
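The cutoff check can be sketched in the same way. The cutoff value, the units, the strict comparison, and the function name below are all placeholders; the real threshold should be taken from the ForceBalance settings used for the fit.

```python
def conformers_below_cutoff(energies, cutoff):
    """Indices of conformers whose energy relative to the minimum is below
    `cutoff` (same units as `energies`)."""
    e_min = min(energies)
    return [i for i, e in enumerate(energies) if e - e_min < cutoff]


# Hypothetical QM energies and cutoff, both in the same (assumed) units.
qm_energies = {
    "mol_a": [0.0, 5.0, 45.0],
    "mol_b": [0.0, 60.0],
}
cutoff = 30.0  # placeholder value, not a confirmed ForceBalance default

below = {mol: conformers_below_cutoff(e, cutoff) for mol, e in qm_energies.items()}

# Molecules left with fewer than two conformers under the cutoff need the
# single-conformer treatment (filter out or fit forces only).
effectively_single = [mol for mol, idx in below.items() if len(idx) < 2]
```

Running this check before the fit makes it possible to filter such molecules out, or switch them to force-only targets, ahead of time.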