Participants

Discussion topics

Item

Notes

General updates

  • JW – Bespokefit micromamba deployment issue is resolved

  • JW – Bespokefit is not yet updated to QC* 0.50; I need to update the Toolkit first. Expect this around January.

    • JH – The QC stuff might become time-sensitive for new datasets.

    • JH – I’m having trouble finding some old torsiondrive datasets on the new QCA. Looks like I’m using the ML server URL instead of the OFF one.

    • PB – Searching for fragment torsiondrives on QCA?

    • JH – Yes, but I wasn’t able to find them

    • PB – When you have that working, could you post how you run bespokefit with torsiondrives pulled from QCA instead of the QC generation step?

    • JH – Can do, there’s a CLI command that pulls a dataset from QCA and adds it to internal storage, which can then be pulled as a result during runtime if an identical torsiondrive is submitted.

    • DC – Will this need the new QC stack?

      • PB – I don’t think so.

      • JW – Agree, all old datasets (>6 months ago) should still be findable.
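
The reuse JH describes above, where an imported torsiondrive is returned at runtime when an identical one is submitted, can be pictured as a keyed cache lookup. The sketch below is only an illustration of that idea: `TorsionDriveKey`, `ResultCache`, and `run_or_reuse` are hypothetical names, and this is not bespokefit's actual storage API or CLI.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class TorsionDriveKey:
    """Hypothetical identity of a torsiondrive task (all fields are assumptions)."""
    smiles: str                          # molecule (or fragment) identifier
    dihedral: Tuple[int, int, int, int]  # atom indices of the driven torsion
    method: str                          # QC method name
    basis: str                           # QC basis set name


class ResultCache:
    """Toy in-memory stand-in for the internal result storage."""

    def __init__(self) -> None:
        self._store: Dict[TorsionDriveKey, dict] = {}

    def add(self, key: TorsionDriveKey, result: dict) -> None:
        self._store[key] = result

    def get(self, key: TorsionDriveKey) -> Optional[dict]:
        return self._store.get(key)


def run_or_reuse(
    cache: ResultCache,
    key: TorsionDriveKey,
    compute: Callable[[TorsionDriveKey], dict],
) -> Tuple[dict, bool]:
    """Return (result, reused). Only call `compute` when no identical task is cached."""
    cached = cache.get(key)
    if cached is not None:
        return cached, True
    result = compute(key)
    cache.add(key, result)
    return result, False
```

In this picture, importing a dataset from QCA corresponds to calling `add` for each record ahead of time, so that a later identical submission skips the QC generation step.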

Bad-looking torsion fits

  • DC – Could you recap the problem+issue+possible solution?

  • WW – We saw this problem when we tried to do an end-state correction, using an ML FF and correcting with an MM FF. There was a case where we found almost no overlap between the ML and MM Hamiltonians. We found that one particular torsion was responsible: in ANI2x it was almost always at 0 degrees, but in the MM simulation it was almost always at ±180. So we took this ligand, ran a QM torsiondrive, and got QM conformers across this torsion space. Then we took the QM conformers, did a restrained MM optimization (with the torsion restrained), and found a big discrepancy. In the MM torsiondrive, 0 deg is 1.5 kcal/mol lower than ±180, whereas in QM the difference was 6 kcal/mol. JH had some difficulty reproducing this, but more recently it sounds like he had some success.

  • DC – Can we establish whether this is a bespokefit error or a science error? Like, is bespokefit correctly doing what it sets out to do, but what it sets out to do isn’t good in this case.

  • JH – I think bespokefit is doing what it sets out to do. But I’m finding that the QM whole-mol energy surface is different from the QM fragment energy surface. Also seeing significant differences between OE charges vs. AT charges.

  • DM – Possibly related to ELF discussion with OpenFE?

  • JH – Possibly something is being fit with OE in one stage, but AT in another?

  • WW – I don’t see any OE license warnings being printed. I am familiar with the difference and have looked for the warnings.

  • JH – (shows plot indicating a difference between the OE and AT torsion profiles: ~7 kcal/mol in OE, ~10 in AT.) Also, there’s a big difference depending on whether I restrict an amide on the far side of the mol.

  • JW – I recall one theory being that the fragment could be fit well, but that the torsion parameter being used in the whole mol leads to a bad profile. Is that thought to be the case? And are these plots using parameters fit to just the fragment, or fit to the whole mol?

    • JH – This is fit to fragment, and evaluated on whole mol.

  • JH – Can set the fragmentation engine to “don’t fragment”. Then the torsiondrives will run with the whole mol.

  • WW – Would be worth trying

  • JH – Can share a workflow script without fragmentation

  • DC – Should use xTB for runtime reasons. And note that then the torsiondrive will have torsions other than the one being constrained in the training data.

  • WW – Right, that makes sense. We looked at the other torsions and they had errors around 1.5 kcal/mol, which didn’t lead to a problem in the overall sim

  • JH – Also tried constraining the amide and compared torsiondrives of the problematic torsion

  • WW – At OMSF meeting, there was discussion of optimizing bond and angle parameters. Might that be useful here?

    • JH – Possible, but the bonds and angles are relaxed in these scans.

    • WW – The QM equilibrium bonds and angles could be different from the MM equilibrium bonds and angles. This deviation wouldn’t be the same across dihedral space.

    • DC – Right, that’s why we want to compare QM optimized geos to MM optimized geos…

  • DC – So, here I think bespokefit is doing what we intend it to do, but the fragment parameters aren’t transferable to the full mol.
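
To get a feel for why the gap between 1.5 and 6 kcal/mol matters, the relative Boltzmann population of two torsion states can be estimated from their energy difference. This is a rough sketch under assumptions not stated in the notes: a temperature of 300 K, two equally wide states, and no entropic or solvent effects. It illustrates the scale of the effect only and does not reproduce the simulations discussed.

```python
import math

K_B_KCAL = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def population_ratio(delta_e_kcal: float, temperature_k: float = 300.0) -> float:
    """Ratio of lower-energy to higher-energy state populations.

    delta_e_kcal is the energy gap (higher minus lower) in kcal/mol.
    The 300 K default is an assumption, not a value from the notes.
    """
    return math.exp(delta_e_kcal / (K_B_KCAL * temperature_k))


# Energy gaps between 0 deg and +-180 deg quoted in the discussion.
for label, gap in (("MM torsiondrive", 1.5), ("QM torsiondrive", 6.0)):
    print(f"{label}: gap {gap} kcal/mol -> ratio {population_ratio(gap):.0f}:1")
```

Under these assumptions the 1.5 kcal/mol gap corresponds to a ratio on the order of 10:1, while the 6 kcal/mol gap corresponds to a ratio above 20,000:1, so the two surfaces imply very different torsion populations.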

Possible future scheme for bespokefit (see 5th Dec)

  • DC – We put together a plan for looking at how we could modify bespokefit to fit bond and angle parameters to reference data, as well as to work on the whole mol. That would seem to be good in this case. Idea is to generate several confs of the molecule, run MD at elevated temp using ML potential, capture multiple energies and forces, and use new fitting routine to fit to these points. This would be anticipated to solve cases like this.

  • WW – Kind of related, but separate: when people are training ML potentials, one approach is to run sims at ~500K and grab samples, and the other is to do normal mode sampling. Have you considered normal mode sampling, or do you think a 500K sim would be better?

    • DC – They started training ANI using normal mode sampling, but they moved on to using high-temp MD. I don’t know the detailed reason they did this, I suspect it’s more automated/needs less manual intervention.

    • DM – I suspect that normal mode sampling explores some less realistic structures, whereas high temp MD is kinda constrained to a reasonable guess of the energy landscape.

    • DC – Right, we want to get the ~1-5 kcal/mol regions, not to 10kcal/mol.

    • WW – What do you mean by that?

    • DC – There’s a new paper coming about this (see the MACE potential). They looked at the SPICE dataset, which was generated by 500K MD, and it turns out that even high-energy barriers were captured.

    • JH – Saw similar stuff with Espaloma training, if I recall correctly.

  • DC – Next steps?

  • WW – Should run parameterization with the full mol; this will establish whether the issue is with fragmentation.

    • JH – Makes sense. I also saw that at one point your plot mentioned QForce. Does that do fragmentation?

    • WW – It does (something with neighbors, default 3 neighbors). If you look at the resulting distribution, it is mostly around 0 deg, much less pop around 180 deg.

    • JH – Does the fragmentation occur similarly?

    • WW – No, the QForce fragmentation makes a much larger fragment. It includes the amide on the other side.

    • JH – In this case, it looks like fragmentation truncated too early.

    • WW – details of fragmentation?

    • JH – Calculates AM1 WBO for potential fragments, tries to get bond orders in fragment that are similar to BO in parent, grows fragment otherwise.

    • WW – …

    • DM – You want fragments to be as small as possible as long as they have the same electronic structure as large mol. So you take the bond orders in the large mol, and then you make the fragments and see whether the bond orders are similar to the large mol. If not, then you grow the fragment until it becomes more similar to large mol. For parent mols without conjugation this usually leads to small fragments.

    • WW – So you’re only checking the bond order of central torsion, or bond orders of other bonds as well?

    • JH – Just the central bond/bond to be fit.

  • DC – So, next steps are to see whether varying the WBO threshold helps in this case, and then we can add this to the test set for a new scheme.
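
The grow-until-similar procedure JH and DM describe, and the effect of varying the WBO threshold, can be sketched as follows. This is a simplified illustration, not the fragmenter's actual implementation: `choose_fragment`, the pre-ordered candidate list, the WBO lookup function, and the numeric values are all hypothetical.

```python
from typing import Callable, Sequence, TypeVar

Fragment = TypeVar("Fragment")


def choose_fragment(
    candidates: Sequence[Fragment],
    central_bond_wbo: Callable[[Fragment], float],
    parent_wbo: float,
    threshold: float,
) -> Fragment:
    """Pick the smallest candidate whose central-bond WBO matches the parent.

    `candidates` must be ordered from smallest to largest fragment. Only the
    bond order of the central bond (the bond being fit) is compared, as
    described in the discussion. If no candidate is within `threshold`,
    the largest candidate is returned.
    """
    if not candidates:
        raise ValueError("at least one candidate fragment is required")
    for fragment in candidates:
        if abs(central_bond_wbo(fragment) - parent_wbo) <= threshold:
            return fragment
    return candidates[-1]


# Toy data: made-up WBO values for three nested fragments of one parent.
toy_wbo = {"small": 1.10, "medium": 1.04, "large": 1.01}
fragments = ["small", "medium", "large"]

loose = choose_fragment(fragments, toy_wbo.__getitem__, parent_wbo=1.00, threshold=0.05)
tight = choose_fragment(fragments, toy_wbo.__getitem__, parent_wbo=1.00, threshold=0.02)
print(loose, tight)
```

With the toy numbers, tightening the threshold selects a larger fragment, which is the behavior to look for when checking whether the fragmentation truncated too early in this case.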

Dealing with very large datasets on QCA

  • DC – MLe is interested in training beyond bespokefit (like new charge models). The best dataset we have available through OFF is the Hartree-Fock one, and we’re wondering about doing a better level of theory. Turns out MLe has already done it. I thought it’d be good to have an initial chat.

  • MLe – We’ve calculated wavefunctions of ~350k mols, 3 confs each. Calculated in psi4. All geometry-optimized with xtb and then a singlepoint with (a very good basis+method, thought to be close to coupled cluster). However we didn’t save the results the first time, which was a big mistake, and now we’re calculating them a second time. Two potential issues are that:

    • Paper isn’t yet published

    • It’s 52TB

  • DC – Sounds about right, when we were scoping this out we estimated 2TB for 50K mols with 1 conf each. The scale is roughly right here.

  • MLe – Having multiple conformers was really useful here, saved us in some debugging.

  • JW – I could get you in contact with the maintainer of QCA.

  • MLe – Almost all the storage is numpy-compressed wavefunctions. There were tricks we could do to save a few more TB but there are decreasing marginal returns and it gets hacky

  • JH – One trick that we do with wavefunctions is that we don’t save everything, we just save orbitals. I know that’s sufficient for psi4 to recalculate other things (including density), but maybe not all.

  • DM – One thing to consider is how to best make this data reproducible and accessible. There is probably a better way to handle this than mailing hard disks.

  • DM (chat) – Two thoughts:

    1. Have you reached out to MolSSI about whether they could be enticed to host in QCArchive or a special/custom QCArchive instance?

    2. What about Zenodo? They can accommodate very large datasets in a one-off manner (otherwise by default the basic limit is 50 GB per file, but no limit on how many files there are)

    3. (1) is what Jeff is saying.

    4. Maybe it would be interesting if you could do some kind of hybrid thing, like … put individual molecule results in Zenodo in a machine-readable way and then have the rest of the stuff (molecules, logs, etc) somewhere else (like QCArchive) with some kind of annotation about where to recover the wavefunctions?

  • MLe – As for MBIS and partial charges, those are already published on the ETH archive and are publicly available. Only ~7.5GB.

  • DC – I initially pushed saving wavefunctions, since OpenFF doesn’t generally save MBIS charges. But other than being able to recalculate wavefunctions… (thinking about other applications)

  • JH – If you have MBIS partial charges but not other things, you can reconstruct them from orbitals+wavefunctions using psi4. It just takes a while because you have to re-converge the SCF. Generally, the dataset sounds like what we want, but we will want more than partial charges.

  • DC – Also, in RESP2 approach, you need epsilon=0 and 80. This dataset uses eps=4, the choice is a bit arbitrary.

  • WW – When you say “better”, it’s clear that many things are better than H-F. But in this case, for MM applications, is an eps of 4 better than other values?

  • DC – At OMSF meeting, Bill Swope showed some plots of vacuum dipole moments for H-F vs. higher-level QM. If you think that H-F would uniformly overpolarize you’d expect all the deviation to only be in one direction, but data showed that while that was the AVERAGE outcome, there were many outliers that UNDERpolarized using H-F.

  • WW – …

  • MLe – Yeah, we’ve learned that we don’t want to do this twice, that’s why we only did it once.

  • WW – There’s an OpenFF charge method in the works (I forgot the name). Will this dataset be used to develop a fast charge tool?

    • DM – We are making a fast charge method (NAGL). We haven’t settled on what it will be trained on. As a first step it’ll be trained on AM1BCCELF10. This will make conformer-independent charges, and so should resolve some of the difference we’ve seen between the OE and AmberTools backends. Once we have a AM1BCCELF10 replacement validated for use, we’ll release it, and begin studying which higher-quality methods we can train it to.

    • DC – …

    • DM – Right, we’re doing a lot of validation to ensure we don’t do worse than before.
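
DC's comment that the 52TB figure is "roughly right" can be checked by scaling the earlier scoping estimate (2TB for 50K mols with 1 conf each) to ~350k mols with 3 confs each. A small sketch, assuming decimal units (1 TB = 1000 GB) and that storage scales linearly with the number of stored wavefunctions:

```python
def gb_per_wavefunction(total_tb: float, n_mols: int, confs_per_mol: int) -> float:
    """Average storage per conformer wavefunction in GB (assumes 1 TB = 1000 GB)."""
    return total_tb * 1000.0 / (n_mols * confs_per_mol)


# Scoping estimate quoted by DC: 2 TB for 50K mols, 1 conf each.
scoping_gb = gb_per_wavefunction(2.0, 50_000, 1)

# Dataset described by MLe: 52 TB for ~350k mols, 3 confs each.
actual_gb = gb_per_wavefunction(52.0, 350_000, 3)

# Scoping estimate scaled linearly to the size of MLe's dataset, in TB.
projected_tb = scoping_gb * 350_000 * 3 / 1000.0

print(f"scoping:   {scoping_gb * 1000:.0f} MB per wavefunction")
print(f"actual:    {actual_gb * 1000:.0f} MB per wavefunction")
print(f"projected: {projected_tb:.0f} TB vs. 52 TB reported")
```

The linear projection comes to about 42 TB, within roughly 25% of the reported 52TB.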

Action items

  •  

Decisions