LW – A conda-installable package that eats a force field and returns summary stats, and “just works” in vanilla mode
LW – Should have a plugin interface
JW – Preprogrammed “datasets” for each “target”, with current standard metrics (these should be updatable and remain forward-compatible, i.e. new versions can still run historical “standard” metrics)
A Python API that can be extended to a CLI
Modular design - the workflow should be able to dump state to disk between stages (see the API sketch below)
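(sketch) – A rough, purely hypothetical take on the “vanilla mode” Python API and stage-wise checkpointing described above; every name here (BenchmarkWorkflow, run_stage, to_disk, the default target list) is invented for illustration rather than an agreed design. A CLI could then be a thin wrapper that builds this object from command-line arguments.

```python
from __future__ import annotations

import json
from pathlib import Path


class BenchmarkWorkflow:
    """Eats a force field, runs a preprogrammed set of targets, returns summary stats."""

    def __init__(self, force_field: str, targets: list[str] | None = None):
        self.force_field = force_field
        # "Vanilla mode": a default, versioned set of standard targets.
        self.targets = targets or ["geometry", "energy", "phys-prop"]
        self._results: dict[str, dict] = {}

    def run_stage(self, target: str) -> dict:
        # Each stage is independent so it can be run, checkpointed, and resumed
        # separately; a real implementation would dispatch to a metric plugin here.
        result = {"target": target, "force_field": self.force_field, "summary": {}}
        self._results[target] = result
        return result

    def to_disk(self, directory: Path) -> None:
        # Dump intermediate state between stages so a partially completed
        # workflow can be resumed or inspected later.
        directory.mkdir(parents=True, exist_ok=True)
        for name, result in self._results.items():
            (directory / f"{name}.json").write_text(json.dumps(result, indent=2))


if __name__ == "__main__":
    workflow = BenchmarkWorkflow("openff-2.0.0.offxml")
    for target in workflow.targets:
        workflow.run_stage(target)
    workflow.to_disk(Path("benchmark-state"))
```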
For force fields
OFFXML files
(optional) openmmforcefields inputs
(optional) plugin interface that would allow an eventual “grompp it” script to prepare molecules
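(sketch) – A hedged illustration of what a force-field input plugin interface might look like; the Protocol and class names are invented, the OFFXML branch assumes openff-toolkit is installed (its import path has moved between versions), and the GROMACS provider is just a stub for the eventual “grompp it” idea.

```python
from typing import Any, Protocol


class ForceFieldProvider(Protocol):
    """Anything that can turn a chemical topology into an engine-ready system."""

    def parameterize(self, topology: Any) -> Any:
        ...


class OFFXMLProvider:
    """SMIRNOFF force fields distributed as OFFXML files."""

    def __init__(self, offxml: str):
        # Older toolkit versions expose ForceField under
        # openff.toolkit.typing.engines.smirnoff instead of the top level.
        from openff.toolkit import ForceField

        self.force_field = ForceField(offxml)

    def parameterize(self, topology):
        return self.force_field.create_openmm_system(topology)


class GromacsProvider:
    """Placeholder for the eventual "grompp it" plugin for pre-built GROMACS inputs."""

    def parameterize(self, topology):
        raise NotImplementedError("External-engine inputs are out of scope for the MVP.")
```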
Some way to specify metrics/targets (a metric-plugin sketch follows this block)
Electrostatics stuff
Electric fields and ESP grids, from openff-recharge
Geometry comparison
TFD and RMSD, from openff-benchmark
Energy comparison
RMSE from openff-benchmark
Density, Hvap, dielectric constant, and the other physical properties from openff-evaluator
(Optional) ForceBalance objective function
(There will be a bunch of knobs and dials for weights and other objective function calculation details)
MT – Really wouldn’t like this to be in-scope. Making ForceBalance a dependency in production adds unnecessary size/complexity.
MT – I’d like to NOT offer a combined objective function of these metrics.
JW – Agree, in a good modular design the plugins shouldn’t know about each other, and a combined objective would need to know about the other plugins.
LW – Agree for the same reasons.
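(sketch) – Building on the metric list above, one possible shape for a metric plugin, written so that plugins don't need to know about each other; the base class, registry, and names are invented, and the RMSD shown skips the alignment and heavy-atom handling that openff-benchmark actually does.

```python
from __future__ import annotations

import abc

import numpy as np

# A simple registry; in practice this would likely be backed by entry points
# so third-party packages can contribute metrics.
METRIC_REGISTRY: dict[str, type[Metric]] = {}


class Metric(abc.ABC):
    """Base class every metric plugin would subclass (names are illustrative)."""

    name: str

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        METRIC_REGISTRY[cls.name] = cls

    @abc.abstractmethod
    def compute(self, reference: np.ndarray, candidate: np.ndarray) -> float:
        ...


class RMSD(Metric):
    """Geometry comparison between reference (e.g. QM) and MM-optimized coordinates."""

    name = "rmsd"

    def compute(self, reference: np.ndarray, candidate: np.ndarray) -> float:
        # Assumes the conformers are already aligned.
        return float(np.sqrt(np.mean(np.sum((reference - candidate) ** 2, axis=-1))))


class EnergyRMSE(Metric):
    """Relative-energy comparison across a set of conformers."""

    name = "energy-rmse"

    def compute(self, reference: np.ndarray, candidate: np.ndarray) -> float:
        return float(np.sqrt(np.mean((reference - candidate) ** 2)))
```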
For each target, some way to specify datasets (a dataset-source sketch follows this list):
For QC datasets,
Load a whole dataset verbatim from QCA
Load a whole dataset from QCA and then apply qcsubmit filters
Load from local files
(Optional) Submit calculations to be run (superseding infra from openff-benchmark)
For phys prop datasets
Read in from CSV
Read in from a pandas DataFrame
Read from remote archive (examples from previous work?)
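(sketch) – Hypothetical dataset-source plugins covering the two families above; the class names are invented, the QCArchive loader is deliberately left unimplemented so this doesn't pin a specific openff-qcsubmit/qcportal API version, and the physical-property path just leans on pandas.

```python
from dataclasses import dataclass, field

import pandas


@dataclass
class QCArchiveSource:
    """A QCArchive dataset, loaded verbatim or after applying qcsubmit filters."""

    dataset_name: str
    filters: list = field(default_factory=list)

    def load(self):
        # In practice this would pull a result collection from QCArchive via
        # openff-qcsubmit and then apply self.filters; omitted here so the
        # sketch does not pin a specific client API.
        raise NotImplementedError


@dataclass
class PhysicalPropertySource:
    """Physical-property data read from a local CSV (or any pandas-readable source)."""

    path: str

    def load(self) -> pandas.DataFrame:
        return pandas.read_csv(self.path)
```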
For computation backends:
Instructions on how to set up openff-evaluator for distributed compute, ideally with a small example case like “get this two-simulation job working on two GPUs, and then your environment + config file should be ready to go” (sketched below). This should be aimed at INTERNAL people, so it can be a bit sparse/require some hacking, but it should act like a quick-start guide for a lab joining OpenFF to get Evaluator running.
A docs link to this page from the openff-benchmark material showing how to set up QCFractal for distributed compute in a variety of contexts
(something something submit to F@H something something, details forthcoming)
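(sketch) – A minimal take on the “two simulations on two GPUs” smoke test, assuming openff-evaluator's Dask local backend; exact class names and arguments may differ between Evaluator versions, so treat this as a starting point rather than setup docs.

```python
from openff.evaluator.backends import ComputeResources
from openff.evaluator.backends.dask import DaskLocalCluster
from openff.evaluator.server import EvaluatorServer

# One CUDA GPU per worker; two workers gives the "two simulations at once" test.
resources = ComputeResources(
    number_of_threads=1,
    number_of_gpus=1,
    preferred_gpu_toolkit=ComputeResources.GPUToolkit.CUDA,
)

calculation_backend = DaskLocalCluster(
    number_of_workers=2, resources_per_worker=resources
)
calculation_backend.start()

# Clients (e.g. an EvaluatorClient submitting a two-property dataset) connect to
# this server; if both estimates run concurrently, the environment is ready.
server = EvaluatorServer(calculation_backend=calculation_backend)
server.start(asynchronous=False)
```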
MT – Intended audience/maintenance budget?
MT – Would be against opening this to industry/making it a “public facing” package.
JW – I think anyone should be able to run the “vanilla” benchmarks, even external people (after doing the work to set up evaluator). But we wouldn’t offer support in helping make plugins/make plugins on-demand.
LW – I would gear this for internal/FF team use, and not worry too much about external usage.
MT – I’d like this to be public-facing to some extent, but support-wise we’d only really offer support for our FF team. But like I don’t want to rewrite the QCFractal deployment docs.
JW – I think the current QCF deployment docs by Dotson are perfect, and shouldn’t be rewritten. But we’ll want to provide similar docs for setting up evaluator.
LW – Thoughts on recharge?
MT – I want to make sure that I’m not doing contract work for companies - I see it as my job to ensure that OpenFF internal people are empowered to do FF fitting.
LW – I wouldn’t want to extend an implication of direct support to industry on this.
JW – I’d see a few definitions of support here:
Deployment support (“I’m having trouble running the quickstart guide” → either fix it or direct them to their appropriate technical person)
New feature support
Write me a plugin –> No
Help me write a plugin → No
Help me debug my plugin → Maybe
MT – It’s hard to enforce this sort of limitation in practice. I’ve seen this in our work with OpenEye, and we’ll want to be very deliberate to avoid it.
How should we specify FF/parameterization inputs?
MT – How can we balance getting an MVP for OpenFF use (“can we tell if a 2.1.0 candidate is better than 2.0.0?”) with possible future needs to compare to external FFs?
LW – I’m not sure that I see why we need to plan for this - we could just scope it to OpenFF force fields initially, and then convert that into a plugin interface and swap in different parameter assignment methods as we need to add them. We’d just need to make sure that the outputs don’t change when we make the parameter assignment step into a plugin.
MT – Not sure I understand; the ecosystem/our dependencies change so much that we’ll get slightly different results for benchmarks run at different times
LW – That’s OK, the benchmarks comparing different FFs can be run concurrently so they’re apples-to-apples.
(general) – Generally, caching the outputs of parameter assignment would be highly desirable, because then we could compare apples to apples on different days. But it would be complicated because of the hierarchical nature of the data (metric → dataset → force field → parameterized molecules)
LW – Andrew Abi-Mansour was working on MMIC, which would support this sort of arbitrary arrangement of steps, and could support our needs here.
JW – We architected the openff-benchmark season 1 project to scale easily with the number of steps, but it was hard to scale with the number of force fields and other things. So we’ll want to be very deliberate with the structure of this workflow, especially since we’ll now have the additional dimension of input formats (SDFs vs QCA molecules vs phys prop dataset CSVs).
MT – We could kinda handle this by having a number of “black boxes” going around that know how to apply parameters to a number of inputs. So like, each bit of info could carry a JSON blob describing its context. It’s important to remember that the output of this isn’t a single final version of the software, it’s a metric that we can use on an ongoing basis.
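(sketch) – Purely illustrative: one shape the “JSON blob describing its context” could take so that cached parameterization outputs can be matched apples-to-apples later; every field name and version number below is made up.

```python
import hashlib
import json

# Hypothetical context attached to a cached artifact (e.g. a parameterized
# molecule); field names and versions are examples only.
context = {
    "force_field": "openff-2.1.0.offxml",
    "dataset": {"type": "qcarchive", "name": "example-optimization-dataset"},
    "metric": "rmsd",
    "software": {
        "benchmarking-tool": "0.1.0",
        "openff-toolkit": "0.10.x",  # recorded for provenance, not pinned
    },
}

# A deterministic hash of the context makes a convenient cache key, so the same
# parameterized molecules can be reused across runs on different days.
cache_key = hashlib.sha256(json.dumps(context, sort_keys=True).encode()).hexdigest()
print(cache_key[:16])
```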
MT – I think we shouldn’t aim for indefinite backward compatibility. So we wouldn’t guarantee identical numerical outputs for the same inputs across different versions of the software; this could happen because, for example, a convergence criterion in an MM optimization changed.
JW – This seems natural - If folks need to reproduce things perfectly in the future, they can get pretty close by just installing old versions of the deps.
MT – But, like, we shouldn’t worry about incrementing plugin versions, even if that changes the numerical results. Also we shouldn’t make old plugins work with the new benchmarking code.
LW – I think we shouldn’t support plugins forever, and we should focus our support on internal team members. We should target a level of user who understands that using an old plugin may require using an old version of the benchmarking software.
JW – This could mean that the results from our public statements could change, and we’ll need to report our version numbers/tell people to use older versions of things to reproduce our results.
LW – That’s totally fine. Though I expect the conclusions to remain unchanged if they’re significant.
JW – Possible plugin interfaces (a spelled-out example follows this discussion):
Metrics (like RMSD and density)
For each metric, the datasets (like “QCA dataset XYZ” and “phys_prop.csv”)
For each dataset, force fields to compare (like “openff-2.1.0.offxml” and “gaff-2.11”)
MT – The reason to create an interface is so that something can be reused. The thing here is that multiple metrics could pull from the same QCA dataset. I’m not sure…
LW – I was thinking that an FF plugin would have different inputs, to handle reading OFFXML folders differently from GRO folders.
MT – So there’d be a set of FF plugins - one for SMIRNOFF, one for AMBER, something like that?
LW – yes.
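(sketch) – The hierarchy JW listed above, written out as plain data for concreteness; the keys are examples only, and MT’s reuse point suggests the nesting could equally be inverted so a QCA dataset is pulled once and shared by several metrics.

```python
# Hypothetical benchmark specification: metric -> datasets -> force fields.
benchmark_spec = {
    "rmsd": {
        "qca:example-optimization-dataset": ["openff-2.1.0.offxml", "gaff-2.11"],
    },
    "density": {
        "phys_prop.csv": ["openff-2.1.0.offxml", "gaff-2.11"],
    },
}

# Inverted alternative (dataset -> metrics -> force fields), which avoids
# pulling the same QCA dataset once per metric.
dataset_first_spec = {
    "qca:example-optimization-dataset": {
        "metrics": ["rmsd", "energy-rmse"],
        "force_fields": ["openff-2.1.0.offxml", "gaff-2.11"],
    },
}
```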
(general) – There’s an important black box that we need to define: it eats datasets and produces serialized Interchange objects (sketch below)
MT – This is an implementation detail, and there are other ways that this could be done.
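(sketch) – One hedged guess at what that black box could look like today, using openff-toolkit and openff-interchange; import paths and especially the JSON serialization call vary between versions (and “openff-2.0.0.offxml” assumes the released force fields are installed), so this illustrates the idea rather than a supported pathway.

```python
from openff.toolkit import ForceField, Molecule

from openff.interchange import Interchange

# A single molecule stands in for "a dataset"; the real black box would loop
# over whatever the dataset plugin hands it.
molecule = Molecule.from_smiles("CCO")
molecule.generate_conformers(n_conformers=1)

force_field = ForceField("openff-2.0.0.offxml")
interchange = Interchange.from_smirnoff(
    force_field=force_field, topology=molecule.to_topology()
)

# Serialize to disk so downstream metric plugins can reuse the parameterized
# system without re-running parameter assignment (assumes the Interchange
# handlers all support JSON round-tripping).
with open("CCO-openff-2.0.0.json", "w") as handle:
    handle.write(interchange.json())
```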
Next steps
LW – I’m not sure that this captures my thoughts for how this would be structured - I’d like to make a diagram to show my thoughts more clearly.
MT – I’d prefer to frame this as “definition of inputs” and “definition of outputs”. I can take it from that, and then I can move quickly with a minimum of internal details that I need to work around.
LW – I’m mostly interested in being able to write and run custom metrics.
JW – Mostly worried about this being brittle/inextensible, but I think you both are good enough developers to ensure this does its job well.
We will meet at the same time next week to discuss how to seed a project; LW will present an overview of her thoughts, and JW will talk less.