2021-06-07 Core developers Meeting notes

Participants

  • @Jeffrey Wagner

  • @Pavan Behara

  • @Andrew Dalke (Deactivated)

  • @Iván Pulido

  • @David Dotson

  • @Matt Thompson

  • @Simon Boothroyd

Discussion topics

Item

Notes

Updates

  • IP

    • Working on extracting standard residue substructures from chemical components dictionary CIF file, and applying residue/atom names to cheminformatics representations of molecules.

    • Started interfacing with atom metadata dictionary, and populating it with the info from the substructures.

      • AD – Which tools do you use to make the substructure SMARTS patterns?

      • IP – Use OpenFF Toolkit features. There are already tools that can do the SMARTS matching. Once we have the SMARTS + atom types, we serialize them to JSON.

      • AD – Specifically, for a SMARTS that matches a proline residue, how is that generated?

      • IP – Starting with CCD, we parse the entries and turn them into SMARTS. There’s a little bit of messing around with protonation states, Kekulé forms, and leaving atoms. So we sometimes need to change things like explicit bond orders to be “any order” if there are many Kekulé forms.

      • AD – I’ve had trouble with SMARTS patterns, where patterns can be ambiguous if they don’t have explicit Hs and numbers of connections.

      • JW – Could change the SMARTS generation code to specify the exact number of bonds and the formal charge of each atom.
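
One way to read JW's suggestion, as a minimal sketch: every atom in a generated pattern carries its atomic number, total connection count (X), hydrogen count (H), and formal charge. The helper `atom_smarts` and its argument names are hypothetical and are not part of the OpenFF Toolkit; only the SMARTS primitives themselves (`#n`, `Xn`, `Hn`, `+n`/`-n`, `:n`) are standard.

```python
def atom_smarts(atomic_number, connections, hydrogens, formal_charge, map_index=None):
    """Build a fully specified SMARTS atom primitive (hypothetical helper)."""
    # Always write the charge explicitly, including +0 for neutral atoms.
    charge = f"+{formal_charge}" if formal_charge >= 0 else f"-{abs(formal_charge)}"
    # Optional atom map index, e.g. for tagging the atom a name applies to.
    label = f":{map_index}" if map_index is not None else ""
    return f"[#{atomic_number}X{connections}H{hydrogens}{charge}{label}]"


# Amide nitrogen with three connections, one H, neutral, tagged as atom 1.
amide_n = atom_smarts(7, 3, 1, 0, map_index=1)
# Carboxylate oxygen with one connection, no H, charge -1.
carboxylate_o = atom_smarts(8, 1, 0, -1)
print(amide_n, carboxylate_o)
```

Spelling out X, H, and charge on every atom removes the ambiguity AD describes, at the cost of needing separate patterns for each protonation state.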

  • AD

    • Learning about toolkit API. Working on some parallel tasks:

      • MiniDrugBank is being used for a lot of tests. I’m trying to figure out whether we need to be using all of its molecules. There are a few ways we could reduce them. One way is to determine whether the different toolkits perceive the same input as different molecules. We also want to find all the unique outcomes of reading these molecules. So we can define features based on whether both toolkits interpret the input the same way, and also look at the code coverage for the process of reading each molecule. Then each molecule can be assigned features like “RDKit raises this warning” or “this runs the following lines of code in the toolkit wrappers”. Then we can use a solver library called z3 to find a minimal set of molecules that exercises all the unique paths in the codebase.

      • DD – I saw some discussion about size of the repos and the committing of test files. Would it make more sense to have them in a separate repo?

      • AD – We haven’t discussed this yet. Shallow copies might be a solution to large repo size.

      • AD – I’ve been thinking about the time required to run the tests. It takes 4 minutes just to run test toolkits. Also, we frequently have to wait a long time for CI to run. So we could do things like using zip files instead of tar.gz, which could cut down on molecule reading time. However, this would add to the repo size.

      • JW – I’m interested in reducing the scope of our test molecules, though I don’t want our git repo size to increase. So I’d like to couple this with a history cleanup.

    • AD – Are there other datasets that would be useful for me to include in this analysis?

      • IP – We could use the CCD as a source of diverse inputs

      • SB – The enamine set, NCI diversity set would be good.

      • AD – I’ve been using the ChEBI set.

      • JW – I’m also interested in different cases of molecule generation, like spec-breaking input molecules.

      • AD – Let’s discuss this later

    • IP – What features are you looking to minimize from the input sets?

      • AD – Right now it’s to answer an engineering question. For example, this should help us catch whether RDKit and OpenEye see the same number of Hs on the same input atom. Other times, the ToolkitWrappers will perceive different stereochemistry. Other features include code coverage, where each molecule records which lines of source code it enables. In the future, we can manually add features, like “ensure we include all the residue types”
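
The reduction AD describes is a set-cover problem: each molecule covers a set of features, and the goal is a small subset that covers all of them. The notes mention z3 for this; the sketch below swaps in a plain greedy approximation instead (not guaranteed to be minimal) so that it runs with only the standard library. The molecule names and feature labels are made-up placeholders.

```python
def greedy_cover(features_by_molecule):
    """Pick molecules until every observed feature is covered (greedy, not optimal)."""
    uncovered = set().union(*features_by_molecule.values())
    chosen = []
    while uncovered:
        # Take the molecule covering the most still-uncovered features;
        # iterating over sorted names makes tie-breaking deterministic.
        best = max(
            sorted(features_by_molecule),
            key=lambda name: len(features_by_molecule[name] & uncovered),
        )
        chosen.append(best)
        uncovered -= features_by_molecule[best]
    return chosen


# Placeholder features: toolkit warnings, perception mismatches, covered lines.
features = {
    "mol_a": {"rdkit_warning_1", "wrapper_line_10"},
    "mol_b": {"wrapper_line_10"},
    "mol_c": {"stereo_mismatch", "wrapper_line_22"},
    "mol_d": {"wrapper_line_22"},
}
subset = greedy_cover(features)
print(subset)
```

Manually added features, such as “ensure we include all the residue types”, would simply be extra labels in each molecule's feature set.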

  • SB

    • Not much software.

    • Worked with MT on how electrostatics are handled. So now the System’s ElectrostaticsHandler consumes the partial charge assignment handlers.

  • MT

    • A lot of work on the electrostatics stuff above. This was a sizeable refactor, and it uncovered other problems that needed to be resolved. Ended up changing a lot of code, but now things run a bit more elegantly.

    • Now there’s a nicer mapping between toolkit handlers and system handlers. Many toolkit handlers now map to the System’s ElectrostaticsHandler, though there is still some messiness around constraints and bonds needing to know about each other.

    • Started splitting out OMM NonbondedForce generated by System into 4 separate forces. I’m not keen on porting this to the toolkit.

    • Openff-utilities and openff-units are both now released on conda-forge. Now can mostly do roundtrips to/from simtk units, though this may not work in all cases. Seems to be sufficient for Jeff S.

    • Worked on converting charge_from_molecules to a librarycharge-based approach.

    • Working with Trevor on some toolkit PRs about virtualsites and parameter vectorization. Both PRs that I’m reviewing seem to be in good shape.

    • SB – Re: deprecating charge_from_molecules – I don’t think this will work because of the size of SMARTS/substructure matches.

      • JW – We should look more closely at this. The current charge_from_molecules list does do a substructure match, but it’s somehow doing it much faster than the SMARTS matching in the current librarycharges-based approach.

      • SB – We should avoid deprecating charge_from_molecules until this runtime issue is resolved.

  • DD

    • Have 6/10 partners with public results submitted. Planning to have the next partner call as soon as Gary can schedule it, will probably be 2+ weeks out.

    • Working with Lorenzo on some new analysis that B Swope and X Lucas wanted. Will also work on advancing the executors in openff-geopt.

    • Re-rolled industry public benchmark set to include corrected molecules from Merck. New set is “version 1.1”. Will consult with JH to ensure that the outdated molecules are excised, but this is more for bookkeeping, since the new molecules have already been submitted.

    • On QCArchive front, we’re implementing support for doing many restarts at once, as opposed to having to do them one at a time. This will help with GH actions runtime. Still need to test this for tag/priority control using labels. Could allow job control using regexes

    • Spent time working on QC client-side logic, including things like caching. Probably won’t be available immediately – will need to wait until next release. Also looking to remove direct use of pydantic objects in the API; instead, they will live in the data model, hidden below an API layer.

    • On PLBenchmarks, did an initial round of design review. Got some good feedback, eg SB recommended adding a web API. Will be refining this and doing another round soon.

  • JW

    • Working with AD to define the current state of our toolkit wrappers and make a safe way to locate ourselves and navigate through “cheminformatics toolkitwrapper space”.

    • Working with Ivan on biopolymer support; now have caching that speeds up SMARTS matching for large molecule numbers

      • if we expose some of this as public methods, could solve other folks' problems as well

      • some of the methods used for caching (e.g. hashing (order dependent, connection table)) are more broadly usable
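
A minimal sketch of the caching idea, assuming the cache key is a hash of the atom-ordered connection table. The function names and the tuple layout are illustrative only and do not reflect the toolkit's actual implementation. Because the hash is order dependent, the same molecule with permuted atom indices gets a different key, which is acceptable when the cached match indices are themselves order dependent.

```python
import hashlib


def connection_table_key(atomic_numbers, bonds):
    """Hash atoms (in their given order) plus normalized bonds into a cache key."""
    # bonds: iterable of (atom_index_i, atom_index_j, bond_order)
    normalized = sorted((min(i, j), max(i, j), order) for i, j, order in bonds)
    payload = repr((tuple(atomic_numbers), tuple(normalized)))
    return hashlib.sha256(payload.encode()).hexdigest()


_match_cache = {}


def cached_matches(atomic_numbers, bonds, find_matches):
    """Run find_matches only the first time a given connection table is seen."""
    key = connection_table_key(atomic_numbers, bonds)
    if key not in _match_cache:
        _match_cache[key] = find_matches(atomic_numbers, bonds)
    return _match_cache[key]
```

For a system with many copies of the same molecule, the expensive `find_matches` callable (a stand-in for SMARTS matching) would run once per unique connection table rather than once per copy.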

    • Also added an atom metadata dictionary

      • can strictly have keys that are strings, values that are strings, integers, floats

      • were originally going to do this with Pydantic, but not sure about our desired behavior yet, so premature to do this without much benefit
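
The key/value restriction above can be sketched without Pydantic as a plain validation function. The name `validate_atom_metadata` and the example keys are hypothetical, and rejecting `bool` values is an added assumption (in Python, `bool` is a subclass of `int`), not something decided in the meeting.

```python
ALLOWED_VALUE_TYPES = (str, int, float)


def validate_atom_metadata(metadata):
    """Check that keys are str and values are str, int, or float."""
    for key, value in metadata.items():
        if not isinstance(key, str):
            raise TypeError(f"metadata keys must be str, got {type(key).__name__}")
        # Assumption: booleans are rejected even though bool subclasses int.
        if isinstance(value, bool) or not isinstance(value, ALLOWED_VALUE_TYPES):
            raise TypeError(
                f"metadata value for {key!r} has unsupported type {type(value).__name__}"
            )
    return metadata


# Example keys are placeholders for the kind of info taken from substructures.
metadata = validate_atom_metadata(
    {"residue_name": "PRO", "residue_number": 12, "b_factor": 17.3}
)
print(metadata)
```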

    • Worked with Josh Mitchell to get two new examples up

      • would like to replace our 12 notebooks with 4 high-quality, maintainable notebooks

        • existing notebooks aren’t clear cookbooks or showcases

      • check out PRs to see status:

    • Working on keynote workshop talk as primary concern this week

  • PB

    • Mostly debugging sulfonamide parameters. Trying to find an example where I start with good initial parameters, but wind up with bad values. So far I can only get bad final values if I start with bad initial values. Doing some experiments on sulfur, but it’s still inconclusive. Other than having a better set of initial values, I can’t get it to converge.

    • Including dihedral RMSDs in the objective function can make it do better, but I still don’t see it converging to the expected values. It’s a bit complicated with restraints in these cases.

    • For QM theory benchmarking, I’m working on some analysis scripts to help with the study. Some of the datasets are incomplete. JH and DD are working on this. After the first round of analysis, I’ll also check into how subsequent datasets could be more informative.

    • JW – Just FYI, we might ask you to talk for ~15 minutes about your experience with FF debugging in one of the workshop followups.

  • JW – SB, should we prioritize a release now that the vsites PR is merged?

    • SB – Let’s check with TG on that.

  • MT – Which next major objective to work on?

    • Still a big future feature list for system roadmap, not clear which ones are highest-priority.

    • It’s been helpful when SB has specific needs and I can build this out

    • Could keep working on tightening energy tests, but I don’t think this is the most useful thing to be doing.

    • Setting up protein-ligand simulations

      • MT – This is waiting on biopolymer infrastructure. It’s not clear when I should start building towards a particular state of that.

    • Fitting infrastructure – SB and MT already worked together a lot on this, can replace a lot of our use of FB with some of the new progress.

      • SB – Next step might be parity with toolkit, where the major headache will be vsites. Then, after that, removing duplicated code from the toolkit and putting it in the system.

      • MT – I could start scoping out the architecture for vsites, though I’d need to study the smirnoff spec a bit. Also not sure exactly where vsite code would finally live.

      • JW – Now that we have some working system code and an understanding about the architecture/constraints, we could revisit the question of where vsites “live” in the system.

      • MT – Agree. This is something we can start looking into.

      • SB – I’d be happy to engage in discussions on this. Could also use input on how to extend the SMIRNOFF spec to better handle plugins/custom functional forms.

      • SB – I’m curious to see how various simulation engines handle vsites – Are they all treated as particles? Or are some just assigned parameters/other representations? Could be a good topic for a survey.

        • MT – I’ll do that.

    • SB – Also interested in making the system object a “first class citizen” in OpenFF infrastructure. This would be both in terms of exposure in the toolkit API and in publicity. Specifically, it should be the answer when someone asks “if I want to do X, which tools should I use for this?”, for example “if I want the assigned charges, how do I get them?” Right now the answer is “get them from the OpenMM system”, and we’d like to change this to “get them from the OpenFF system”.

      • MT – I think there’s also some meaning that we communicate with our packaging dependencies and import paths.

    • MT – How do we want to roll out the system object?

      • JW – It will be partly “how do we publicize/tell people to use it?” and “when is it in a ‘production-ready’ state?”

      • MT – On the second point, this will be hard to define. A lot of our current ecosystem has major known bugs and isn’t what would broadly be defined as “production ready”, but it is in production. How do we want to deprecate use of omm systems and parmed structures?

      • JW –

      • SB – I’m envisioning the toolkit removing all of its direct calls to make/manipulate openmm systems and moving that code to the openff system. Then, once we’ve done internal testing, we can start publicizing and documenting it.

      • SB – When defining “production ready”, there are a few things I want to see:

        • If I take the NCI list, I’d expect the old and new system creation pathways to make openmm systems that give identical energies. This would use an unconstrained FF.

        • For condensed-phase systems, boxes of mols from the NCI list should give identical energies. It doesn’t need to be a huge number of systems; just a few would be good to test. This would use a constrained FF.

      • MT – Thanks, this is the feedback I’m interested in. Would it also be worth doing the same with a solvated protein and/or protein-ligand system?

        • SB – Yes, that would be a good goal.

      • MT – Would we be concerned if the initial create_openff_system doesn’t handle vsites or WBO stuff?

        • SB – It’d be good to keep a list of this. For the torsion WBOs we could aim to put that in before the release.

        • SB – Also, custom parameterhandlers and potentialhandlers will be tricky to expose. But we can probably start steering users toward the openff system even without having these.

      • MT – Agreed.

      • JW – For the NCI energy equivalence tests, it would be good to record these energies to create a set of tuples of (input_mol, input_ff, expected_energy). This reference energy set would be useful in many areas.

      • MT – SB, you’ve already done the 250k NCI compound test. Should I take over this effort from you?

      • SB – Yes. If we pre-compute charges then this should be doable on a few hundred cores.

  • Decision – We should start putting the System into production, first by doing energy tests to ensure it’s on par with our current behavior.
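
The energy-equivalence check discussed above can be sketched against JW's proposed reference set of (input_mol, input_ff, expected_energy) tuples. Everything concrete here is an assumption for illustration: the function name, the tolerance, the placeholder force field label, and the energy values are made up, and `compute_energy` stands in for whichever system creation pathway is under test.

```python
import math


def compare_energies(reference, compute_energy, abs_tol=1e-6):
    """Return the reference entries whose recomputed energy disagrees."""
    failures = []
    for input_mol, input_ff, expected_energy in reference:
        energy = compute_energy(input_mol, input_ff)
        # Absolute tolerance only; the right threshold and units are open questions.
        if not math.isclose(energy, expected_energy, rel_tol=0.0, abs_tol=abs_tol):
            failures.append((input_mol, input_ff, expected_energy, energy))
    return failures


# Made-up reference energies for two SMILES inputs and a placeholder force field.
reference = [
    ("CCO", "ff_placeholder", 12.5),
    ("c1ccccc1", "ff_placeholder", -3.25),
]


def new_pathway(input_mol, input_ff):
    """Stand-in for the new system creation pathway (returns canned values)."""
    return {"CCO": 12.5, "c1ccccc1": -3.0}[input_mol]


failures = compare_energies(reference, new_pathway)
print(failures)
```

Storing the reference tuples once would let the same comparison be rerun after each refactor, for both the unconstrained single-molecule case and the constrained condensed-phase case.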





Action items

Decisions