2023-06-05 Internal Benchmarking Meeting notes

 Date

Jun 5, 2023

 Participants

  • @Jeffrey Wagner

  • @Lily Wang

  • @Matt Thompson

 Discussion topics

Notes


  • MT – https://github.com/openforcefield/openff-qcsubmit/issues/216 is blocking me from downloading certain entries. This started happening in Toolkit 0.11, so a workaround is to load the entries using an earlier version of the OFFTK, serialize them to JSON, and then load that JSON with the new toolkit (a sketch of this workaround follows below).

    • JW – I’ll take over this one (added to Trello)
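
A minimal sketch of the workaround described above, assuming the affected entries can be rebuilt as openff.toolkit Molecule objects. The two halves run in different environments (older toolkit to write the JSON, newer toolkit to read it); the molecule and file name are stand-ins, since the exact failing entries aren’t listed in these notes.

```python
from openff.toolkit.topology import Molecule

# --- Environment A: older OpenFF Toolkit (pre-0.11) ---
# Load the entry that the newer toolkit chokes on and serialize it to JSON.
molecule = Molecule.from_smiles("CCO")  # stand-in for a problematic entry
with open("entry.json", "w") as f:
    f.write(molecule.to_json())

# --- Environment B: newer OpenFF Toolkit (0.11+) ---
# Deserialize the JSON instead of re-parsing the original entry.
with open("entry.json") as f:
    molecule = Molecule.from_json(f.read())
```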

  • MT – How do people who do data engineering seriously handle large datasets? People sometimes use the word “database” to mean an actual database, and sometimes to mean something simpler. Are there other ways that we commonly serialize things? I’d love to learn some modern best practices.

    • LW – I’ve been looking into this, since I’m hitting some limits on lilac and such. SB and I have each rolled our own solutions, essentially our own little databases. These are helpful because they avoid the need to keep everything in memory. We’ve used SQL for one-entry-per-molecule storage, and Arrow-style formats (like PyArrow) for one-row-per-model storage. PyArrow is somewhat familiar and pandas-like, though SQL files compress better on disk. I definitely think that databases are the way to go for working with large datasets of molecules.

    • MT – While I think it’d be great for our infra to have a common database format, I was thinking here about the internal storage for one benchmarking run. I don’t love the design pattern where our benchmarking workflows read and write a ton of SDF files. So I’ll try SQL first, but I may also check out PyArrow to see how that goes.

    • MT – Does Pyarrow, at some level, pull the entire db from a file on disk?

      • LW – I like it because it’s memory-mapped, so you can load the file one chunk at a time. You can also load just one column, for example, and index by that (see the storage sketch at the end of this topic).

    • JW – Agree that internal storage should be something efficient like SQL. But it’ll be helpful to have dump-able SDFs for debugging and plugin compatibility.

    • MT – LW, which format do you store in the DB?

      • LW – Mapped SMILES and conformers + other properties

        • I usually implement an Interchange-like pipeline (SDF ↔︎ Molecule ↔︎ NAGL/other library ↔︎ DB) for debugging. It means multiple steps can be involved in dumping out to SDF, but it’s not too complex.

      • MT – My mental model of how this would be structured is “a database of molecules,” where some subset of the columns tells you everything you need for the next step. So these will be dumpable, but most things will be user-defined functions that update/add info in the database.

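A minimal sketch of the Arrow-style storage discussed above (one row per molecule, with the mapped SMILES and a flattened conformer as columns), using PyArrow’s memory-mapped IPC/Feather files so that only the requested columns are pulled into memory. The file name, schema, and example molecule are illustrative, not a proposal for the actual benchmarking schema.

```python
import pyarrow as pa
import pyarrow.feather as feather

# One row per molecule: mapped SMILES plus a flattened (n_atoms * 3) conformer.
table = pa.table(
    {
        "mapped_smiles": ["[C:1]([H:3])([H:4])([H:5])[O:2][H:6]"],
        "conformer": [[0.0] * (6 * 3)],  # placeholder coordinates for a 6-atom molecule
    }
)
# Uncompressed Feather (Arrow IPC) keeps reads zero-copy when the file is memory-mapped.
feather.write_feather(table, "molecules.arrow", compression="uncompressed")

# Memory-mapped read: load only the column needed for indexing/filtering.
with pa.memory_map("molecules.arrow") as source:
    loaded = pa.ipc.open_file(source).read_all()
smiles_column = loaded.column("mapped_smiles")
```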

  • MT – (Walkthrough/demo)

    • MT – To check whether things can be parameterized, I have to run AM1BCC. This can be time consuming.

      • JW – Could we run assign_partial_charges in can_parameterize, then serialize the charges out and load them in subsequent steps?

      • MT – I’d be concerned that other changes may happen that would invalidate the partial charges assigned early on.

        • LW + JW – Since the charges are determined only by the chemical graph, and the chemical graph won’t change through the workflow, this should be safe (a caching sketch follows this discussion).

        • MT – I’m not sure what users will do and this may be a dangerous thing to assume.

      • LW – Could allow user to skip charge checking?

    • LW + JW – Regarding config-style input, it seems like we’re half a step from a bespokefit-style pydantic factory/config setup. This would remove the need for argument checking and leave that up to validator functions.

      • MT – Sounds good.

    • JW – Would it be possible to store warnings/errors attached to their molecules in some way?

      • MT – I’m resistant to this. That could blow up storage size a lot. Also, I hope that the filters will keep things from blowing up.

      • JW – It’d be great to keep the stack trace of when a stage fails in the database, for debugging.

      • MT – That could make the database very very large.

      • LW – I see both sides here. Something ugly in the middle, like a giant log file, may be the ideal here. But the issue of per-molecule vs. per-conformer functionality/reporting is also hard.

      • JW – Maybe a different sort of database? Neo4j might handle these relationships more smoothly.

      • MT – I’ll keep my eyes open for more appropriate relational databases.

    • JW – I’d be in favor of defining the plugin spec to resolve questions like “do plugins raise a specific type of exception, or a Python exception at all?” and “can a step change the molecular graph?”

      • MT – The chemical graph change thing is just an example, but I’m using it more broadly to refer to unexpected complexity/user-defined behavior in plugins.

      • JW – This makes sense, but I’d like to make sure we do start ruling out some use cases; otherwise we’ll have a really hard time coming up with a concrete design.

      • MT – There should be constraints (like the interface must be python, etc)… but generally saying that there must be a specification for a plugin is somewhat counterproductive.

      • LW – Are you in favor of restricting what goes in and out of plugin workflow components, but not what happens inside of them?

      • MT – Kinda. Imagining the case where BSwope wants to make his own workflow component, he’d use a mostly built-in workflow but change one step. So with respect to interactions between built-in plugins and user plugins, there could be tension about what goes in and what comes out…

      • LW – Initially we’re pitching to an experienced audience where we don’t have to say that you can’t modify the chemical graph, and I like the flexibility.

      • MT – Agree

      • JW – Sounds good.
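
A sketch of the charge-reuse idea from the discussion above, assuming (per LW and JW) that the chemical graph does not change through the workflow: charges computed once during an early parameterization check are cached against the mapped SMILES and re-applied later instead of re-running AM1BCC. `charge_cache` and `get_am1bcc_charges` are illustrative names, not existing infrastructure.

```python
from openff.toolkit.topology import Molecule

# Maps mapped SMILES -> partial-charge array. The same mapped SMILES implies the same
# chemical graph *and* the same atom ordering, so re-applying cached charges is safe
# under the assumption discussed above.
charge_cache = {}

def get_am1bcc_charges(molecule: Molecule):
    """Return AM1BCC partial charges, running AM1BCC at most once per chemical graph."""
    key = molecule.to_smiles(mapped=True)
    if key not in charge_cache:
        molecule.assign_partial_charges("am1bcc")  # the time-consuming step
        charge_cache[key] = molecule.partial_charges
    return charge_cache[key]
```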

  • MT – Big datasets are slow to handle, both because they’re big and because we’re not optimized for speed.

    • Is there a smaller/cleaner dataset I could use? Or should I just slice through the industry benchmarking set?

      • LW – There are smaller datasets, but slicing industry set is probably best.

    • Are there points in/around our infrastructure that are heavily-used bottlenecks that we should optimize?

      • LW – from/to openeye/rdkit probably

        • rdkit especially has the stereochemistry check :(

        • JW – Agree. Maybe we can avoid some calls by not going to/from SDF

      • LW – The rdkit caching has been helpful, but I pickle the things that I want to go fast (see the sketch below). Overall, though, to/from rdkit has been my major problem, and I get around that using OE. The current caching/performance improvements are 90% of the way there, so I haven’t noticed it too much.
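
A minimal sketch of the “serialize the expensive-to-parse things once” approach LW mentions: pay the SDF parsing and stereochemistry-perception cost a single time, then reuse a pickled copy on later runs instead of round-tripping through SDF. The file names are illustrative, and a multi-molecule SDF is assumed.

```python
import pickle
from pathlib import Path

from openff.toolkit.topology import Molecule

cache_path = Path("molecules.pkl")

if cache_path.exists():
    # Fast path: rebuild Molecules from their cached dictionary representations.
    molecules = [Molecule.from_dict(d) for d in pickle.loads(cache_path.read_bytes())]
else:
    # Slow path: parse the SDF once (from_file returns a list for multi-molecule SDFs),
    # then cache a pickle-friendly representation for subsequent runs.
    molecules = Molecule.from_file("industry_benchmark.sdf")
    cache_path.write_bytes(pickle.dumps([m.to_dict() for m in molecules]))
```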

 

 Action items

 Decisions