Infrastructure Architecture

Infrastructure architecture planning and long-term decision making

Proposals

Contributor

Date

Proposal

Comments / Feedback

@Simon Boothroyd

Feb 12, 2020

Before I finalize the release of the re-branded OpenFF Evaluator framework and commit to the new API naming conventions, I wanted to suggest that we invest some time in cleaning up the software stack offered by OpenFF. While everything exists under the same GitHub org, there is almost no consistency between our packages. This will only get worse over time, and will only get harder to reverse as the user base expands. For example, we currently have

    from evaluator import ...
    from openforcefield import ...
    from cmiles import ...
    ...

while it would be much more cohesive to have an overall architecture similar to

    from openff.evaluator import ...
    from openff.toolkit import ...
    from openff.fractal import ...
    ...

In practice this seems obtainable through an implicit namespace file structure, i.e. native namespace packages (https://packaging.python.org/guides/packaging-namespace-packages/#native-namespace-packages), while still maintaining individual repositories. This style of architecture would lend itself to creating smaller, more focused repos / packages (closer to a set of software 'microservices').

I understand this would initially cause a large amount of disruption and possible confusion among users, but the end result would be a cohesive, elegant stack, with all the software we build connected and identifiable under the same umbrella. Moreover, I believe it would push us to build software that more rigidly follows a single-responsibility pattern, rather than monolithic packages that 'do everything', which is where the toolkit seems to be heading (especially if it simply absorbs things like fragmenter and the QC submission frameworks).

It would be fantastic to start moving away from a style similar to a zip file of disconnected tools, and to start planning longer term about how we want our software to look and be interacted with.
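As a rough illustration of the native namespace package layout proposed above, the sketch below builds two throwaway source trees that each contribute one subpackage to a shared namespace, then imports both. The namespace name `demo_openff_ns`, the subpackage names, and the version strings are placeholders chosen for this example (the real stack would use `openff`); nothing here reflects an actual repository layout.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Placeholder namespace name, chosen so it cannot clash with an installed package.
NAMESPACE = "demo_openff_ns"


def make_distribution(root: Path, subpackage: str, version: str) -> Path:
    """Create the source tree of one repository contributing one subpackage."""
    dist_root = root / f"{subpackage}-repo"
    package_dir = dist_root / NAMESPACE / subpackage
    package_dir.mkdir(parents=True)
    # Only the subpackage gets an __init__.py. The namespace directory must not
    # contain one, otherwise it stops being a native namespace package.
    (package_dir / "__init__.py").write_text(f"__version__ = {version!r}\n")
    return dist_root


workspace = Path(tempfile.mkdtemp())
for name, version in [("evaluator", "0.1.0"), ("toolkit", "0.2.0")]:
    # Putting each tree on sys.path stands in for installing each repository.
    sys.path.insert(0, str(make_distribution(workspace, name, version)))
importlib.invalidate_caches()

evaluator = importlib.import_module(f"{NAMESPACE}.evaluator")
toolkit = importlib.import_module(f"{NAMESPACE}.toolkit")
namespace = importlib.import_module(NAMESPACE)

print(evaluator.__version__, toolkit.__version__)
print(list(namespace.__path__))  # one entry per contributing source tree
```

Each repository stays independent, and the only shared convention is that none of them ships an `__init__.py` at the namespace level.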

Originally posted in Slack

@Karmen Condic-Jurkic: Matt Thompson could potentially take on this task.

@Jeffrey Wagner @Simon Boothroyd : We should plan to do this at the May hackathon – Let’s spec out exactly what the namespace will look like and which packages go where

@Jeffrey Wagner @Joshua Horton @Jaime Rodríguez-Guerra (Deactivated)

Feb 14, 2020

We should add an LRU cache to ToolkitWrapper (or ToolkitRegistry) to record the outputs of common time-consuming processes like to_smiles, compute_partial_charges, find_smarts_matches, and assign_partial_bond_orders. The cache would map all of the inputs to these functions (the molecule graph hash, conformers if applicable, and any other kwargs) to a cached result.

From @Jaime Rodríguez-Guerra (Deactivated) – Python library for this https://cachetools.readthedocs.io/en/stable/

Link to context: https://openforcefieldgroup.slack.com/archives/C8NE3J96U/p1581697458078600
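The caching idea above could look roughly like the following sketch. It uses only the standard library so that it is self-contained; the thread suggests cachetools, which could replace the hand-written eviction logic. `ToyMolecule`, `ToyToolkitWrapper`, `graph_key`, `conformer_key`, and `lru_cached_by_molecule` are all hypothetical names invented for this example, not part of the toolkit's API, and the sketch assumes every keyword argument is hashable.

```python
from collections import OrderedDict
from functools import wraps


def lru_cached_by_molecule(maxsize=128):
    """Cache results keyed on (wrapper type, graph key, conformers, kwargs)."""

    def decorator(func):
        cache = OrderedDict()

        @wraps(func)
        def wrapper(self, molecule, **kwargs):
            key = (
                type(self).__name__,
                molecule.graph_key(),
                molecule.conformer_key(),
                tuple(sorted(kwargs.items())),  # assumes hashable kwarg values
            )
            if key in cache:
                cache.move_to_end(key)  # mark as most recently used
                return cache[key]
            result = func(self, molecule, **kwargs)
            cache[key] = result
            if len(cache) > maxsize:
                cache.popitem(last=False)  # evict the least recently used entry
            return result

        wrapper.cache = cache
        return wrapper

    return decorator


class ToyMolecule:
    """Hypothetical stand-in for a molecule; not the OpenFF Molecule class."""

    def __init__(self, smiles, conformers=()):
        self.smiles = smiles
        self.conformers = tuple(tuple(xyz) for xyz in conformers)

    def graph_key(self):
        return self.smiles  # stand-in for a real molecule graph hash

    def conformer_key(self):
        return self.conformers


class ToyToolkitWrapper:
    calls = 0  # counts how often the expensive function body actually runs

    @lru_cached_by_molecule(maxsize=2)
    def compute_partial_charges(self, molecule, method="toy"):
        ToyToolkitWrapper.calls += 1
        return tuple(0.0 for _ in molecule.smiles)  # immutable, safe to share


toolkit_wrapper = ToyToolkitWrapper()
toolkit_wrapper.compute_partial_charges(ToyMolecule("CCO"), method="toy")
toolkit_wrapper.compute_partial_charges(ToyMolecule("CCO"), method="toy")
print(ToyToolkitWrapper.calls)
```

Returning immutable results matters here: a cached list handed to two callers could be modified by one and silently corrupt the other's data.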

 

@Jeffrey Wagner

 

The OFFTK’s reuse of Python built-in exceptions and its try/except logic for handling external toolkits are dangerous: they create ambiguity that can obscure the real source of a problem. For example, a fallback that catches a built-in exception type cannot tell an expected toolkit failure from a genuine bug that raises the same type. We should make a new file containing a new Exception hierarchy that inherits from Python’s Exception only at the very root of the tree, and is differentiated everywhere else.
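A minimal sketch of such a hierarchy is shown below, under the assumption that toolkit fallback logic catches only the dedicated classes. The class names (`ToolkitError`, `ToolkitUnavailableError`, `UnsupportedOperationError`) and the `call_with_fallback` helper are hypothetical and only illustrate the shape of the idea.

```python
class ToolkitError(Exception):
    """Hypothetical root of the hierarchy; the only class deriving from Exception."""


class ToolkitUnavailableError(ToolkitError):
    """The requested external toolkit is not installed or not licensed."""


class UnsupportedOperationError(ToolkitError):
    """The toolkit is available but cannot perform this operation."""


def call_with_fallback(backends, *args):
    """Try each backend in order, falling back only on expected toolkit errors."""
    failures = []
    for backend in backends:
        try:
            return backend(*args)
        except (ToolkitUnavailableError, UnsupportedOperationError) as error:
            failures.append(f"{backend.__name__}: {error}")
    raise UnsupportedOperationError("; ".join(failures))


def missing_backend(smiles):
    raise ToolkitUnavailableError("toolkit not installed")


def buggy_backend(smiles):
    return {}["charges"]  # a genuine bug: raises the built-in KeyError


def working_backend(smiles):
    return len(smiles)


# An expected failure triggers the fallback to the next backend.
print(call_with_fallback([missing_backend, working_backend], "CCO"))

# A real bug propagates instead of being silently absorbed by the fallback.
try:
    call_with_fallback([buggy_backend, working_backend], "CCO")
except KeyError:
    print("bug surfaced instead of being masked")
```

Because no class in the tree derives from a built-in such as ValueError or KeyError, an `except` clause written for toolkit errors can never swallow an unrelated programming error by accident.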

 

@Jeffrey Wagner

 

OFFTK 1.0 release major refactors

  • Remove aromaticity setters, wire up consistently-enforced aromaticity perceivers

  • Make our OWN stereo definitions, and make cheminformatics toolkits abide by them.

    • Note that, if we just naively “strip nitrogen stereochemistry”, then we might strip the features of a molecule that make OTHER atoms stereogenic, which will be complicated.

  • All charge-assignment methods become sub-entries of the Electrostatics tag, possibly the same for bespoke vdW parametrization

  • Resolve ToolkitWrapper.from_object's use of private FrozenMolecule._add_atom method

  • Sanitization/sanity checking in from_networkx? In other cheminformatics-toolkit-independent deserializations?

 

@Matt Thompson

Aug 28, 2020

We should establish some guidelines for deciding which dependencies are required and which are optional. The two major users of any of our software products are scientists who want to use our software to do science and CI bots that run test suites. Their needs clash: bots need to install everything to run the full test suites, but scientists may only use a small portion of the codebase to accomplish their tasks. Listing everything as a required dependency for all users is expensive (in computer and human time), since it bloats conda environments (at present and over time) and increases the likelihood of dependency issues as upstream maintainers break APIs and/or abandon projects. Fewer required dependencies also slightly reduces the maintenance burden, since fewer things can go wrong when building new releases.

For each package, we should aim to define a core set of use cases that must be supported “out of the box”. This helps clarify which dependencies qualify as required and, by deduction, which qualify as optional dependencies. For example, the OpenFF Toolkit needs OpenMM to export a topology and force field to a simulation; that should definitely be a required dependency (until a possible future date in which there are alternatives to consider). But functionality like molecule visualization (NGLview) and QCArchive interoperability use extra dependencies (nglview, qcelemental, etc.) and aren’t as likely to be used by most users, so could be moved to optional dependencies.
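One common way to keep a dependency optional is to import it only when the feature that needs it is actually used, and to fail with an actionable message otherwise. The sketch below assumes a hypothetical helper `import_optional` and a hypothetical exception `MissingOptionalDependencyError`; neither is an existing toolkit function, and the conda hint in the message is only an example.

```python
import importlib
import importlib.util


class MissingOptionalDependencyError(Exception):
    """Hypothetical error raised when an optional dependency is not installed."""


def import_optional(module_name, feature):
    """Import a top-level module on demand, or explain how to obtain it.

    Dotted names are not handled here: find_spec would first import the
    parent package and could raise if the parent itself is missing.
    """
    if importlib.util.find_spec(module_name) is None:
        raise MissingOptionalDependencyError(
            f"'{module_name}' is needed for {feature} but is not installed. "
            f"Try: conda install -c conda-forge {module_name}"
        )
    return importlib.import_module(module_name)


def visualize(molecule_name):
    """Example feature that only needs its extra dependency when called."""
    # 'json' stands in for an optional package such as nglview so that this
    # sketch runs anywhere.
    backend = import_optional("json", "molecule visualization")
    return backend.dumps({"molecule": molecule_name})


print(visualize("ethanol"))

try:
    import_optional("hypothetical_missing_dependency_xyz", "QCArchive interoperability")
except MissingOptionalDependencyError as error:
    print(error)
```

With this pattern, importing the package itself never fails because of a missing optional dependency; only the specific feature does, and the user is told what to install.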

There could be reasons to have more than two lists of dependencies for each package. Thinking about them as concentric circles, a package might have a “core” set of dependencies that are always needed, a second circle that includes the core plus dependencies that are not strictly necessary but commonly used, and a third, larger circle that encompasses everything. It’s simplest to think about it as two lists/circles, but it can take other shapes.
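The concentric-circle picture can be written down as cumulative sets, where each tier contains everything in the tier inside it. The tier names and the assignment of packages to tiers below are assumptions made for illustration (only OpenMM as required, and nglview and qcelemental as optional, come from the discussion above; `pytest` is added as an example of a test-only dependency).

```python
# Dependencies newly introduced by each tier, innermost circle first.
# The package-to-tier assignment is illustrative, not a decision.
TIER_ADDITIONS = [
    ("core", {"openmm"}),
    ("common", {"nglview"}),
    ("full", {"qcelemental", "pytest"}),
]


def build_tiers(additions):
    """Return a dict mapping tier name to the cumulative set of dependencies."""
    tiers = {}
    accumulated = set()
    for name, packages in additions:
        accumulated = accumulated | packages
        tiers[name] = frozenset(accumulated)
    return tiers


TIERS = build_tiers(TIER_ADDITIONS)
for name, packages in TIERS.items():
    print(name, sorted(packages))
```

Scientists would install the smallest tier that covers their use case, while CI bots would install the outermost one.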

Another gray area that’s unclear to me is which examples should be runnable “out of the box,” i.e. with nothing but a conda one-liner. It may be appropriate to deal with this at the level of each example; if we can keep the required dependencies of a package light, some examples may have as their first cell “run these conda commands to install these other packages.”