May 2020 Developers Week Notes
Date
May 11, 2020
Participants
@Jeffrey Wagner
@David Dotson
@Simon Boothroyd
Tobias Huefner
@Josh Fass (Deactivated)
@Hyesu Jang
@Jessica Maat (Deactivated)
@Jaime Rodríguez-Guerra (Deactivated)
@David Hahn
@Joshua Horton
@Matt Thompson
@Jeffry Setiadi
Goals
Discussion topics
Notes |
---|
Virtual devs week organization
Day 1:
Day 2:
|
Round table updates
Jaime --
David Hahn --
Josh Horton --
Josh Fass --
Simon Boothroyd
Matt Thompson
David Dotson
Hyesu Jang
Jeff Wagner
Jeff Setiadi
Jessica Maat
Tobias Huefner
|
Development practices
(Day 1)
SB -- When writing code fast with Owen, we didn’t do a lot of testing. I’d open PRs and not merge without reviews, and I think that was the right way to go. Other repos, especially data-focused ones, have a big need for a quality version history, so we need to handle it on a case-by-case basis.
HJ -- Used a single Jupyter notebook for generation of datasets; used PRs for QCA dataset submission.
SB -- The kind of PRs we were doing for the data curation / choices etc.: https://github.com/openforcefield/nistdataselection/projects/1
JW -- Justification for the QCA submission approach: we needed a way to document and keep track of what we did, and to be able to evolve the approach over time. We can later try to capture the best approach.
JH -- Took a few rounds to figure out the pattern in QCA submission. Information was all spread out and needed to be synthesized.
JM -- May be useful to have scripts we all use, e.g. common functions for generating the JSON.
JW -- QCSubmit should be able to capture many of the lessons we’ve learned. Are there specific things we can list that we’ve learned from this?
SB -- We have a reasonable handle on what our workflow looks like; at this early stage, scripts are okay.
HJ -- We should sit down and think about the schema first, then consider automation. We are still experimenting with the schema; this may be blocking progress.
MT -- A bar of perfect would hamper progress; go for addressing current needs, and evolve with time. If we want to write code that depends on tests that rely on bond orders existing, we have to go back and find where they were calculated and where they were not.
SB -- The shape of the data changes depending on the algorithm, so the input schema is hard to define; output schemas are easier to define. Defining data models is inherently difficult, especially when the shape of the data is constantly changing. This is probably where software scientists should step in and define versioned schemas at various points, while allowing the science to keep its flexibility in bushwhacking.
JW -- QCSubmit: it would be very informative to have a walkthrough of what it does and does not do and how it does it, and go from there on what gaps need addressing.
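As a rough illustration of the "versioned schemas" idea from the discussion above, the sketch below tags each record with a schema version and upgrades older records on load. Everything here is hypothetical (the `DatasetEntry` fields, the version numbers, the `upgrade` helper); it is not how QCSubmit or qca-dataset-submission work. Bond orders absent from an old record are stored as `None`, so "never calculated" stays distinguishable from "calculated".

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional

SCHEMA_VERSION = 2  # hypothetical current version of the record layout


@dataclass
class DatasetEntry:
    """Hypothetical record; field names are illustrative only."""

    name: str
    cmiles: str
    # None means "bond orders were never calculated for this record".
    bond_orders: Optional[List[float]] = None
    schema_version: int = SCHEMA_VERSION


def upgrade(raw: dict) -> dict:
    """Bring a raw record from an older schema version up to the current one."""
    version = raw.get("schema_version", 1)  # records without a tag are treated as v1
    if version == 1:
        # Assumed v1 layout: no bond_orders field at all.
        raw = {**raw, "bond_orders": None, "schema_version": 2}
        version = 2
    if version != SCHEMA_VERSION:
        raise ValueError(f"unsupported schema_version: {version}")
    return raw


def load_entry(text: str) -> DatasetEntry:
    """Parse JSON text, upgrade it if needed, and build a DatasetEntry."""
    return DatasetEntry(**upgrade(json.loads(text)))


def dump_entry(entry: DatasetEntry) -> str:
    """Serialize a DatasetEntry, including its schema version, to JSON."""
    return json.dumps(asdict(entry), sort_keys=True)


old_record = '{"name": "ethanol", "cmiles": "CCO"}'
entry = load_entry(old_record)
round_tripped = load_entry(dump_entry(entry))
```

The design choice is that the input can stay loose while science is still exploring, and only the serialized output carries a version that downstream code can check.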
|
Namespace reorganization
(Day 1) JW:
JRGP: Might be some tricky bits at the implementation level - namespace/folder collisions? How do we let the user know which ones are installed?
DD: Did this in datreant; there are some limitations. One is that you cannot, and never will, get anything out of “import openff”.
JRGP: Introduce pytest command as recommended entry point · Issue #1629 · pytest-dev/pytest
pytest plugins register mechanism? https://docs.pytest.org/en/2.7.3/plugins_index/index.html
MT: What other orgs/projects have done something like this?
SB --
JRG --
SB --
JW -- Maintaining reverse compatibility is important.
SB -- Not sure about putting everything under the OFF repo, since that’s associated with the toolkit.
(General) -- Do we want “from openforcefield import” or “from openff import”?
(General) -- “openff”
JW -- Is there anything that shouldn’t be under the openff namespace or the OFF GitHub org?
SB -- “Research” code should be under a non-openforcefield org/namespace. Only packages that we’re committed to maintaining should be under the OFF org/namespace.
JW -- I’d like to move toward a model where anything in the GitHub org is something we’ve agreed to maintain or have deliberately archived.
SB -- Agree that things in our org space should be stamped as either maintained or archived.
MT -- Red and yellow badges mark things we feel a certain responsibility to move toward yellow or green; the organization boundary helps determine what we will expend software scientist effort on.
JRG -- I think there’s a sweet spot here according to this split; smart folks may disagree on splitting. Google has a GitHub org, but it’s a mess. It’s a bit of a wild west. How do we pursue something that still has our name, but also gives us freedom to continue to progress?
SB -- Wasn’t sure if a different org for “OpenFF studies” makes sense; not sure where to draw the lines or how much segmentation is necessary.
MT -- Clear delineation of what’s active and archived could go a long way. Can definitely see the argument that an org with a hundred repos can look messy, but it’s not clear at what point that becomes a problem.
JW -- Software that is actively maintained stays with the org; exploratory stuff lives in individuals’ or labs’ orgs. Keep an internal list of repos that records whether each is active.
JRG -- Likes the idea of a “cluster” of organizations, with various spinoffs for different things, keeping each org fairly trim of repos. So one main one for core software, another for papers, etc., plus various research spinoffs. Should we come up with practices/patterns to keep up with for “outlying” repos?
DD -- The overhead of categorizing all of these is high. Maybe we should have tags like “research”, “infrastructure”, “archived”, “dataset” that we can attach to these projects. The GitHub org webpage lets you filter by tag.
SB -- Agree. I also wonder about repo names -- can we standardize on package naming patterns?
MT -- Would that make us distinguish between software and data repos?
SB -- Yes
MT -- Maybe the distinction is possible
Summary --
(Day 2)
(asynchronous/later in day) -- Implicit namespace work (PEP 420)
Problems with _version.py → Need to change setup.cfg to have versioneer look in the module instead of the top-level package
Made implicit namespace test packages.
To install: conda create -n test -c omnia/label/beta test_package_a test_package_b
To run: from test_org.test_package_b import bar; bar.Bar()
Stubby metapackage
SB -- Could define a “magic path” in the metapackage __init__.py
Many modules for different parts of the OpenFF toolkit
Ex. from openff import topology; Offmol = topology.Molecule
Disadvantage if metapackage --
Advantage -- Would let us split off OFFMol, OFFTop, OFFToolkitWrapper into an “openforcefield-core” package, and users wouldn’t need to change import statements
Could just move everything in OFFTK to the openff top-level dir, but ALSO have an openforcefield directory hierarchy with stubs that redirect to the new code’s location
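The two ideas above (PEP 420 implicit namespace packages and a legacy package of redirecting stubs) can be tried out without installing anything. The sketch below builds throwaway packages in a temporary directory; `demo_org`, `package_a`, `package_b`, and `demo_legacy` are made-up names standing in for the test packages and for an openff/openforcefield split, not the actual layout.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def write(path: Path, text: str) -> None:
    """Create parent directories as needed and write a small source file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)


root = Path(tempfile.mkdtemp())

# Two separately "installed" distributions contribute to the same namespace.
# Neither ships demo_org/__init__.py; that absence is what makes the
# namespace implicit under PEP 420.
write(root / "dist_a" / "demo_org" / "package_a" / "__init__.py",
      "class Foo:\n    pass\n")
write(root / "dist_b" / "demo_org" / "package_b" / "__init__.py", "")
write(root / "dist_b" / "demo_org" / "package_b" / "bar.py",
      "class Bar:\n    pass\n")

# A legacy top-level package whose only job is to redirect to the new location.
write(root / "legacy" / "demo_legacy" / "__init__.py",
      "from demo_org.package_b.bar import Bar\n")

# Put all three "install locations" on the import path.
sys.path[:0] = [str(root / name) for name in ("dist_a", "dist_b", "legacy")]
importlib.invalidate_caches()

import demo_org
from demo_org.package_a import Foo
from demo_org.package_b import bar
from demo_legacy import Bar as LegacyBar

# The namespace spans both distributions, and the old import path
# resolves to the very same class object as the new one.
namespace_portions = list(demo_org.__path__)
same_class = LegacyBar is bar.Bar
```

Note that `demo_org` itself carries no code of its own, which matches the limitation DD raised about getting nothing out of a bare namespace import.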
|
Determining best practices for QC dataset naming and organization
(Day 1)
SB -- If we store the rationale for the dataset alongside metadata (e.g. date, CMILES, etc.), that is both something you know and something that is valuable later. The name becomes a bit irrelevant then.
JW -- 2019-07-02 VEHICLe gives a good example of what we probably want metadata-wise in a dataset README.
JH -- If we took some of the metadata in e.g. that README and put it in e.g. JSON, it would be usable programmatically.
JW -- Perhaps store a URL to the README, so it’s easy to find later? Not certain of the best approach. Perhaps we should include raw markdown in the metadata submission?
TG -- I think we should figure out a way to roll our own datasets. That’s kind of what’s being done now, but we need to go through Ben. If we just choose a dataset we want and keep adding to it, that might be more helpful.
JW -- If you have a dataset with a name, was it easy to find in qca-dataset-submission?
TG -- I just go to get collections in the client; I don’t use qca-dataset-submission at all -- it’s not an accurate representation of what’s in the database. If there was one unified dataset, that would solve all issues.
HJ -- Next release pull-down will be around 300-400 MB
TG -- In addition to that, we could make a dataset that contains the molecules that were complete at time of download for fitting
JW -- So, we could look at the complete molecules that Hyesu pulled down for this release, and retroactively collect them into a dataset. This could help us refine the dataset labeling process later.
TG + HJ -- Hessian dataset labeling will be complicated, but doable.
(General) -- Do we want the tarball to focus on:
JW -- Are these mutually exclusive?
SB -- I don’t think so. We could make infrastructure to handle all the different steps.
DD -- Having a snapshot of the data + code + dependencies would be useful.
TG -- A more complete solution would be to mirror/tar up a snapshot of the whole QCA when we do these fits.
DD + SB -- Disagree. It should be sufficient to store the result of the QCA pulldown.
To do
(Day 2) QCSubmit prototype:
|
Migrating packages over to GitHub Actions and unifying under one OE license
|
Reorganizing/defining/consolidating the many development-related slack channels
|
Deciding upon a consistent approach and theme for each repo’s docs
|
First Annual Devs Week Feedback
|