2021-01-13 OFF Toolkit #807 Postmortem

Date

Jan 13, 2021

Participants

  • @Jeffrey Wagner

  • @Matt Thompson

  • @Jeffry Setiadi

  • @Trevor Gokey

  • @Pavan Behara

  • @Simon Boothroyd

Goals

  • Briefly recap how the bug was introduced, detected, and fixed

  • Discuss which aspects of the response were good and which could be improved, hopefully culminating in a "Major infrastructure problem" checklist

  • Discuss how to set up meaningful reference energy tests

  • Collect ideas for changes in our day-to-day working practices to prevent such errors in the future

Discussion topics

Item

Notes

Recap of timeline

  • Work on the change started back in ~November

  • Goal: accept either rmin_half or sigma transparently, so the code needs to handle either input

  • Initially the code only used rmin_half and converted sigma if it was given

  • The code was reworked to keep and handle both

  • The bug involved the factor of two in the calculation that converts rmin_half to sigma: https://open-forcefield-toolkit.readthedocs.io/en/latest/smirnoff.html#vdw

  • The first release containing the bug came out on Dec 7 (OFFTK 0.8.1)

  • The following week, another version with the bug came out (OFFTK 0.8.2)

  • On Jan ??, Simon opened Issue 807 reporting energy explosions and tracked the problem down to the factor of two in the vdW radius conversion. He also opened PR 808 to fix it.
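For reference, the conversion discussed in the timeline can be sketched as follows. This is a minimal illustration, not the toolkit's actual code: the function names are hypothetical, and it assumes the standard 12-6 Lennard-Jones relation in which the energy minimum sits at r_min = 2^(1/6) * sigma and rmin_half = r_min / 2.

```python
def rmin_half_to_sigma(rmin_half: float) -> float:
    """Convert rmin_half to sigma (same length unit in and out).

    Hypothetical helper, not the OpenFF Toolkit API.
    r_min = 2 * rmin_half, and sigma = r_min / 2**(1/6).
    """
    return 2.0 * rmin_half / 2.0 ** (1.0 / 6.0)


def sigma_to_rmin_half(sigma: float) -> float:
    """Inverse conversion: rmin_half = sigma * 2**(1/6) / 2."""
    return sigma * 2.0 ** (1.0 / 6.0) / 2.0
```

Dropping or doubling the factor of 2.0 in either function changes sigma by a factor of two, which is the kind of error that a round-trip check alone would not necessarily catch if both directions share the same mistake.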

Response checklist

  • This checklist should be used once we have localized a major performance-affecting bug to a single package that has been deployed to production

    • A problem is major if it:

      • Could cause a lot of compute to be wasted (a significant fraction of a lab's compute allocation)

      • Affects external stakeholders

      • Is a silent error that causes inaccuracies in results

      • May be due to an upstream dependency update

  • Recruit at least one software scientist co-pilot to help make decisions on this checklist

  • Verify the problem

  • Decide on resolution strategy

    • Pick a “short-term fix” that prevents the problem from propagating ASAP

    • Pick a “long-term fix” and release, and decide whether a rollback to a previous stable version is acceptable

  • Notify users/affected people

    • Say who is affected, give relevant details like versions/dates

    • Decide on communication channels – Slack #general, Twitter, email/Mailchimp, website/blog posts, GH issue trackers, GH release page, documentation release notes.

    • Make a meta-issue for discussion of the release/rollback. This should include the error text users are likely to see when trying to install the broken packages.

  • Roll back bad packages in production to the last working version to stop bad results from propagating

    • Most likely, move the package to the broken label

  • Now that the problem has stopped propagating, take a break for ~1 hour and then meet again with your co-pilot to decide on next steps.

  • When fixing

    • If the problem can be solved by pinning away from an upstream package, simply make a new BUILD of the latest package version that pins away from the bad dependency version

    • When making the fix, add tests to catch the specific issue, using reference outputs from previous “good” software versions if possible, then cut a new release.

  • Schedule a postmortem if necessary
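For the "say who is affected" step above, a notification can include a short snippet that users run against their installed version. The sketch below is illustrative only: the names are hypothetical, and the version set lists just the two releases these notes identify as containing the bug (0.8.1 and 0.8.2). How the user obtains their installed version string is left out on purpose.

```python
# Releases identified in these notes as containing the bug.
AFFECTED_VERSIONS = frozenset({"0.8.1", "0.8.2"})


def is_affected(version: str) -> bool:
    """Return True if the given version string is a known-bad release."""
    return version.strip() in AFFECTED_VERSIONS


if __name__ == "__main__":
    for candidate in ("0.8.0", "0.8.1", "0.8.2"):
        status = "AFFECTED" if is_affected(candidate) else "not listed as affected"
        print(f"{candidate}: {status}")
```

An explicit set of bad versions is used instead of a version range so that the check never claims more than what has actually been verified.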

Setting up preventative testing

  • TG – Should openforcefields tests have caught this?

    • MT – Those are only run when there are changes to OFFXML files in that repo. Also, the current tests don’t check specific energies, they just ensure that sims don’t explode.

  • Tests using nightly/development builds?
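As a starting point for the reference energy discussion, the difference between "the sim doesn't explode" and "the energy matches a known-good value" can be sketched like this. Everything here is an assumption for illustration: the function names are hypothetical, the parameter values are arbitrary and not taken from any force field, and a real test would compare against energies stored from a previous "good" software version instead of recomputing them in the same process.

```python
import math


def lj_energy(r: float, epsilon: float, sigma: float) -> float:
    """12-6 Lennard-Jones pair energy at separation r."""
    x = (sigma / r) ** 6
    return 4.0 * epsilon * (x * x - x)


def energies_match(computed: float, reference: float, rel_tol: float = 1e-6) -> bool:
    """Compare a computed energy to a stored reference within a relative tolerance."""
    return math.isclose(computed, reference, rel_tol=rel_tol)


# Arbitrary illustrative parameters (not from any force field).
epsilon, sigma, r = 0.1, 3.0, 3.5

# Stand-in for a value saved from a known-good release.
reference = lj_energy(r, epsilon, sigma)

# An unchanged code path reproduces the reference.
assert energies_match(lj_energy(r, epsilon, sigma), reference)

# A simulated regression with sigma off by a factor of two is flagged.
assert not energies_match(lj_energy(r, epsilon, 2.0 * sigma), reference)
```

The relative tolerance would need to be chosen per platform and precision mode; the value above is a placeholder.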

Changes to development practices?

 

Action items

Decisions