Response checklist | This checklist should be used once we have localized a major performance-affecting bug to a single package that has been deployed to production.
Recruit at least one software scientist co-pilot to help make the decisions on this checklist.
Verify the problem.
Decide on a resolution strategy:
  Pick a “short-term fix” that prevents the problem from propagating ASAP.
  Pick a “long-term fix” and release, and decide whether a rollback to a previous stable version is acceptable.
Notify users/affected people:
  Say who is affected and give relevant details such as versions and dates.
  Decide on communication channels: Slack #general, Twitter, email/Mailchimp, website/blog posts, GH issue trackers, GH release page, documentation release notes.
  Create a meta-issue to handle discussion of the release/rollback. It should include the error text users are likely to see when they try to get the broken packages.
Roll back bad packages in production to the last working version / stop bad results from propagating.
Now that the problem has stopped propagating, take a break for ~1 hour, then meet again with your co-pilot to decide on next steps.
When fixing:
  If the problem can be solved by pinning away from an upstream package, simply make a new BUILD of the latest package version that pins away from the bad dependency version.
  When making the fix, add tests that catch the specific issue, using reference outputs from previous “good” software versions if possible, then cut a new release.
Schedule a postmortem if necessary
|
Setting up preventative testing | |
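
The checklist item about adding tests from the reference outputs of a previous “good” version could look roughly like the sketch below. Everything named here is hypothetical: `compute_summary` stands in for whichever package function regressed, and the JSON reference file is just one possible way to store known-good output. It assumes results are flat dictionaries of numbers.

```python
import json
import math
import tempfile
from pathlib import Path


def compute_summary(values):
    """Hypothetical stand-in for the package function that regressed."""
    return {"n": len(values), "mean": sum(values) / len(values)}


def save_reference(path, result):
    """Record output; meant to be run once under the last known-good version."""
    Path(path).write_text(json.dumps(result, sort_keys=True))


def matches_reference(path, result, rel_tol=1e-9):
    """Return True if `result` agrees with the stored reference output."""
    reference = json.loads(Path(path).read_text())
    if reference.keys() != result.keys():
        return False
    return all(
        math.isclose(reference[key], result[key], rel_tol=rel_tol)
        for key in reference
    )


def test_summary_matches_reference():
    data = [1.0, 2.0, 3.0, 4.0]
    with tempfile.TemporaryDirectory() as tmp:
        ref_path = Path(tmp) / "summary_reference.json"
        # Assumption: in a real repository this file would be produced by the
        # good version and committed, not regenerated inside the test.
        save_reference(ref_path, compute_summary(data))
        assert matches_reference(ref_path, compute_summary(data))
        # A result that drifted from the reference must be flagged.
        assert not matches_reference(ref_path, {"n": 4, "mean": 2.6})
```

Comparing with a relative tolerance rather than exact equality is a design choice: it tolerates harmless floating-point noise between platforms while still catching the kind of numerical change that prompted the response.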