At approximately 5:48 PM PDT on May 4, 2021, the Kubernetes cluster that operates ZeroTier Central and all of our network controllers lost connectivity with our database server due to an incident in the us-west2 region of Google Cloud. This was related to this incident on Google Cloud's status page. As of 9:30 PM PDT, the cluster has just had its connectivity to the database restored, but we were able to get our core service up and running again about two hours earlier (though not soon enough, in my book).
The first attempt to recover came just a few minutes after our cluster lost connectivity. We tried to spin up a new Kubernetes cluster in the us-west2 region and redeploy there, so we could keep operating in the same region as our database. That failed due to the connectivity issue linked above, which we didn't yet know about. We didn't get confirmation that there was an issue on GCP's side until 6:28 PM PDT.
Upon confirmation of the issue in GCP us-west2, I created a read-only replica of our main database server in the us-central1 region, then promoted it to master. I then brought up a Kubernetes cluster in us-central1 and redeployed ZeroTier Central and all of the network controllers via our helm charts. This got the Central REST API and network controllers up and running again by 7:20 PM PDT. Unfortunately, no one could log into the site yet.
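For the curious, the failover boiled down to a handful of commands. This is a rough sketch assuming a Cloud SQL database and GKE; the instance, cluster, and chart names here are illustrative, not our real ones:

```
# 1. Create a read replica of the primary in a healthy region.
gcloud sql instances create central-db-replica \
    --master-instance-name=central-db \
    --region=us-central1

# 2. Promote the replica so it becomes a standalone primary.
gcloud sql instances promote-replica central-db-replica

# 3. Bring up a new GKE cluster in the same region.
gcloud container clusters create central-us-central1 \
    --region=us-central1 --num-nodes=3

# 4. Redeploy Central and the network controllers from the helm charts,
#    pointed at the new database.
helm upgrade --install central ./charts/central -f values-us-central1.yaml
```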
Central has a fairly complex set of custom helm charts to deploy everything. I thought I had all of the appropriate variable substitutions set for the database, but it turned out a few were missing. Because of that, our user authentication server couldn't connect to the new database instance until I tracked down the issue in the chart. Once the chart was fixed and redeployed, logins to Central worked again.
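The chart and value names below are hypothetical, but the shape of the problem was roughly this: the database connection values were overridden for the new instance, while a substitution the auth server relied on was not, so it kept rendering with stale connection details. Rendering the chart locally before deploying makes that kind of miss visible:

```
# Override the database connection for the new instance. Names here are
# made up; in our case one component's values were missed entirely.
helm upgrade --install central ./charts/central \
    --set database.host=10.10.0.5 \
    --set authServer.database.host=10.10.0.5

# A dry-run render makes it easy to spot a component still pointing at
# the old host before anything is actually deployed:
helm template central ./charts/central \
    --set database.host=10.10.0.5 \
    --set authServer.database.host=10.10.0.5 | grep -n "host"
```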
That left a few ancillary services, which aren't in the main helm deployment charts, still to bring back up.
Things we can and will do better in the future
Some downtime is inevitable. We've been fortunate to have had very little over the years, and what downtime we have had wasn't caused by us directly. That said, this recovery didn't go as smoothly as I would have liked. First on the improvements list: practice, practice, practice. I hadn't had enough practice redeploying the entire infrastructure, and wasn't aware of a few of the pitfalls in the current charts.
Second, the helm charts need to be updated to fix the variable substitution for the database instance connection.
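One way to make that failure loud instead of silent is a pre-deploy check that renders the charts with the exact values we're about to ship and refuses to continue if any component ends up with an empty database host. This is just a sketch, with hypothetical file and variable names; Helm's `required` template function can enforce the same thing inside the chart itself.

```
#!/usr/bin/env bash
set -euo pipefail

# Render the charts with the same values file used for the real deploy.
rendered=$(helm template central ./charts/central -f values-production.yaml)

# Fail fast if any rendered manifest contains an empty DATABASE_HOST.
if echo "$rendered" | grep -qE 'DATABASE_HOST: *""? *$'; then
    echo "ERROR: a component is missing its database host" >&2
    exit 1
fi

echo "OK: every component has a database host set"
```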
Third, the ancillary sites and services need to be moved into helm charts for deployment as well.
Finally, we need to automate updating DNS records and CDN configuration when things do change.
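As a sketch of what the DNS side of that automation could look like with Cloud DNS (the zone, record name, and addresses below are placeholders), swapping an A record is a short transaction that's easy to fold into the same script that promotes the replica and redeploys the charts:

```
ZONE=example-zone
RECORD=central.example.com.
OLD_IP=198.51.100.7
NEW_IP=203.0.113.10

# Replace the old A record with one pointing at the new cluster.
gcloud dns record-sets transaction start --zone="$ZONE"
gcloud dns record-sets transaction remove --zone="$ZONE" \
    --name="$RECORD" --type=A --ttl=300 "$OLD_IP"
gcloud dns record-sets transaction add --zone="$ZONE" \
    --name="$RECORD" --type=A --ttl=300 "$NEW_IP"
gcloud dns record-sets transaction execute --zone="$ZONE"
```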
We're sorry for the downtime tonight and hope you weren't affected too much by it. We'll work hard to make sure we're back up and running faster if anything like this happens again.
May the 4th be with you.