
We recently encountered a Kubernetes cluster that had experienced a catastrophic etcd failure. Three nodes running etcd had suddenly been reduced to one, leaving the cluster without quorum. Repairing the situation required action on a number of fronts.

We had no viable backups, and had to rely on the db file that was left on the remaining etcd node.
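For orientation, that db file is etcd’s backend database, normally found under the etcd data directory. The paths below are an assumption based on a standard kubeadm layout (--data-dir=/var/lib/etcd); check the etcd manifest on your own nodes if it differs.

```bash
# Assumed default kubeadm locations; confirm against the --data-dir flag in
# the etcd static-pod manifest before relying on these paths.
ls -lh /var/lib/etcd/member/snap/db   # etcd's bbolt key-value database
ls -lh /var/lib/etcd/member/wal/      # the write-ahead log that accompanies it
```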

Our main problem was that the Kubernetes kube-apiserver service was stuck in a restart loop, as it was unable to obtain data from etcd. Without a control plane, no Kubernetes operations were possible. To restore the API we first needed to fix etcd.
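The symptom is easy to confirm from the control-plane node. The commands below are a hedged sketch and assume a kubeadm-style cluster whose control-plane components run as static pods under Docker.

```bash
# The API server is unreachable while kube-apiserver restarts...
kubectl get nodes

# ...and on the control-plane node the kube-apiserver and etcd containers show
# very short uptimes / repeated restarts (names assume kubeadm static pods).
docker ps -a | grep -E 'kube-apiserver|etcd'
```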

Attempts to run etcdctl commands in the etcd container were thwarted by its corrupt state. We couldn’t run any command to remove the failed members of the etcd cluster or inspect the data.
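For context, these are the kinds of etcdctl commands we would normally have reached for to inspect membership and remove the dead members; with etcd in this state they simply failed. The flags and certificate paths below are assumptions based on etcd v3 in a kubeadm-style layout.

```bash
# Run inside the etcd container (or with a local etcdctl binary); the
# certificate paths are assumptions for a typical kubeadm install.
export ETCDCTL_API=3
ETCD_FLAGS="--endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key"

etcdctl $ETCD_FLAGS member list                # list members and their peer/client URLs
etcdctl $ETCD_FLAGS endpoint health            # check whether this endpoint is serving
etcdctl $ETCD_FLAGS member remove <MEMBER_ID>  # drop a dead member once etcd responds again
```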

The “Loss of quorum” discussion in Andrei Kvapil’s blog post “Breaking down and fixing etcd cluster” (see blog), along with the Stack Overflow post “How to start a stopped Docker container with a different command?” (see stackoverflow), provided us with enough information to solve our problem.

To return our etcd service to a working state we needed to:

  • Edit the container’s config.v2.json file and add the --force-new-cluster flag to the Args array (see the sketch after this list)
  • Restart the docker service
  • Ensure the etcdctl commands could now be executed in the etcd container
  • Edit the container’s config.v2.json file, this time to remove the --force-new-cluster flag from the Args array
  • Restart the docker service
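A sketch of those edits follows. config.v2.json lives under Docker’s container state directory and is only re-read when the Docker daemon restarts, which is why the restarts are needed. The container lookup and paths below are placeholders, and the exact JSON layout can vary between Docker versions, so treat this as an illustration rather than a recipe.

```bash
# Locate the etcd container's config file (the name filter and ID are placeholders).
CID=$(docker ps -a --no-trunc --filter name=etcd --format '{{.ID}}' | head -n1)
CONF=/var/lib/docker/containers/${CID}/config.v2.json

# Some guides stop the Docker daemon first, since it may rewrite config.v2.json
# on shutdown and discard edits made while it is running.
vi "$CONF"
# Relevant fragment, before -> after (illustrative only):
#   "Args": ["...existing etcd flags..."]
#   "Args": ["...existing etcd flags...", "--force-new-cluster"]

systemctl restart docker   # pick up the edited config; etcd restarts as a one-member cluster

# Verify with the etcdctl commands shown earlier, then repeat the edit to
# remove "--force-new-cluster" and restart Docker once more.
```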