We recently encountered a Kubernetes cluster that had suffered a catastrophic etcd failure: its three etcd nodes had suddenly been reduced to one, leaving the cluster without quorum. Repairing the situation required action on a number of fronts.
We had no viable backups and had to rely on the db file left on the remaining etcd node.
Our main problem was that the Kubernetes kube-apiserver service was stuck in a restart loop, as it was unable to obtain data from etcd. Without a control plane, no Kubernetes operations were possible, so to restore the API we first needed to fix the etcd issue.
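On the control-plane node this shows up as an unreachable API and restarting containers; a couple of generic checks (illustrative commands, not captured output from this cluster):

```sh
# Any API client fails outright while kube-apiserver is down.
kubectl get nodes

# On the control-plane node itself, the kube-apiserver and etcd
# containers can be seen exiting and restarting.
docker ps -a | grep -E 'kube-apiserver|etcd'
```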
Attempts to run etcdctl commands in the etcd container were thwarted by its corrupt state: we couldn't run any command to remove the failed members of the etcd cluster or to inspect the data.
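For context, these are the kinds of member-management commands we were unable to run. The endpoint and certificate paths below are typical kubeadm defaults and are assumptions rather than values from this cluster:

```sh
# Assumed kubeadm certificate locations; adjust for your installation.
FLAGS="--endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key"

# List the members etcd still believes belong to the cluster.
ETCDCTL_API=3 etcdctl $FLAGS member list

# Remove a dead peer by the member ID reported above.
ETCDCTL_API=3 etcdctl $FLAGS member remove <MEMBER_ID>
```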
The “Loss of quorum” discussion in Andrei Kvapil’s blog post “Breaking down and fixing etcd cluster” (see blog), together with the Stack Overflow post “How to start a stopped Docker container with a different command?” (see stackoverflow), provided us with enough information to solve our problem.
To return our etcd service to a working state we needed to:

- edit the etcd container's config.v2.json file, adding the --force-new-cluster flag to the etcd command, and restart the container
- execute etcdctl commands in the etcd container, which was now possible, to remove the failed members
- edit the config.v2.json file again, this time to remove the --force-new-cluster flag from the etcd command, and restart the container a final time (a sketch of the full sequence is shown below)
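A minimal sketch of that sequence, assuming etcd runs as a plain Docker container and Docker keeps container state in its default location under /var/lib/docker/containers. Container IDs, member IDs and the editor are placeholders, and restarting the Docker daemon so the edited config is re-read is the technique from the Stack Overflow post referenced above:

```sh
# 1. Locate the surviving etcd container (full ID needed) and stop it.
docker ps -a --no-trunc | grep etcd
docker stop <container-id>

# 2. Add "--force-new-cluster" to the etcd arguments in config.v2.json.
vi /var/lib/docker/containers/<container-id>/config.v2.json

# 3. Restart the Docker daemon so the edited config is re-read, then start
#    the container; etcd comes back up as a single-member cluster.
systemctl restart docker
docker start <container-id>

# 4. etcdctl now works inside the container: remove the dead members
#    (add the TLS flags from the earlier example if client certs are required).
docker exec -e ETCDCTL_API=3 <container-id> etcdctl member list
docker exec -e ETCDCTL_API=3 <container-id> etcdctl member remove <MEMBER_ID>

# 5. Repeat steps 1-3 to take "--force-new-cluster" back out of
#    config.v2.json and restart etcd one final time.
```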