Fixing a broken etcd cluster | Informatics Matters

Picture of Alan Christie.

2023-05-15

We recently encountered a Kubernetes cluster that had suffered a catastrophic etcd failure. Its three etcd nodes had suddenly been reduced to one, leaving the cluster without quorum. Repairing the situation required action on a number of fronts.
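
To illustrate why a single surviving member is not enough, here is a minimal sketch of the majority rule etcd relies on: a cluster needs more than half of its members to agree. The function names are ours, chosen for illustration, and are not part of etcd.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def has_quorum(members: int, healthy: int) -> bool:
    """True when the healthy members still form a majority."""
    return healthy >= quorum_size(members)


# A three-member cluster needs two healthy members, so one survivor is stuck.
print(quorum_size(3))    # 2
print(has_quorum(3, 1))  # False
```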

We had no viable backups, and had to rely on the db file that was left on the remaining etcd node.

Our main problem was that the Kubernetes kube-apiserver service was continually restarting, as it was unable to obtain data from etcd. Without a control plane, no Kubernetes operations were possible. To restore the API we needed to fix the etcd issue.

Attempts to run etcdctl commands in the etcd container were thwarted by the container's corrupt state. We couldn't run any command to remove the failed members of the etcd cluster or to inspect the data.

The "Loss of quorum" discussion in Andrei Kvapil's blog post "Breaking down and fixing etcd cluster" (see blog), along with the StackOverflow post "How to start a stopped Docker container with a different command?" (see stackoverflow), gave us enough information to solve our problem.

To return our etcd service to a working state we needed to:

Edit the container’s config.v2.json file and add the --force-new-cluster flag to the Args array
Restart the docker service
Ensure the etcdctl commands could now be executed in the etcd container
Edit the container's config.v2.json file, this time to remove the --force-new-cluster flag from the Args array
Restart the docker service
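
The two config.v2.json edits in the steps above can be sketched in Python. This is an illustration rather than the exact procedure we followed. It assumes the file holds a top-level "Args" array, as described above, and the helper names (`add_flag`, `remove_flag`, `edit_config_file`) are hypothetical. The path in the usage comment assumes Docker's default data directory, and `<container-id>` is a placeholder you would need to fill in yourself.

```python
import json
from pathlib import Path

FLAG = "--force-new-cluster"


def add_flag(config: dict, flag: str = FLAG) -> dict:
    """Append `flag` to the container's Args array if it is not already there."""
    args = config.setdefault("Args", [])
    if flag not in args:
        args.append(flag)
    return config


def remove_flag(config: dict, flag: str = FLAG) -> dict:
    """Drop every occurrence of `flag` from the container's Args array."""
    config["Args"] = [arg for arg in config.get("Args", []) if arg != flag]
    return config


def edit_config_file(path: Path, add: bool) -> None:
    """Add or remove the flag in a config.v2.json file, keeping a .bak copy."""
    raw = path.read_text()
    # Keep a copy of the file as it was before this edit (hypothetical naming).
    path.with_name(path.name + ".bak").write_text(raw)
    config = json.loads(raw)
    if add:
        add_flag(config)
    else:
        remove_flag(config)
    path.write_text(json.dumps(config))


# Usage (assumed default location; run as root and substitute the container ID):
#   cfg = Path("/var/lib/docker/containers/<container-id>/config.v2.json")
#   edit_config_file(cfg, add=True)    # before the first Docker restart
#   edit_config_file(cfg, add=False)   # once etcdctl works again
```

Both helpers are idempotent, so running either edit twice leaves the Args array unchanged. It is probably safest to make the edit while the Docker daemon is stopped, since the daemon may otherwise rewrite the file.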
