adventures into the land of the command line

repair an unhealthy etcd node in a healthy etcd cluster

on a healthy node, find the unhealthy one and remove it from the cluster:

$ etcdctl --ca-file /etc/ssl/etcd/ssl/ca.pem --cert-file /etc/ssl/etcd/ssl/member-master2.pem --key-file /etc/ssl/etcd/ssl/member-master2-key.pem --endpoint "https://127.0.0.1:2379" cluster-health
member 6e8f4db2bc656917 is healthy: got healthy result from https://172.16.10.103:2379
failed to check the health of member c70f7ba744c678a0 on https://172.16.10.107:2379: Get https://172.16.10.107:2379/health: dial tcp 172.16.10.107:2379: getsockopt: connection refused
member c70f7ba744c678a0 is unreachable: [https://172.16.10.107:2379] are all unreachable
member e4434395baae163e is healthy: got healthy result from https://172.16.10.109:2379
cluster is healthy

$ etcdctl --ca-file /etc/ssl/etcd/ssl/ca.pem --cert-file /etc/ssl/etcd/ssl/member-master2.pem --key-file /etc/ssl/etcd/ssl/member-master2-key.pem --endpoint "https://127.0.0.1:2379" member remove c70f7ba744c678a0
Removed member c70f7ba744c678a0 from cluster

$ etcdctl --ca-file /etc/ssl/etcd/ssl/ca.pem --cert-file /etc/ssl/etcd/ssl/member-master2.pem --key-file /etc/ssl/etcd/ssl/member-master2-key.pem --endpoint "https://127.0.0.1:2379" cluster-health
member 6e8f4db2bc656917 is healthy: got healthy result from https://172.16.10.103:2379
member e4434395baae163e is healthy: got healthy result from https://172.16.10.109:2379
cluster is healthy
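
the lookup above can be shortened a little. this is only a sketch: `etcdctl_tls` and `unreachable_members` are made-up helper names, the certificate paths and endpoint are copied from the commands above, and the parsing assumes the `member <id> is unreachable:` line format shown in the output.

```shell
# wrap the TLS flags that every command in this post repeats
# (paths and endpoint copied from the commands above)
etcdctl_tls() {
  etcdctl --ca-file /etc/ssl/etcd/ssl/ca.pem \
          --cert-file /etc/ssl/etcd/ssl/member-master2.pem \
          --key-file /etc/ssl/etcd/ssl/member-master2-key.pem \
          --endpoint "https://127.0.0.1:2379" "$@"
}

# read cluster-health output on stdin and print the IDs of members
# reported as unreachable, one per line
unreachable_members() {
  awk '$1 == "member" && $3 == "is" && $4 == "unreachable:" { print $2 }'
}
```

used as `etcdctl_tls cluster-health | unreachable_members`, this should print the ID to pass to `member remove`. check the ID by eye before removing anything.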

on the unhealthy node, stop etcd and delete the snapshots and write-ahead logs from its data dir:

$ systemctl stop etcd
$ rm -rf /var/lib/etcd/member/snap/*
$ rm -rf /var/lib/etcd/member/wal/*
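
since `rm -rf` cannot be undone, it may be worth archiving the member directory after stopping etcd and before running the two `rm` commands above. a minimal sketch: `backup_etcd_member` is a made-up helper name and the default archive path is an assumption, only the data dir `/var/lib/etcd` comes from the commands above.

```shell
# archive <data_dir>/member into a tarball and print the archive path
#   $1: etcd data dir (defaults to the one used in this post)
#   $2: archive path (assumed default, pick somewhere with enough space)
backup_etcd_member() {
  data_dir="${1:-/var/lib/etcd}"
  archive="${2:-/tmp/etcd-member-backup.tar.gz}"
  tar -czf "$archive" -C "$data_dir" member && echo "$archive"
}
```

run as root with no arguments, it archives `/var/lib/etcd/member`. the archive is only a safety net and plays no part in the rejoin itself.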

on the healthy node, re-add the node as a new member with the same name and peer URL as before (it gets a new member ID):

$ etcdctl --ca-file /etc/ssl/etcd/ssl/ca.pem --cert-file /etc/ssl/etcd/ssl/member-master2.pem --key-file /etc/ssl/etcd/ssl/member-master2-key.pem --endpoint "https://127.0.0.1:2379" member add etcd1 https://172.16.10.107:2380
Added member named etcd1 with ID 962641bc9caa06aa to cluster

ETCD_NAME="etcd1"
ETCD_INITIAL_CLUSTER="etcd3=https://172.16.10.103:2380,etcd1=https://172.16.10.107:2380,etcd2=https://172.16.10.109:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
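
the variables printed by `member add` are the ones the rejoining node needs at startup. as a rough pre-start check, the sketch below tests whether an environment file sets the cluster state to `existing`. `has_existing_state` is a made-up helper name, and where the file lives depends on how etcd was installed, so the path is passed in as an argument.

```shell
# succeed if the given env file sets ETCD_INITIAL_CLUSTER_STATE to "existing"
# (with or without double quotes around the value)
#   $1: path to the etcd environment file (location is deployment-specific)
has_existing_state() {
  grep -Eq '^ETCD_INITIAL_CLUSTER_STATE="?existing"?[[:space:]]*$' "$1"
}
```

for example, `has_existing_state /etc/etcd.env || echo "fix the cluster state first"`, where `/etc/etcd.env` is only an assumed location.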

on the unhealthy node, start etcd with the settings printed above:

$ systemctl start etcd

on a healthy node, monitor the cluster status:

$ etcdctl --ca-file /etc/ssl/etcd/ssl/ca.pem --cert-file /etc/ssl/etcd/ssl/member-master2.pem --key-file /etc/ssl/etcd/ssl/member-master2-key.pem --endpoint "https://127.0.0.1:2379" member list
6e8f4db2bc656917: name=etcd3 peerURLs=https://172.16.10.103:2380 clientURLs=https://172.16.10.103:2379 isLeader=false
962641bc9caa06aa[unstarted]: peerURLs=https://172.16.10.107:2380
e4434395baae163e: name=etcd2 peerURLs=https://172.16.10.109:2380 clientURLs=https://172.16.10.109:2379 isLeader=true

$ etcdctl --ca-file /etc/ssl/etcd/ssl/ca.pem --cert-file /etc/ssl/etcd/ssl/member-master2.pem --key-file /etc/ssl/etcd/ssl/member-master2-key.pem --endpoint "https://127.0.0.1:2379" member list
6e8f4db2bc656917: name=etcd3 peerURLs=https://172.16.10.103:2380 clientURLs=https://172.16.10.103:2379 isLeader=false
962641bc9caa06aa: name=etcd1 peerURLs=https://172.16.10.107:2380 clientURLs=https://172.16.10.107:2379 isLeader=false
e4434395baae163e: name=etcd2 peerURLs=https://172.16.10.109:2380 clientURLs=https://172.16.10.109:2379 isLeader=true

$ etcdctl --ca-file /etc/ssl/etcd/ssl/ca.pem --cert-file /etc/ssl/etcd/ssl/member-master2.pem --key-file /etc/ssl/etcd/ssl/member-master2-key.pem --endpoint "https://127.0.0.1:2379" cluster-health
member 6e8f4db2bc656917 is healthy: got healthy result from https://172.16.10.103:2379
member 962641bc9caa06aa is healthy: got healthy result from https://172.16.10.107:2379
member e4434395baae163e is healthy: got healthy result from https://172.16.10.109:2379

yay.
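
the wait between the two `member list` calls above, where the new member goes from `[unstarted]` to fully listed, can be scripted instead of re-run by hand. a minimal sketch: both function names are made up, the check relies only on the `[unstarted]` marker shown in the output above, and the two-second pause is an arbitrary choice.

```shell
# read `member list` output on stdin; succeed only if there is at least
# one line of output and no member is still marked [unstarted]
all_members_started() {
  awk '/\[unstarted\]/ { bad = 1 } END { exit (NR == 0 || bad) }'
}

# run the given command up to $1 times until its output passes the check
#   $1: number of attempts, remaining args: command that prints the member list
wait_until_started() {
  tries="$1"; shift
  while [ "$tries" -gt 0 ]; do
    if "$@" | all_members_started; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 2
  done
  return 1
}
```

with the full etcdctl command line from above as the trailing arguments, e.g. `wait_until_started 30 etcdctl ... member list`, it returns 0 once the rejoined member has started and 1 if the attempts run out.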