CoreOS – etcd2 Cluster Disaster Recovery

Covered previously:

  1. etcd2 configuration and usage – http://benjr.tw/96404
  2. Adding and removing etcd2 nodes – http://benjr.tw/96449

This post tries out disaster recovery for a CoreOS etcd2 cluster, following https://coreos.com/etcd/docs/latest/etcd-live-cluster-reconfiguration.html#etcd-disaster-recovery-on-coreos

The setup is two CoreOS machines forming a single cluster:

  1. CoreOS1 (Node1), IP: 172.16.15.21
  2. CoreOS2 (Node2), IP: 172.16.15.22

Both machines have been shut down. How do we get the etcd2 service working again after the next boot?

Node1

After Node1 boots, the cluster state is clearly broken.

core@coreos1 ~ $ etcdctl cluster-health
failed to check the health of member 36b8e800818109bd on http://172.16.15.22:2379: Get http://172.16.15.22:2379/health: dial tcp 172.16.15.22:2379: getsockopt: no route to host
member 36b8e800818109bd is unreachable: [http://172.16.15.22:2379] are all unreachable
member 432deaac673805ba is unhealthy: got unhealthy result from http://172.16.15.21:2379
cluster is unhealthy
core@coreos1 ~ $ etcdctl member list
Failed to get leader:  client: etcd cluster is unavailable or misconfigured

Step one is to stop the etcd2 service.

core@coreos1 ~ $ sudo systemctl stop etcd2

Next, take a backup of the current etcd2 data. etcdctl backup copies the keyspace and rewrites node-specific metadata (the node ID and cluster ID), so the copy can later seed a fresh cluster if the live data directory turns out to be damaged.

core@coreos1 ~ $ sudo etcdctl backup --data-dir /var/lib/etcd2 --backup-dir /var/lib/etcd2_backup
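
To double-check that the copy was actually written (the exact layout can vary by etcd2 version, but the backup directory should now hold copies of the snapshot and WAL files):

core@coreos1 ~ $ ls -R /var/lib/etcd2_backup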

Now force this node to come up as a new single-member cluster. ETCD_FORCE_NEW_CLUSTER=true maps to etcd's --force-new-cluster flag: it keeps the existing data but resets the member list to this node alone, which then becomes the leader.

core@coreos1 ~ $ sudo vi /run/systemd/system/etcd2.service.d/98-force-new-cluster.conf 
[Service]
Environment="ETCD_FORCE_NEW_CLUSTER=true"

core@coreos1 ~ $ sudo systemctl daemon-reload
core@coreos1 ~ $ sudo systemctl start etcd2
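
Before querying the cluster, it can be worth tailing the service log to confirm etcd2 really did force a new cluster (plain journalctl usage, nothing etcd-specific):

core@coreos1 ~ $ journalctl -u etcd2 -n 20 --no-pager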

A quick check shows the (single-member) cluster is working again.

core@coreos1 ~ $ etcdctl member list
432deaac673805ba: name=node01 peerURLs=http://172.16.15.21:2380 clientURLs=http://172.16.15.21:2379 isLeader=true
core@coreos1 ~ $ etcdctl cluster-health
member 432deaac673805ba is healthy: got healthy result from http://172.16.15.21:2379
cluster is healthy
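
One caveat: the ETCD_FORCE_NEW_CLUSTER override must not be left behind, or every future restart of etcd2 will reset the member list again. Once the full recovery below is complete, remove the drop-in and reload systemd:

core@coreos1 ~ $ sudo rm /run/systemd/system/etcd2.service.d/98-force-new-cluster.conf
core@coreos1 ~ $ sudo systemctl daemon-reload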

Only the first node has recovered so far. Next, add Node2 back into this cluster.

core@coreos1 ~ $ etcdctl member add node02 http://172.16.15.22:2380
Added member named node02 with ID 99161f8ce735e019 to cluster

ETCD_NAME="node02"
ETCD_INITIAL_CLUSTER="node01=http://172.16.15.21:2380,node02=http://172.16.15.22:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

etcdctl member add prints exactly the environment variables the new member must start with; the drop-in created on Node2 below reuses them. Node2 now shows up in the member list, but it is marked unstarted and the cluster stays unhealthy until Node2 itself is configured.

core@coreos1 ~ $ etcdctl member list
432deaac673805ba: name=node01 peerURLs=http://172.16.15.21:2380 clientURLs=http://172.16.15.21:2379 isLeader=true
99161f8ce735e019[unstarted]: peerURLs=http://172.16.15.22:2380
core@coreos1 ~ $ etcdctl cluster-health
member 432deaac673805ba is unhealthy: got unhealthy result from http://172.16.15.21:2379
member 99161f8ce735e019 is unreachable: no available published client urls
cluster is unhealthy

Node2

Node2 is registered in the cluster's member list, but the node itself is not serving yet; a few settings are needed.

core@coreos2 ~ $ etcdctl member list
432deaac673805ba: name=node01 peerURLs=http://172.16.15.21:2380 clientURLs=http://172.16.15.21:2379 isLeader=true
99161f8ce735e019[unstarted]: peerURLs=http://172.16.15.22:2380
core@coreos2 ~ $ etcdctl cluster-health
cluster may be unhealthy: failed to list members
Error:  client: etcd cluster is unavailable or misconfigured
error #0: dial tcp 127.0.0.1:2379: getsockopt: connection refused
error #1: dial tcp 127.0.0.1:4001: getsockopt: connection refused
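
The errors above are just etcdctl trying its default local endpoints (127.0.0.1:2379 and :4001), which are down on this node. Until etcd2 is back, the healthy peer can be queried directly; --peers is a standard etcdctl v2 flag:

core@coreos2 ~ $ etcdctl --peers http://172.16.15.21:2379 cluster-health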

Step one: remove the data left over from the old cluster.

core@coreos2 ~ $ sudo su -
coreos2 ~ # cd /var/lib/etcd2/
coreos2 etcd2 # ls
member
coreos2 etcd2 # rm -rf member/
coreos2 etcd2 # exit
logout
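
The same cleanup works as a single command, without opening a root shell:

core@coreos2 ~ $ sudo rm -rf /var/lib/etcd2/member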

Then point etcd2 at the new cluster configuration via a systemd drop-in.

core@coreos2 ~ $ sudo vi /run/systemd/system/etcd2.service.d/99-restore.conf
[Service]
ExecStartPre=/usr/bin/rm -rf /var/lib/etcd2/proxy
Environment="ETCD_DISCOVERY="
Environment="ETCD_NAME=node02"
Environment="ETCD_INITIAL_CLUSTER=node01=http://172.16.15.21:2380,node02=http://172.16.15.22:2380"
Environment="ETCD_INITIAL_CLUSTER_STATE=existing"
core@coreos2 ~ $ sudo systemctl daemon-reload
core@coreos2 ~ $ sudo systemctl restart etcd2
core@coreos2 ~ $ systemctl status etcd2
● etcd2.service - etcd2
   Loaded: loaded (/usr/lib/systemd/system/etcd2.service; disabled; vendor preset: dis
  Drop-In: /run/systemd/system/etcd2.service.d
           └─20-cloudinit.conf, 99-restore.conf
   Active: active (running) since Fri 2017-01-20 03:55:32 UTC; 8s ago
  Process: 2155 ExecStartPre=/usr/bin/rm -rf /var/lib/etcd2/proxy (code=exited, status
 Main PID: 2160 (etcd2)
    Tasks: 7
   Memory: 15.7M
      CPU: 187ms
   CGroup: /system.slice/etcd2.service
           └─2160 /usr/bin/etcd2

Jan 20 03:55:32 coreos2 etcd2[2160]: starting server... [version: 2.3.7, cluster versi
Jan 20 03:55:32 coreos2 systemd[1]: Started etcd2.
Jan 20 03:55:32 coreos2 etcd2[2160]: added member e58882fe50ee8d1 [http://172.16.15.22
Jan 20 03:55:32 coreos2 etcd2[2160]: removed member e58882fe50ee8d1 from cluster 259ed
Jan 20 03:55:32 coreos2 etcd2[2160]: added member 36b8e800818109bd [http://172.16.15.2
Jan 20 03:55:32 coreos2 etcd2[2160]: the connection with 432deaac673805ba became activ
Jan 20 03:55:32 coreos2 etcd2[2160]: removed member 36b8e800818109bd from cluster 259e
Jan 20 03:55:32 coreos2 etcd2[2160]: added local member 99161f8ce735e019 [http://172.1
Jan 20 03:55:32 coreos2 etcd2[2160]: raft.node: 99161f8ce735e019 elected leader 432dea
Jan 20 03:55:32 coreos2 etcd2[2160]: published {Name:node02 ClientURLs:[http://172.16.

At this point the cluster is back to a healthy state.

core@coreos2 ~ $ etcdctl member list
432deaac673805ba: name=node01 peerURLs=http://172.16.15.21:2380 clientURLs=http://172.16.15.21:2379 isLeader=true
99161f8ce735e019: name=node02 peerURLs=http://172.16.15.22:2380 clientURLs=http://172.16.15.22:2379 isLeader=false
core@coreos2 ~ $ etcdctl cluster-health
member 432deaac673805ba is healthy: got healthy result from http://172.16.15.21:2379
member 99161f8ce735e019 is healthy: got healthy result from http://172.16.15.22:2379
cluster is healthy
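
As a final smoke test (using a hypothetical key /disaster-test), a value written on one node should be readable from the other, confirming that replication is working again:

core@coreos1 ~ $ etcdctl set /disaster-test recovered
recovered
core@coreos2 ~ $ etcdctl get /disaster-test
recovered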
