Previous posts covered:
- etcd2 setup and usage – https://benjr.tw/96404
- Adding and removing etcd2 nodes – https://benjr.tw/96449
Here we try out disaster recovery for a CoreOS etcd2 cluster, following https://coreos.com/etcd/docs/latest/etcd-live-cluster-reconfiguration.html#etcd-disaster-recovery-on-coreos
The environment has two CoreOS machines joined into a single cluster:
- CoreOS1 (Node1), IP: 172.16.15.21
- CoreOS2 (Node2), IP: 172.16.15.22
Both machines have been shut down. How do we bring the etcd2 service back up on the next boot?
Node1
After Node1 boots, the cluster status is clearly abnormal.
core@coreos1 ~ $ etcdctl cluster-health
failed to check the health of member 36b8e800818109bd on http://172.16.15.22:2379: Get http://172.16.15.22:2379/health: dial tcp 172.16.15.22:2379: getsockopt: no route to host
member 36b8e800818109bd is unreachable: [http://172.16.15.22:2379] are all unreachable
member 432deaac673805ba is unhealthy: got unhealthy result from http://172.16.15.21:2379
cluster is unhealthy

core@coreos1 ~ $ etcdctl member list
Failed to get leader: client: etcd cluster is unavailable or misconfigured
The first step is to stop the etcd2 service.
core@coreos1 ~ $ sudo systemctl stop etcd2
Next, take a backup of the etcd2 data directory. etcdctl backup copies the data while rewriting the node and cluster identity, so the copy can safely seed a node later if the original data directory turns out to be damaged.
core@coreos1 ~ $ sudo etcdctl backup --data-dir /var/lib/etcd2 --backup-dir /var/lib/etcd2_backup
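If the original data directory is unusable, the backup can be copied back before forcing the new cluster. A minimal sketch, my addition rather than part of the original post, assuming etcd2 is stopped and runs as the etcd user on CoreOS:

core@coreos1 ~ $ sudo systemctl stop etcd2           # make sure nothing writes to the data dir
core@coreos1 ~ $ sudo rm -rf /var/lib/etcd2/member   # drop the damaged data
core@coreos1 ~ $ sudo cp -r /var/lib/etcd2_backup/member /var/lib/etcd2/
core@coreos1 ~ $ sudo chown -R etcd:etcd /var/lib/etcd2/member   # assumption: etcd2 runs as user etcd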
Then force this node to start as a brand-new single-member cluster; it keeps the existing data and becomes the leader.
core@coreos1 ~ $ sudo vi /run/systemd/system/etcd2.service.d/98-force-new-cluster.conf
[Service]
Environment="ETCD_FORCE_NEW_CLUSTER=true"

core@coreos1 ~ $ sudo systemctl daemon-reload
core@coreos1 ~ $ sudo systemctl start etcd2
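To confirm that systemd actually picked up the drop-in before the service started, the merged environment of the unit can be inspected (a quick sanity check, not in the original):

core@coreos1 ~ $ systemctl show etcd2 -p Environment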
Checking again, the cluster is indeed up and running.
core@coreos1 ~ $ etcdctl member list
432deaac673805ba: name=node01 peerURLs=http://172.16.15.21:2380 clientURLs=http://172.16.15.21:2379 isLeader=true

core@coreos1 ~ $ etcdctl cluster-health
member 432deaac673805ba is healthy: got healthy result from http://172.16.15.21:2379
cluster is healthy
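ETCD_FORCE_NEW_CLUSTER is only meant for this one recovery start; the flag should be removed once the cluster is healthy, otherwise any later restart of etcd2 would force a new cluster all over again. A sketch of the cleanup:

core@coreos1 ~ $ sudo rm /run/systemd/system/etcd2.service.d/98-force-new-cluster.conf
core@coreos1 ~ $ sudo systemctl daemon-reload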
So far only the first node has been restored. Next, add Node2 back into this cluster.
core@coreos1 ~ $ etcdctl member add node02 http://172.16.15.22:2380
Added member named node02 with ID 99161f8ce735e019 to cluster

ETCD_NAME="node02"
ETCD_INITIAL_CLUSTER="node01=http://172.16.15.21:2380,node02=http://172.16.15.22:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
Node2 now shows up in the member list, but the cluster is still unhealthy; some configuration on Node2 itself is required (the ETCD_* values printed by member add above are exactly what will go into Node2's drop-in).
core@coreos1 ~ $ etcdctl member list
432deaac673805ba: name=node01 peerURLs=http://172.16.15.21:2380 clientURLs=http://172.16.15.21:2379 isLeader=true
99161f8ce735e019[unstarted]: peerURLs=http://172.16.15.22:2380

core@coreos1 ~ $ etcdctl cluster-health
member 432deaac673805ba is unhealthy: got unhealthy result from http://172.16.15.21:2379
member 99161f8ce735e019 is unreachable: no available published client urls
cluster is unhealthy
Node2
Although Node2 is already registered in the cluster, its state is still abnormal and a few settings are needed.
core@coreos2 ~ $ etcdctl member list
432deaac673805ba: name=node01 peerURLs=http://172.16.15.21:2380 clientURLs=http://172.16.15.21:2379 isLeader=true
99161f8ce735e019[unstarted]: peerURLs=http://172.16.15.22:2380

core@coreos2 ~ $ etcdctl cluster-health
cluster may be unhealthy: failed to list members
Error: client: etcd cluster is unavailable or misconfigured
error #0: dial tcp 127.0.0.1:2379: getsockopt: connection refused
error #1: dial tcp 127.0.0.1:4001: getsockopt: connection refused
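The cluster-health call fails here because etcdctl defaults to the local endpoints 127.0.0.1:2379/4001 and the local daemon is not running yet. It can be pointed at the surviving member instead; a small example of my own, using the v2 etcdctl --peers flag:

core@coreos2 ~ $ etcdctl --peers http://172.16.15.21:2379 cluster-health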
The first step is to remove the old cluster data.
core@coreos2 ~ $ sudo su -
coreos2 ~ # cd /var/lib/etcd2/
coreos2 etcd2 # ls
member
coreos2 etcd2 # rm -rf member/
coreos2 etcd2 # exit
logout
Then reconfigure etcd2 via a systemd drop-in, using the values printed by member add earlier.
core@coreos2 ~ $ sudo vi /run/systemd/system/etcd2.service.d/99-restore.conf
[Service]
ExecStartPre=/usr/bin/rm -rf /var/lib/etcd2/proxy
Environment="ETCD_DISCOVERY="
Environment="ETCD_NAME=node02"
Environment="ETCD_INITIAL_CLUSTER=node01=http://172.16.15.21:2380,node02=http://172.16.15.22:2380"
Environment="ETCD_INITIAL_CLUSTER_STATE=existing"

core@coreos2 ~ $ sudo systemctl daemon-reload
core@coreos2 ~ $ sudo systemctl restart etcd2
core@coreos2 ~ $ systemctl status etcd2
● etcd2.service - etcd2
   Loaded: loaded (/usr/lib/systemd/system/etcd2.service; disabled; vendor preset: dis
  Drop-In: /run/systemd/system/etcd2.service.d
           └─20-cloudinit.conf, 99-restore.conf
   Active: active (running) since Fri 2017-01-20 03:55:32 UTC; 8s ago
  Process: 2155 ExecStartPre=/usr/bin/rm -rf /var/lib/etcd2/proxy (code=exited, status
 Main PID: 2160 (etcd2)
    Tasks: 7
   Memory: 15.7M
      CPU: 187ms
   CGroup: /system.slice/etcd2.service
           └─2160 /usr/bin/etcd2

Jan 20 03:55:32 coreos2 etcd2[2160]: starting server... [version: 2.3.7, cluster versi
Jan 20 03:55:32 coreos2 systemd[1]: Started etcd2.
Jan 20 03:55:32 coreos2 etcd2[2160]: added member e58882fe50ee8d1 [http://172.16.15.22
Jan 20 03:55:32 coreos2 etcd2[2160]: removed member e58882fe50ee8d1 from cluster 259ed
Jan 20 03:55:32 coreos2 etcd2[2160]: added member 36b8e800818109bd [http://172.16.15.2
Jan 20 03:55:32 coreos2 etcd2[2160]: the connection with 432deaac673805ba became activ
Jan 20 03:55:32 coreos2 etcd2[2160]: removed member 36b8e800818109bd from cluster 259e
Jan 20 03:55:32 coreos2 etcd2[2160]: added local member 99161f8ce735e019 [http://172.1
Jan 20 03:55:32 coreos2 etcd2[2160]: raft.node: 99161f8ce735e019 elected leader 432dea
Jan 20 03:55:32 coreos2 etcd2[2160]: published {Name:node02 ClientURLs:[http://172.16.
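A side note not in the original walkthrough: drop-ins under /run/systemd live on a tmpfs and disappear on reboot, and the ETCD_INITIAL_CLUSTER* variables are only consulted while the member directory is first created, so leaving 99-restore.conf behind is harmless. To tidy up right away:

core@coreos2 ~ $ sudo rm /run/systemd/system/etcd2.service.d/99-restore.conf
core@coreos2 ~ $ sudo systemctl daemon-reload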
At this point the cluster status is back to normal.
core@coreos2 ~ $ etcdctl member list
432deaac673805ba: name=node01 peerURLs=http://172.16.15.21:2380 clientURLs=http://172.16.15.21:2379 isLeader=true
99161f8ce735e019: name=node02 peerURLs=http://172.16.15.22:2380 clientURLs=http://172.16.15.22:2379 isLeader=false

core@coreos2 ~ $ etcdctl cluster-health
member 432deaac673805ba is healthy: got healthy result from http://172.16.15.21:2379
member 99161f8ce735e019 is healthy: got healthy result from http://172.16.15.22:2379
cluster is healthy
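As a final check, it is worth confirming that both members really replicate data. A quick sketch using the standard v2 etcdctl set/get commands, with a hypothetical key /test/recovery: write on Node2, then read the same key back through Node1.

core@coreos2 ~ $ etcdctl set /test/recovery ok
ok
core@coreos2 ~ $ etcdctl --peers http://172.16.15.21:2379 get /test/recovery
ok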