

[proxmox] HA Fencing (node reboot using softdog)


Fencing guarantees that a node on which a failure has occurred is taken offline in a VM HA configuration.

For example, assume three nodes (TEST01, TEST02, TEST03) are configured as a cluster and HA is enabled for a single VM.

If every network except the external storage network (including the cluster link) goes down for an unknown reason, the active node (TEST03) is fenced.

This prevents the active node (TEST03) from coming back up and writing to the storage at the same time as the VM that has failed over to another node (TEST02), thereby guaranteeing data integrity.

 

There are three ways to fence a node:

  • external power switches
  • isolate nodes by disabling complete network traffic on the switch
  • self fencing using watchdog timers

When no external device is available, a watchdog is used; and when there is no hardware watchdog support, the kernel's softdog is used.

The softdog timer is hard-coded and cannot be changed; it reboots the node after roughly 30 seconds.
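
As a quick check (a minimal sketch, assuming a stock Proxmox VE install where no hardware watchdog has been enabled), you can look at the watchdog module setting and confirm that the softdog module is the one loaded:

> grep WATCHDOG_MODULE /etc/default/pve-ha-manager
> lsmod | grep softdog

If WATCHDOG_MODULE is unset or commented out, the default softdog is in use.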

 

Reference: https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#_how_it_works

 


 

 

There are cases where this fencing can be very dangerous.

If HA is configured for a single VM on external storage while many other VMs run on local storage without HA, then when the node (TEST03) is fenced, all of those local-storage VMs on it also go down as the node reboots.
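
A hedged way to see how many VMs would be affected: ha-manager config lists the resources actually under HA, and qm list shows every VM on the local node. Any VM that appears in qm list but not in ha-manager config will simply go down (and not be recovered elsewhere) when the node is fenced and reboots.

> ha-manager config
> qm list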

 

 

There are three nodes configured in the cluster.

The node IDs are 1, 2, and 3.

> pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 TEST01
         2          1 TEST02
         3          1 TEST03 (local)
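
Since fencing is ultimately triggered by quorum loss, it can also help to look at the quorum and vote details (a sketch; the exact output depends on the setup):

> pvecm status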

 

Looking at the ha-manager status, TEST03 is the master and is running as the active node for the HA VM.

> ha-manager status
quorum OK
master TEST03 (active, Tue Sep 26 16:41:24 2023)
lrm TEST01 (idle, Tue Sep 26 16:41:27 2023)
lrm TEST02 (idle, Tue Sep 26 16:41:27 2023)
lrm TEST03 (active, Tue Sep 26 16:41:26 2023)

 

 

Now the network of the active node, TEST03, is cut off for an unknown reason.

The cluster link number is 0.
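
Before looking at the log below, note that the state of the knet links can also be checked directly on the node (a sketch, assuming the default single link 0):

> corosync-cfgtool -s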

Sep 21 11:13:52 TEST03 corosync[1416]:   [KNET  ] link: host: 1 link: 0 is down
Sep 21 11:13:52 TEST03 corosync[1416]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 21 11:13:52 TEST03 corosync[1416]:   [KNET  ] host: host: 1 has no active links
Sep 21 11:13:54 TEST03 corosync[1416]:   [KNET  ] rx: host: 1 link: 0 is up
Sep 21 11:13:54 TEST03 corosync[1416]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Sep 21 11:13:54 TEST03 corosync[1416]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 21 11:13:54 TEST03 corosync[1416]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Sep 21 11:15:48 TEST03 pmxcfs[1306]: [status] notice: received log
Sep 21 11:17:01 TEST03 CRON[951896]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Sep 21 11:19:25 TEST03 corosync[1416]:   [KNET  ] link: host: 2 link: 0 is down
Sep 21 11:19:25 TEST03 corosync[1416]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 21 11:19:25 TEST03 corosync[1416]:   [KNET  ] host: host: 2 has no active links
Sep 21 11:19:27 TEST03 corosync[1416]:   [KNET  ] rx: host: 2 link: 0 is up
Sep 21 11:19:27 TEST03 corosync[1416]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Sep 21 11:19:27 TEST03 corosync[1416]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 21 11:19:27 TEST03 corosync[1416]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Sep 21 11:19:41 TEST03 corosync[1416]:   [TOTEM ] Retransmit List: 569d 
Sep 21 11:19:43 TEST03 corosync[1416]:   [KNET  ] link: host: 1 link: 0 is down
Sep 21 11:19:43 TEST03 corosync[1416]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 21 11:19:43 TEST03 corosync[1416]:   [KNET  ] host: host: 1 has no active links
Sep 21 11:19:45 TEST03 corosync[1416]:   [KNET  ] rx: host: 1 link: 0 is up
Sep 21 11:19:45 TEST03 corosync[1416]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Sep 21 11:19:45 TEST03 corosync[1416]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 21 11:19:45 TEST03 corosync[1416]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Sep 21 11:21:41 TEST03 corosync[1416]:   [KNET  ] link: host: 2 link: 0 is down
Sep 21 11:21:41 TEST03 corosync[1416]:   [KNET  ] link: host: 1 link: 0 is down
Sep 21 11:21:41 TEST03 corosync[1416]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 21 11:21:41 TEST03 corosync[1416]:   [KNET  ] host: host: 2 has no active links
Sep 21 11:21:41 TEST03 corosync[1416]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 21 11:21:41 TEST03 corosync[1416]:   [KNET  ] host: host: 1 has no active links
Sep 21 11:21:42 TEST03 corosync[1416]:   [TOTEM ] Token has not been received in 2737 ms 
Sep 21 11:21:43 TEST03 corosync[1416]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
Sep 21 11:21:48 TEST03 corosync[1416]:   [QUORUM] Sync members[1]: 3
Sep 21 11:21:48 TEST03 corosync[1416]:   [QUORUM] Sync left[2]: 1 2
Sep 21 11:21:48 TEST03 corosync[1416]:   [TOTEM ] A new membership (3.c4) was formed. Members left: 1 2
Sep 21 11:21:48 TEST03 corosync[1416]:   [TOTEM ] Failed to receive the leave message. failed: 1 2

 

 

Around 11:21:48, corosync drops nodes 1 (TEST01) and 2 (TEST02) from its membership.

After that, the node is rebooted around 11:22:22 by the softdog timer.

Sep 21 11:22:22 TEST03 systemd[1]: Stopping User Manager for UID 0...
Sep 21 11:22:22 TEST03 systemd[3814]: Stopped target Main User Target.
Sep 21 11:22:22 TEST03 systemd[3814]: Stopped target Basic System.
Sep 21 11:22:22 TEST03 systemd[3814]: Stopped target Paths.
Sep 21 11:22:22 TEST03 systemd[3814]: Stopped target Sockets.
Sep 21 11:22:22 TEST03 systemd[3814]: Stopped target Timers.
Sep 21 11:22:22 TEST03 systemd[3814]: dirmngr.socket: Succeeded.
Sep 21 11:22:22 TEST03 systemd[3814]: Closed GnuPG network certificate management daemon.
Sep 21 11:22:22 TEST03 systemd[3814]: gpg-agent-browser.socket: Succeeded.
Sep 21 11:22:22 TEST03 systemd[3814]: Closed GnuPG cryptographic agent and passphrase cache (access for web browsers).
Sep 21 11:22:22 TEST03 systemd[3814]: gpg-agent-extra.socket: Succeeded.
Sep 21 11:22:22 TEST03 systemd[3814]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
Sep 21 11:22:22 TEST03 systemd[3814]: gpg-agent-ssh.socket: Succeeded.
Sep 21 11:22:22 TEST03 systemd[3814]: Closed GnuPG cryptographic agent (ssh-agent emulation).
Sep 21 11:22:22 TEST03 systemd[3814]: gpg-agent.socket: Succeeded.
Sep 21 11:22:22 TEST03 systemd[3814]: Closed GnuPG cryptographic agent and passphrase cache.
Sep 21 11:22:22 TEST03 systemd[3814]: Removed slice User Application Slice.
Sep 21 11:22:22 TEST03 systemd[3814]: Reached target Shutdown.
Sep 21 11:22:22 TEST03 systemd[3814]: systemd-exit.service: Succeeded.
Sep 21 11:22:22 TEST03 systemd[3814]: Finished Exit the Session.
Sep 21 11:22:22 TEST03 systemd[3814]: Reached target Exit the Session.
Sep 21 11:22:22 TEST03 systemd[1]: user@0.service: Succeeded.
Sep 21 11:22:22 TEST03 systemd[1]: Stopped User Manager for UID 0.
Sep 21 11:22:22 TEST03 systemd[1]: Stopping User Runtime Directory /run/user/0...
Sep 21 11:22:22 TEST03 systemd[1]: run-user-0.mount: Succeeded.
Sep 21 11:22:22 TEST03 systemd[1]: user-runtime-dir@0.service: Succeeded.
Sep 21 11:22:22 TEST03 systemd[1]: Stopped User Runtime Directory /run/user/0.
Sep 21 11:22:22 TEST03 systemd[1]: Removed slice User Slice of UID 0.
Sep 21 11:22:22 TEST03 systemd[1]: user-0.slice: Consumed 46.349s CPU time.
Sep 21 11:22:37 TEST03 watchdog-mux[1073]: client watchdog expired - disable watchdog updates
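
Once the node comes back up, the same sequence can be reviewed from the previous boot's journal (a sketch, assuming persistent journaling is enabled; the units named below are the standard Proxmox services):

> journalctl -b -1 -u corosync -u watchdog-mux -u pve-ha-lrm -u pve-ha-crm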

 

 

It may be rare for the network to be cut off for an unknown reason.

However, if you create VMs on external storage but do not configure HA for every VM, the situation above can occur, so it is something to take into account in operation.

(With local storage, configuring HA is difficult. Reference: https://ploz.tistory.com/entry/proxmox-VM-%EB%A8%B8%EC%8B%A0-HAHigh-Availability-%EC%84%A4%EC%A0%95-%ED%95%98%EA%B8%B0)
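
For VMs that do sit on the shared/external storage, one mitigation (sketched here with a hypothetical VMID 100) is to put them under HA as well, so that a fencing event at least recovers them on another node instead of only taking them down with the reboot:

> ha-manager add vm:100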

 
