I have a two-node cluster (CentOS 7) intended to run active/passive, with DRBD resources, application resources that depend on them, and a cluster IP that depends on the applications via ordering constraints. I have no colocation constraints; instead, all my resources are in the same group so they migrate together.
There are two network interfaces on each node: one on the LAN and one a private point-to-point link. DRBD is configured to use the point-to-point link. Both networks are configured for RRP with the RRP mode set to passive; the LAN is the primary Pacemaker/Corosync ring and the point-to-point link is the backup.
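For context, the RRP setup corresponds to a corosync.conf totem section along these lines (the network addresses here are placeholders I've filled in for illustration, not taken from my actual config):

```
totem {
    version: 2
    cluster_name: MY_HA
    rrp_mode: passive              # LAN ring is primary, point-to-point is backup
    interface {
        ringnumber: 0              # ring 0: LAN
        bindnetaddr: 192.168.51.0  # placeholder LAN network
    }
    interface {
        ringnumber: 1              # ring 1: private point-to-point link (also used by DRBD)
        bindnetaddr: 10.0.0.0      # placeholder private network
    }
}
```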
Failover by rebooting or powering off the active node works fine; all resources migrate to the survivor. That is where the good news ends.
To handle the case where the active node loses connectivity to the LAN, I have a ping resource that pings a host reachable on the LAN interface, plus a ping-based location constraint meant to move the resource group to the passive node. This part, however, does not work correctly.
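These are roughly the pcs commands used to set that up (they match the configuration dump further down; exact flag syntax may vary by pcs version):

```
pcs resource create pingd ocf:pacemaker:ping \
    host_list=192.168.51.1 multiplier=1000 dampen=5s \
    op monitor interval=10 timeout=60 --clone
pcs constraint location mygroup rule score=-INFINITY \
    pingd lt 1 or not_defined pingd
```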
When I pull the LAN network cable on the active node, that node can no longer reach the ping host, and the resources stop on it, as expected. Keep in mind that thanks to RRP, the Corosync instances can still talk to each other by falling back to the private network. However, the resources fail to start on the previously passive node (the one that can still reach the gateway and should now become active), because the DRBD resources are still Primary on the node whose cable was pulled, so the filesystems can't be mounted on the node that should take over. Remember that DRBD stays connected the whole time over the private network, since that cable was never unplugged.
I can't figure out why the ping-based location constraint isn't migrating the resource group correctly along with the DRBD Primary/Secondary roles. I hope someone here can help. Below is the state after I pulled the cable and the cluster migrated as far as it could before getting stuck.
[root@za-mycluster1 ~]# pcs status
Cluster name: MY_HA
Stack: corosync
Current DC: za-mycluster1.sMY.co.za (version 1.1.20-5.el7-3c4c782f70) - partition with quorum
Last updated: Fri Apr 24 19:12:57 2020
Last change: Fri Apr 24 16:39:45 2020 by hacluster via crmd on za-mycluster1.sMY.co.za
2 nodes configured
14 resources configured
Online: [ za-mycluster1.sMY.co.za za-mycluster2.sMY.co.za ]
Full list of resources:
Master/Slave Set: LV_DATAClone [LV_DATA]
Masters: [ za-mycluster1.sMY.co.za ]
Slaves: [ za-mycluster2.sMY.co.za ]
Resource Group: mygroup
LV_DATAFS (ocf::heartbeat:Filesystem): Stopped
LV_POSTGRESFS (ocf::heartbeat:Filesystem): Stopped
postgresql_9.6 (systemd:postgresql-9.6): Stopped
LV_HOMEFS (ocf::heartbeat:Filesystem): Stopped
myapp (lsb:myapp): Stopped
ClusterIP (ocf::heartbeat:IPaddr2): Stopped
Master/Slave Set: LV_POSTGRESClone [LV_POSTGRES]
Masters: [ za-mycluster1.sMY.co.za ]
Slaves: [ za-mycluster2.sMY.co.za ]
Master/Slave Set: LV_HOMEClone [LV_HOME]
Masters: [ za-mycluster1.sMY.co.za ]
Slaves: [ za-mycluster2.sMY.co.za ]
Clone Set: pingd-clone [pingd]
Started: [ za-mycluster1.sMY.co.za za-mycluster2.sMY.co.za ]
Failed Resource Actions:
* LV_DATAFS_start_0 on za-mycluster2.sMY.co.za 'unknown error' (1): call=57, status=complete, exitreason='Couldn't mount device [/dev/drbd0] as /data',
last-rc-change='Fri Apr 24 16:59:10 2020', queued=0ms, exec=75ms
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
Note the error mounting the DRBD filesystem on the migration target. Looking at the DRBD state at this point shows that node 1 is still Primary on every device, so the DRBD resources were never demoted to Secondary when the other resources were stopped.
[root@za-mycluster1 ~]# cat /proc/drbd
version: 8.4.11-1 (api:1/proto:86-101)
GIT-hash: 66145a308421e9c124ec391a7848ac20203bb03c build by mockbuild@, 2018-11-03 01:26:55
0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:169816 nr:0 dw:169944 dr:257781 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:6108 nr:0 dw:10324 dr:17553 al:14 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:3368 nr:0 dw:4380 dr:72609 al:6 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
This is what the configuration looks like:
[root@za-mycluster1 ~]# pcs config
Cluster Name: MY_HA
Corosync Nodes:
za-mycluster1.sMY.co.za za-mycluster2.sMY.co.za
Pacemaker Nodes:
za-mycluster1.sMY.co.za za-mycluster2.sMY.co.za
Resources:
Master: LV_DATAClone
Meta Attrs: clone-max=2 clone-node-max=1 master-max=1 master-node-max=1 notify=true
Resource: LV_DATA (class=ocf provider=linbit type=drbd)
Attributes: drbd_resource=lv_DATA
Operations: demote interval=0s timeout=90 (LV_DATA-demote-interval-0s)
monitor interval=60s (LV_DATA-monitor-interval-60s)
notify interval=0s timeout=90 (LV_DATA-notify-interval-0s)
promote interval=0s timeout=90 (LV_DATA-promote-interval-0s)
reload interval=0s timeout=30 (LV_DATA-reload-interval-0s)
start interval=0s timeout=240 (LV_DATA-start-interval-0s)
stop interval=0s timeout=100 (LV_DATA-stop-interval-0s)
Group: mygroup
Resource: LV_DATAFS (class=ocf provider=heartbeat type=Filesystem)
Attributes: device=/dev/drbd0 directory=/data fstype=ext4
Operations: monitor interval=20s timeout=40s (LV_DATAFS-monitor-interval-20s)
notify interval=0s timeout=60s (LV_DATAFS-notify-interval-0s)
start interval=0s timeout=60s (LV_DATAFS-start-interval-0s)
stop interval=0s timeout=60s (LV_DATAFS-stop-interval-0s)
Resource: LV_POSTGRESFS (class=ocf provider=heartbeat type=Filesystem)
Attributes: device=/dev/drbd1 directory=/var/lib/pgsql fstype=ext4
Operations: monitor interval=20s timeout=40s (LV_POSTGRESFS-monitor-interval-20s)
notify interval=0s timeout=60s (LV_POSTGRESFS-notify-interval-0s)
start interval=0s timeout=60s (LV_POSTGRESFS-start-interval-0s)
stop interval=0s timeout=60s (LV_POSTGRESFS-stop-interval-0s)
Resource: postgresql_9.6 (class=systemd type=postgresql-9.6)
Operations: monitor interval=60s (postgresql_9.6-monitor-interval-60s)
start interval=0s timeout=100 (postgresql_9.6-start-interval-0s)
stop interval=0s timeout=100 (postgresql_9.6-stop-interval-0s)
Resource: LV_HOMEFS (class=ocf provider=heartbeat type=Filesystem)
Attributes: device=/dev/drbd2 directory=/home fstype=ext4
Operations: monitor interval=20s timeout=40s (LV_HOMEFS-monitor-interval-20s)
notify interval=0s timeout=60s (LV_HOMEFS-notify-interval-0s)
start interval=0s timeout=60s (LV_HOMEFS-start-interval-0s)
stop interval=0s timeout=60s (LV_HOMEFS-stop-interval-0s)
Resource: myapp (class=lsb type=myapp)
Operations: force-reload interval=0s timeout=15 (myapp-force-reload-interval-0s)
monitor interval=60s on-fail=standby timeout=10s (myapp-monitor-interval-60s)
restart interval=0s timeout=120s (myapp-restart-interval-0s)
start interval=0s timeout=60s (myapp-start-interval-0s)
stop interval=0s timeout=60s (myapp-stop-interval-0s)
Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
Attributes: cidr_netmask=32 ip=192.168.51.185
Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
start interval=0s timeout=20s (ClusterIP-start-interval-0s)
stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
Master: LV_POSTGRESClone
Meta Attrs: clone-max=2 clone-node-max=1 master-max=1 master-node-max=1 notify=true
Resource: LV_POSTGRES (class=ocf provider=linbit type=drbd)
Attributes: drbd_resource=lv_postgres
Operations: demote interval=0s timeout=90 (LV_POSTGRES-demote-interval-0s)
monitor interval=60s (LV_POSTGRES-monitor-interval-60s)
notify interval=0s timeout=90 (LV_POSTGRES-notify-interval-0s)
promote interval=0s timeout=90 (LV_POSTGRES-promote-interval-0s)
reload interval=0s timeout=30 (LV_POSTGRES-reload-interval-0s)
start interval=0s timeout=240 (LV_POSTGRES-start-interval-0s)
stop interval=0s timeout=100 (LV_POSTGRES-stop-interval-0s)
Master: LV_HOMEClone
Meta Attrs: clone-max=2 clone-node-max=1 master-max=1 master-node-max=1 notify=true
Resource: LV_HOME (class=ocf provider=linbit type=drbd)
Attributes: drbd_resource=lv_home
Operations: demote interval=0s timeout=90 (LV_HOME-demote-interval-0s)
monitor interval=60s (LV_HOME-monitor-interval-60s)
notify interval=0s timeout=90 (LV_HOME-notify-interval-0s)
promote interval=0s timeout=90 (LV_HOME-promote-interval-0s)
reload interval=0s timeout=30 (LV_HOME-reload-interval-0s)
start interval=0s timeout=240 (LV_HOME-start-interval-0s)
stop interval=0s timeout=100 (LV_HOME-stop-interval-0s)
Clone: pingd-clone
Resource: pingd (class=ocf provider=pacemaker type=ping)
Attributes: dampen=5s host_list=192.168.51.1 multiplier=1000
Operations: monitor interval=10 timeout=60 (pingd-monitor-interval-10)
start interval=0s timeout=60 (pingd-start-interval-0s)
stop interval=0s timeout=20 (pingd-stop-interval-0s)
Stonith Devices:
Fencing Levels:
Location Constraints:
Resource: mygroup
Constraint: location-mygroup
Rule: boolean-op=or score=-INFINITY (id:location-mygroup-rule)
Expression: pingd lt 1 (id:location-mygroup-rule-expr)
Expression: not_defined pingd (id:location-mygroup-rule-expr-1)
Ordering Constraints:
promote LV_DATAClone then start LV_DATAFS (kind:Mandatory) (id:order-LV_DATAClone-LV_DATAFS-mandatory)
promote LV_POSTGRESClone then start LV_POSTGRESFS (kind:Mandatory) (id:order-LV_POSTGRESClone-LV_POSTGRESFS-mandatory)
start LV_POSTGRESFS then start postgresql_9.6 (kind:Mandatory) (id:order-LV_POSTGRESFS-postgresql_9.6-mandatory)
promote LV_HOMEClone then start LV_HOMEFS (kind:Mandatory) (id:order-LV_HOMEClone-LV_HOMEFS-mandatory)
start LV_HOMEFS then start myapp (kind:Mandatory) (id:order-LV_HOMEFS-myapp-mandatory)
start myapp then start ClusterIP (kind:Mandatory) (id:order-myapp-ClusterIP-mandatory)
Colocation Constraints:
Ticket Constraints:
Alerts:
No alerts defined
Resources Defaults:
resource-stickiness=INFINITY
Operations Defaults:
timeout=240s
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: MY_HA
dc-version: 1.1.20-5.el7-3c4c782f70
have-watchdog: false
no-quorum-policy: ignore
stonith-enabled: false
Quorum:
Options:
Any insight would be welcome.