
The following notes describe how to migrate from a single sNow! server to multiple sNow! servers working as a High Availability (HA) cluster, with a shared file system provided by external servers.

This guide assumes that:

  • You have at least two nodes on which to install sNow!
  • The NFS client has been installed on the sNow! nodes
  • The /sNow path is mounted directly from NFS or is a bind mount to that path
  • The /home path is mounted directly from NFS or is a bind mount to that path (see the example fstab sketch after this list)
  • You have followed the instructions defined here
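
For reference, a minimal /etc/fstab sketch for the /sNow and /home mounts, assuming a hypothetical NFS server named nfs01 that exports /export/sNow and /export/home (the server name, export paths and mount options are placeholders, adjust them to your site):

# Direct NFS mounts (hypothetical server and exports)
nfs01:/export/sNow   /sNow   nfs   defaults,_netdev   0 0
nfs01:/export/home   /home   nfs   defaults,_netdev   0 0
# Bind-mount variant, if the NFS export is already mounted elsewhere (e.g. under /mnt/nfs)
/mnt/nfs/sNow        /sNow   none  bind               0 0
/mnt/nfs/home        /home   none  bind               0 0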

Domain Images Migration

By default, the domains' OS images are stored in logical volumes managed by LVM2.

The IMG_DST parameter defines how the images are stored:

  • IMG_DST='lvm=snow_vg': the domains' OS images are stored in Logical Volumes created in the snow_vg Volume Group.
  • IMG_DST='dir=/sNow/domains': the domains' OS images are stored as loopback files located inside the /sNow/domains folder.
  • IMG_DST='nfs=$NFS_SERVER:/sNow/domains': the domains' OS images are stored on and exposed through an NFS server.
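
For reference, this is roughly how the parameter might look in the sNow! configuration file (the /sNow/snow-tools/etc/snow.conf path is an assumption, check your installation); only one value should be active:

# snow.conf excerpt (sketch), keep exactly one IMG_DST value uncommented
#IMG_DST='lvm=snow_vg'
IMG_DST='dir=/sNow/domains'
#IMG_DST='nfs=$NFS_SERVER:/sNow/domains'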

If your IMG_DST is set to IMG_DST='lvm=snow_vg', you will need to migrate those images to loopback image files or to create additional NFS exports to enable HA mode over a shared file system. The following steps will guide you through migrating the domain OS images from LVM to loopback files.

  1. Stop the domains:
    snow shutdown domains
    
  2. Create a script named migrate_lvm2loopback with the following content and make it executable (chmod +x migrate_lvm2loopback). For each domain, it dumps the LVM volumes to loopback files and updates the domain configuration file:
    #!/bin/bash
    # Iterate over all domains, skipping the header and separator lines of 'snow list domains'
    for domain in $(snow list domains | egrep -v "Domain|\-\-" | gawk '{print $1}'); do
      mkdir -p /sNow/domains/$domain
      # Dump the disk and swap logical volumes to loopback files
      dd if=/dev/snow_vg/${domain}-disk of=/sNow/domains/${domain}/${domain}-disk
      dd if=/dev/snow_vg/${domain}-swap of=/sNow/domains/${domain}/${domain}-swap
      # Point the domain configuration at the loopback files instead of the LVM devices
      sed -i "s|phy:/dev/snow_vg/|tap:aio:/sNow/domains/$domain/|g" /sNow/snow-tools/etc/domains/$domain.cfg
    done
    
  3. Run the script:
    ./migrate_lvm2loopback
    
  4. Try to boot the domain:
    snow boot domain_name
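
As a quick sanity check after the migration, a hedged sketch of commands you might run (the paths follow the script above):

# Confirm the loopback files exist and have a sensible size
ls -lh /sNow/domains/*/
# Confirm the domain configurations no longer reference the LVM devices
grep -r "phy:/dev/snow_vg" /sNow/snow-tools/etc/domains/ || echo "no LVM references left"
# List the running domains
xl list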
    

Install sNow! software in the new sNow! servers

At this point, the new sNow! servers should have the NFS client set up, and the sNow! software should be at the latest stable release.

Install sNow! software as usual with the install.sh script available in /sNow/snow-tools:

cd /sNow/snow-tools
./install.sh

Once the installation is complete, you will need to reboot the server in order to boot with the new kernel.

Initialize sNow! by running snow init once the new nodes are up and running. This command will print the following warning message.

root@snow02:~# snow init
[W] sNow! configuration had been initiated before.
[W] Please, do not run this command in a production environment
[W] Do you want to proceed? (y/N)

At this point it is safe to proceed because you are initializing new nodes. Do not run this command on nodes that are already in production.

Install the required software packages on all the sNow! servers

apt update
apt install libqb0 fence-agents pacemaker corosync pacemaker-cli-utils crmsh drbd-utils -y

Repeat these commands on the other sNow! nodes.

Disable auto start of corosync and pacemaker

To avoid a fencing death-match situation, it is highly recommended to prevent corosync and pacemaker from starting at boot time. To bring the cluster up, you only need to start pacemaker.

systemctl disable corosync
systemctl disable pacemaker

Repeat on the other sNow! nodes, or run the commands remotely as shown below.
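
A hedged one-liner to do this from snow01, assuming passwordless root SSH between the sNow! nodes (snow02 is the hostname used throughout this guide):

ssh snow02 "systemctl disable corosync pacemaker"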

Setting up Corosync

Generate keys

corosync-keygen
Corosync Cluster Engine Authentication key generator.
Gathering 1024 bits for key from /dev/random.
Press keys on your keyboard to generate entropy.

Set up the right permissions

chmod 400 /etc/corosync/authkey

Transfer the key to the other nodes

scp -p /etc/corosync/authkey snow02:/etc/corosync/authkey

Configuring corosync

The content of /etc/corosync/corosync.conf should be something similar to the following example. Note that this cluster only has two nodes (snow01 and snow02).

# egrep -v "^$|#" /etc/corosync/corosync.conf
totem {
        version: 2
        cluster_name: snow
        token: 3000
        token_retransmits_before_loss_const: 10
        clear_node_high_bit: yes
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.8.0
                mcastport: 5405
                ttl: 1
        }
}
logging {
        fileline: off
        to_stderr: no
        to_logfile: no
        to_syslog: yes
        syslog_facility: daemon
        debug: off
        timestamp: on
        logger_subsys {
                subsys: QUORUM
                debug: off
        }
}
quorum {
        provider: corosync_votequorum
        expected_votes: 2
        two_node: 1
        wait_for_all: 1
}
nodelist {
    node {
        ring0_addr: snow01
        nodeid: 1
    }
    node {
        ring0_addr: snow02
        nodeid: 2
    }
}

You can download the example file and adapt it to your needs.

Edit /etc/corosync/corosync.conf in snow01 and transfer this file to the other nodes:

scp -p /etc/corosync/corosync.conf snow02:/etc/corosync/corosync.conf

Start the corosync service on all the nodes

systemctl start corosync
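
Once corosync is running on both nodes, a quick hedged check that cluster membership and quorum look sane (both commands ship with corosync):

corosync-quorumtool -s
corosync-cmapctl | grep members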

Xen configuration

Check that the XENDOMAINS_SAVE variable in /etc/default/xendomains is either empty or commented out. If it is not, comment it out.
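
For example, the relevant line might end up looking like this (the /var/lib/xen/save path is the common Debian default and is shown only as an assumption):

# /etc/default/xendomains
#XENDOMAINS_SAVE=/var/lib/xen/save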

Test if Xen allows live migration between sNow! nodes

Execute the following commands to verify that the paravirtual machines can be migrated across the sNow! nodes:

From snow01 you can execute the following commands:

snow boot deploy01
xl migrate deploy01 snow02

If it works as expected, you should see an output message similar to the following example:

[4287] snow01:~ $ xl migrate deploy01 snow02
migration target: Ready to receive domain.
Saving to migration stream new xl format (info 0x0/0x0/799)
Loading new save file <incoming migration stream> (new xl fmt info 0x0/0x0/799)
 Savefile contains xl domain config
xc: progress: Reloading memory pages: 26624/524288    5%
xc: progress: Reloading memory pages: 53248/524288   10%
xc: progress: Reloading memory pages: 78848/524288   15%
xc: progress: Reloading memory pages: 105472/524288   20%
xc: progress: Reloading memory pages: 131072/524288   25%
xc: progress: Reloading memory pages: 157696/524288   30%
xc: progress: Reloading memory pages: 184320/524288   35%
xc: progress: Reloading memory pages: 209920/524288   40%
xc: progress: Reloading memory pages: 236544/524288   45%
xc: progress: Reloading memory pages: 262144/524288   50%
xc: progress: Reloading memory pages: 288768/524288   55%
xc: progress: Reloading memory pages: 315392/524288   60%
xc: progress: Reloading memory pages: 340992/524288   65%
xc: progress: Reloading memory pages: 367616/524288   70%
xc: progress: Reloading memory pages: 393216/524288   75%
xc: progress: Reloading memory pages: 419840/524288   80%
xc: progress: Reloading memory pages: 446464/524288   85%
xc: progress: Reloading memory pages: 472064/524288   90%
xc: progress: Reloading memory pages: 498688/524288   95%
xc: progress: Reloading memory pages: 524482/524288  100%
migration target: Transfer complete, requesting permission to start domain.
migration sender: Target has acknowledged transfer.
migration sender: Giving target permission to start.
migration target: Got permission, starting domain.
migration target: Domain started successsfully.
migration sender: Target reports successful startup.
Migration successful.
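
If you want to move the domain back to snow01 after the test, a hedged sketch, to be run on snow02 where deploy01 is now running:

xl migrate deploy01 snow01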

Pacemaker

Set up the right permissions and check the cluster health

chown -R hacluster:haclient /var/lib/pacemaker
chmod 750 /var/lib/pacemaker
ssh snow02 chown -R hacluster:haclient /var/lib/pacemaker
ssh snow02 chmod 750 /var/lib/pacemaker
crm cluster health | more

Set up Pacemaker

The following steps can be automated with the setup_domains_ha.sh script shown below:

#!/bin/bash
# Build a crm configuration file with one Xen resource per domain, then load it into Pacemaker.
domain_list=$(snow list domains | egrep -v "Domain|------" | gawk '{print $1}')
# Increase the default operation timeout (live migration needs extra time)
crm_attribute --type op_defaults --attr-name timeout --attr-value 120s
rm -f pacemaker.cfg
# Global cluster properties (STONITH is configured later)
echo "property stonith-enabled=no" > pacemaker.cfg
echo "property no-quorum-policy=ignore" >> pacemaker.cfg
echo "property default-resource-stickiness=100" >> pacemaker.cfg
# Floating IP used as gateway for the private networks
echo "primitive xsnow-vip ocf:heartbeat:IPaddr2 params ip=\"10.1.0.254\" nic=\"xsnow0\" op monitor interval=\"10s\"" >> pacemaker.cfg
# One Xen primitive per domain, with live migration allowed
for domain in ${domain_list}; do
    echo "primitive $domain ocf:heartbeat:Xen \\
          params xmfile=\"/sNow/snow-tools/etc/domains/$domain.cfg\" \\
          op monitor interval=\"40s\" \\
          meta target-role=\"started\" allow-migrate=\"true\"
         " >> pacemaker.cfg
done
echo commit >> pacemaker.cfg
echo bye >> pacemaker.cfg
crm configure < pacemaker.cfg
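
A short usage sketch, assuming the script is saved in the current directory as setup_domains_ha.sh:

chmod +x setup_domains_ha.sh
./setup_domains_ha.sh
crm configure show    # review the resulting configuration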

Otherwise, you can follow the next steps to set up Pacemaker manually:

Define global configuration

Initiate the setup without STONITH. The last section explains how to set up STONITH using a fence device based on Xen.

[4242] snow01:~ $ crm configure
crm(live)configure# property stonith-enabled=no
crm(live)configure# property no-quorum-policy=ignore
crm(live)configure# property default-resource-stickiness=100
crm(live)configure# commit
crm(live)configure# bye

Define the first service in HA

Execute crm configure and define the first HA service as follows:

primitive deploy01 ocf:heartbeat:Xen \
 params xmfile="/sNow/snow-tools/etc/domains/deploy01.cfg" \
 op monitor interval="40s" \
 meta target-role="started" allow-migrate="true"

Some operations, like live migration, require extra time, especially when the VM uses a large amount of memory. It is highly recommended to increase the default timeout to avoid the live migration being cancelled due to a short time limit. The following example sets 120s as the default timeout; you can tune this value according to your VMs' needs.

crm_attribute --type op_defaults --attr-name timeout --attr-value 120s

Test!

This is a good moment to test whether HA works. To monitor the cluster, execute the following command in a new SSH session:

crm_mon

It should report something like this:

Stack: corosync
Current DC: snow01 (version 1.1.16-94ff4df) - partition with quorum
Last updated: Fri Jul 28 06:51:02 2017
Last change: Fri Jul 28 06:50:50 2017 by root via crm_resource on snow02

2 nodes configured
1 resource configured

Online: [ snow01 snow02 ]

Active resources:

deploy01        (ocf::heartbeat:Xen):   Started snow01

With the following command you can force the service to migrate to the other node (deploy01 is currently running on snow01 in the example above):

crm resource move deploy01 snow02
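
Keep in mind that crm resource move creates a location constraint that pins the resource to the target node. A hedged sketch for removing it once the test is done (newer crmsh versions also accept crm resource clear):

crm resource unmove deploy01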

Define all the HA services

Follow the previous instructions to set up all the services that need to run in HA mode. The expected outcome should be something like this:

Stack: corosync
Current DC: snow01 (version 1.1.16-94ff4df) - partition with quorum
Last updated: Fri Jul 28 07:13:00 2017
Last change: Fri Jul 28 07:11:31 2017 by root via cibadmin on snow01

2 nodes configured
8 resources configured

Online: [ snow01 snow02 ]

Active resources:

deploy01        (ocf::heartbeat:Xen):   Started snow01
proxy01 (ocf::heartbeat:Xen):   Started snow01
monitor01       (ocf::heartbeat:Xen):   Started snow02
nis01   (ocf::heartbeat:Xen):   Started snow02
syslog01        (ocf::heartbeat:Xen):   Started snow01
maui01  (ocf::heartbeat:Xen):   Started snow02
nis02   (ocf::heartbeat:Xen):   Started snow01
flexlm01        (ocf::heartbeat:Xen):   Started snow02

Define floating IP for gateway

The sNow! servers also play a gateway role. The following instructions define HA for this service.

Execute crm configure and define the xsnow-vip service as follows:

primitive xsnow-vip ocf:heartbeat:IPaddr2 params ip="10.1.0.254" nic="xsnow0" op monitor interval="10s"
commit
bye

Note that this IP (10.1.0.254) must match the IP defined in NET_SNOW and NET_COMP in snow.conf.
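
To verify which node currently holds the floating IP, a hedged sketch (xsnow0 is the bridge name used in the primitive above):

crm_mon -1 | grep xsnow-vip
ip addr show dev xsnow0 | grep 10.1.0.254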

Service placement

To balance services across the two nodes, and also to distribute additional services that have native HA (e.g. slurm-master and slurm-slave), you can use the following instructions to define the preferred host for each resource. Location preferences help keep services well balanced, but be aware that they can trigger a failback onto a node that is still “semi-faulty” if it has an ongoing issue.

crm(live)# configure
crm(live)configure# location cli-prefer-maui01 maui01 role=Started inf: snow01
crm(live)configure# location cli-prefer-nis01 nis01 role=Started inf: snow01
crm(live)configure# location cli-prefer-proxy01 proxy01 role=Started inf: snow01
crm(live)configure# location cli-prefer-flexlm01 flexlm01 role=Started inf: snow02
crm(live)configure# location cli-prefer-monitor01 monitor01 role=Started inf: snow02
crm(live)configure# location cli-prefer-nis02 nis02 role=Started inf: snow02
crm(live)configure# location cli-prefer-syslog01 syslog01 role=Started inf: snow02
crm(live)configure# commit
crm(live)configure# bye
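
After committing, a hedged way to double-check that the location constraints are in place and to see where each resource ended up:

crm configure show | grep ^location
crm_mon -1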