
Pacemaker cluster stack

Pacemaker cluster stack is a state-of-the-art high availability and load balancing stack for the Linux platform. Pacemaker is used to make OpenStack infrastructure highly available.

Note

It is storage and application-agnostic, and in no way specific to OpenStack.

Pacemaker relies on the Corosync messaging layer for reliable cluster communications. Corosync implements the Totem single-ring ordering and membership protocol. It also provides UDP and InfiniBand based messaging, quorum, and cluster membership to Pacemaker.

Pacemaker does not inherently understand the applications it manages. Instead, it relies on resource agents (RAs): scripts that encapsulate the knowledge of how to start, stop, and check the health of each application managed by the cluster.

These agents must conform to one of the OCF, SysV Init, Upstart, or Systemd standards.

Pacemaker ships with a large set of OCF agents (such as those managing MySQL databases, virtual IP addresses, and RabbitMQ), but can also use any agents already installed on your system and can be extended with your own (see the developer guide).
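
For example, to see which OCF agents are available on a node and to inspect the parameters that one of them accepts, commands such as the following can be used (shown for both pcs and crmsh; the output depends on the distribution and on which resource-agents packages are installed):

$ pcs resource list ocf:heartbeat
$ pcs resource describe ocf:heartbeat:IPaddr2

or, with crmsh:

$ crm ra list ocf heartbeat
$ crm ra info ocf:heartbeat:IPaddr2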

The steps to implement the Pacemaker cluster stack are:

Install packages

On any host that is meant to be part of a Pacemaker cluster, establish cluster communications through the Corosync messaging layer. This involves installing the following packages (and their dependencies, which your package manager usually installs automatically):

  • pacemaker
  • pcs (CentOS or RHEL) or crmsh
  • corosync
  • fence-agents (CentOS or RHEL) or cluster-glue
  • resource-agents
  • libqb0

Set up the cluster with pcs

  1. Make sure pcsd is running and configured to start at boot time:

    $ systemctl enable pcsd
    $ systemctl start pcsd
  2. Set a password for hacluster user on each host:

    Note

    Since the cluster is a single administrative domain, it is acceptable to use the same password on all nodes.

    $ echo my-secret-password-no-dont-use-this-one | passwd --stdin hacluster
  3. Use that password to authenticate to the nodes that will make up the cluster:

    Note

    The -p option is used to give the password on the command line, which makes it easier to script.

    $ pcs cluster auth controller1 controller2 controller3 \
      -u hacluster -p my-secret-password-no-dont-use-this-one --force
  4. Create and name the cluster. Then, start it and enable all components to auto-start at boot time:

    $ pcs cluster setup --force --name my-first-openstack-cluster \
      controller1 controller2 controller3
    $ pcs cluster start --all
    $ pcs cluster enable --all

Note

In Red Hat Enterprise Linux or CentOS environments, this is a recommended path to perform configuration. For more information, see the RHEL docs.
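
As a quick sanity check (not part of the procedure itself), the state of the newly created cluster can be reviewed from any node using the pcs tools installed earlier:

$ pcs status
$ pcs status corosync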

Set up the cluster with crmsh

After installing the Corosync package, you must create the /etc/corosync/corosync.conf configuration file.

Note

For Ubuntu, you should also enable the Corosync service in the /etc/default/corosync configuration file.

Corosync can be configured to work with either multicast or unicast IP addresses or to use the votequorum library.

Set up Corosync with multicast

Most distributions ship an example configuration file (corosync.conf.example) as part of the documentation bundled with the Corosync package. An example Corosync configuration file is shown below:

Example Corosync configuration file for multicast (``corosync.conf``)

Note the following:

  • The token value specifies the time, in milliseconds, during which the Corosync token is expected to be transmitted around the ring. When this timeout expires, the token is declared lost, and after token_retransmits_before_loss_const lost tokens, the non-responding processor (cluster node) is declared dead. In other words, token × token_retransmits_before_loss_const is the maximum time a node is allowed to not respond to cluster messages before being considered dead. The default for token is 1000 milliseconds (1 second), with 4 allowed retransmits. These defaults are intended to minimize failover times, but can cause frequent false alarms and unintended failovers in case of short network interruptions. The values used here are safer, albeit with slightly extended failover times.

  • With secauth enabled, Corosync nodes mutually authenticate using a 128-byte shared secret stored in the /etc/corosync/authkey file, which can be generated with the corosync-keygen utility. Cluster communications are encrypted when secauth is enabled.

  • In Corosync, configurations that use redundant networking (with more than one interface) must select a Redundant Ring Protocol (RRP) mode other than none. We recommend active as the RRP mode.

    Note the following about the recommended interface configuration:

    • Each configured interface must have a unique ringnumber, starting with 0.
    • The bindnetaddr is the network address of the interfaces to bind to. The example uses two network addresses of /24 IPv4 subnets.
    • Multicast groups (mcastaddr) must not be reused across cluster boundaries. No two distinct clusters should ever use the same multicast group. Be sure to select multicast addresses compliant with RFC 2365, “Administratively Scoped IP Multicast”.
    • For firewall configurations, Corosync communicates over UDP only, and uses mcastport (for receives) and mcastport - 1 (for sends).
  • The service declaration for the Pacemaker service may be placed in the corosync.conf file directly or in its own separate file, /etc/corosync/service.d/pacemaker.

    Note

    If you are using Corosync version 2 on Ubuntu 14.04, remove or comment out lines under the service stanza. These stanzas enable Pacemaker to start up. Another potential problem is the boot and shutdown order of Corosync and Pacemaker. To force Pacemaker to start after Corosync and stop before Corosync, fix the start and kill symlinks manually:

    # update-rc.d pacemaker start 20 2 3 4 5 . stop 00 0 1 6 .

    The Pacemaker service also requires an additional configuration file, /etc/corosync/uidgid.d/pacemaker, to be created with the following content:

    uidgid {
      uid: hacluster
      gid: haclient
    }
  • Once created, synchronize the corosync.conf file (and the authkey file, if the secauth option is enabled) across all cluster nodes; a sketch of this step is shown after the configuration file below.

totem {
    version: 2

    # Time (in ms) to wait for a token (1)
    token: 10000

    # How many token retransmits before forming a new
    # configuration
    token_retransmits_before_loss_const: 10

    # Turn off the virtual synchrony filter
    vsftype: none

    # Enable encryption (2)
    secauth: on

    # How many threads to use for encryption/decryption
    threads: 0

    # This specifies the redundant ring protocol, which may be
    # none, active, or passive. (3)
    rrp_mode: active

    # The following is a two-ring multicast configuration. (4)
    interface {
        ringnumber: 0
        bindnetaddr: 10.0.0.0
        mcastaddr: 239.255.42.1
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 10.0.42.0
        mcastaddr: 239.255.42.2
        mcastport: 5405
    }
}

amf {
    mode: disabled
}

service {
    # Load the Pacemaker Cluster Resource Manager (5)
    ver: 1
    name: pacemaker
}

aisexec {
    user: root
    group: root
}

logging {
    fileline: off
    to_stderr: yes
    to_logfile: no
    to_syslog: yes
    syslog_facility: daemon
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
        tags: enter|leave|trace1|trace2|trace3|trace4|trace6
    }
}
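
A minimal sketch of this synchronization step, assuming three example nodes named controller1, controller2, and controller3 with root SSH access between them (adjust the hostnames to your environment):

# corosync-keygen
# scp -p /etc/corosync/authkey /etc/corosync/corosync.conf controller2:/etc/corosync/
# scp -p /etc/corosync/authkey /etc/corosync/corosync.conf controller3:/etc/corosync/

The authkey file must remain readable by root only; scp -p preserves the restrictive permissions set by corosync-keygen.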

Set up Corosync with unicast

For environments that do not support multicast, Corosync should be configured for unicast. An example fragment of the corosync.conf file for unicast is shown below:

Corosync configuration file fragment for unicast (``corosync.conf``)

Note the following:

  • If the broadcast parameter is set to yes, the broadcast address is used for communication. If this option is set, the mcastaddr parameter should not be set.

  • The transport directive controls the transport mechanism. To avoid the use of multicast entirely, specify the udpu unicast transport parameter. This requires specifying the list of members in the nodelist directive, which fixes the cluster membership ahead of deployment. The default is udp. The transport type can also be set to udpu or iba.

  • Within the nodelist directive, it is possible to specify specific information about the nodes in the cluster. The directive can contain only the node sub-directive, which specifies every node that should be a member of the membership, and where non-default options are needed. Every node must have at least the ring0_addr field filled in.

    Note

    For UDPU, every node that should be a member of the membership must be specified.

    Possible options are:

    • ring{X}_addr specifies the IP address of one of the nodes, where {X} is the ring number.
    • nodeid is optional when using IPv4 and required when using IPv6. This is a 32-bit value specifying the node identifier delivered to the cluster membership service. If this is not specified with IPv4, the node ID is determined from the 32-bit IP address of the system to which the system is bound with ring identifier of 0. The node identifier value of zero is reserved and should not be used.
totem {
    #...
    interface {
        ringnumber: 0
        bindnetaddr: 10.0.0.0
        broadcast: yes (1)
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 10.0.42.0
        broadcast: yes
        mcastport: 5405
    }
    transport: udpu (2)
}

nodelist { (3)
    node {
        ring0_addr: 10.0.0.12
        ring1_addr: 10.0.42.12
        nodeid: 1
    }
    node {
        ring0_addr: 10.0.0.13
        ring1_addr: 10.0.42.13
        nodeid: 2
    }
    node {
        ring0_addr: 10.0.0.14
        ring1_addr: 10.0.42.14
        nodeid: 3
    }
}
#...
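
Once Corosync has been started with the udpu transport (see Start Corosync below), you can confirm that the node list was loaded as configured by querying the runtime database; this is an illustrative check, and on Corosync version 1 the corosync-objctl member dump shown in the Start Corosync section serves the same purpose:

# corosync-cmapctl | grep nodelist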

Set up Corosync with votequorum library

The votequorum library is part of the Corosync project. It provides an interface to the vote-based quorum service and it must be explicitly enabled in the Corosync configuration file. The main role of votequorum library is to avoid split-brain situations, but it also provides a mechanism to:

  • Query the quorum status
  • List the nodes known to the quorum service
  • Receive notifications of quorum state changes
  • Change the number of votes assigned to a node
  • Change the number of expected votes for a cluster to be quorate
  • Connect an additional quorum device to allow small clusters to remain quorate during node outages

The votequorum library was created to replace and eliminate qdisk, the disk-based quorum daemon for CMAN, from advanced cluster configurations.

A sample votequorum service configuration in the corosync.conf file is:

Note the following:

  • Specifying corosync_votequorum as the provider enables the votequorum library. This is the only required option.
  • The cluster is fully operational with expected_votes set to 7 nodes (each node has 1 vote), quorum: 4. If a list of nodes is specified as nodelist, the expected_votes value is ignored.
  • When you start up a cluster (all nodes down) and set wait_for_all to 1, the cluster quorum is held until all nodes are online and have joined the cluster for the first time. This parameter is new in Corosync 2.0.
  • Setting last_man_standing to 1 enables the Last Man Standing (LMS) feature. By default, it is disabled (set to 0). If a cluster is on the quorum edge (expected_votes set to 7; online nodes down to 4) for longer than the time specified for the last_man_standing_window parameter, the cluster can recalculate quorum and continue operating even if the next node will be lost. This logic is repeated until the number of online nodes in the cluster reaches 2. In order to allow the cluster to step down from 2 members to only 1, the auto_tie_breaker parameter needs to be set. We do not recommend this for production environments.
  • last_man_standing_window specifies the time, in milliseconds, required to recalculate quorum after one or more hosts have been lost from the cluster. To perform a new quorum recalculation, the cluster must have quorum for at least the interval specified for last_man_standing_window. The default is 10000ms.
quorum {
    provider: corosync_votequorum (1)
    expected_votes: 7 (2)
    wait_for_all: 1 (3)
    last_man_standing: 1 (4)
    last_man_standing_window: 10000 (5)
}
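
Once the cluster is running with the votequorum provider, the quorum state and vote counts can be inspected with the corosync-quorumtool utility (part of Corosync 2); this is an illustrative check rather than part of the configuration:

# corosync-quorumtool -s
# corosync-quorumtool -l

The -s option prints the quorum status (votes expected, votes total, and whether the partition is quorate), and -l lists the current member nodes.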

Start Corosync

Corosync is started as a regular system service. Depending on your distribution, it may ship with an LSB init script, an upstart job, or a Systemd unit file.

  • Start with the LSB init script:

    # /etc/init.d/corosync start

    Alternatively:

    # service corosync start
  • Start with upstart:

    # start corosync
  • Start with the systemd unit file:

    # systemctl start corosync

You can now check the connectivity with one of these tools.

Use the corosync-cfgtool utility with the -s option to get a summary of the health of the communication rings:

Use the corosync-objctl utility to dump the Corosync cluster member list:

You should see a status=joined entry for each of your constituent cluster nodes.

Note

If you are using Corosync version 2, use the corosync-cmapctl utility instead of corosync-objctl; it is a direct replacement.

# corosync-cfgtool -s
Printing ring status.
Local node ID 435324542
RING ID 0
        id      = 10.0.0.82
        status  = ring 0 active with no faults
RING ID 1
        id      = 10.0.42.100
        status  = ring 1 active with no faults
# corosync-objctl runtime.totem.pg.mrp.srp.members
runtime.totem.pg.mrp.srp.435324542.ip=r(0) ip(10.0.0.82) r(1) ip(10.0.42.100)
runtime.totem.pg.mrp.srp.435324542.join_count=1
runtime.totem.pg.mrp.srp.435324542.status=joined
runtime.totem.pg.mrp.srp.983895584.ip=r(0) ip(10.0.0.87) r(1) ip(10.0.42.254)
runtime.totem.pg.mrp.srp.983895584.join_count=1
runtime.totem.pg.mrp.srp.983895584.status=joined

Start Pacemaker

After the Corosync service has been started and you have verified that the cluster is communicating properly, you can start pacemakerd, the Pacemaker master control process. Choose one from the following four ways to start it:

  1. Start with the LSB init script:

    # /etc/init.d/pacemaker start

    Alternatively:

    # service pacemaker start
  2. Start with upstart:

    # start pacemaker
  3. Start with the systemd unit file:

    # systemctl start pacemaker

After the service has started, Pacemaker creates a default empty cluster configuration with no resources. Use the crm_mon utility to observe the status of Pacemaker:

# crm_mon -1
Last updated: Sun Oct 7 21:07:52 2012
Last change: Sun Oct 7 20:46:00 2012 via cibadmin on controller2
Stack: openais
Current DC: controller2 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
3 Nodes configured, 3 expected votes
0 Resources configured.

Online: [ controller3 controller2 controller1 ]
...

Set basic cluster properties

After you set up your Pacemaker cluster, set a few basic cluster properties:

  • $ crm configure property pe-warn-series-max="1000" \
        pe-input-series-max="1000" \
        pe-error-series-max="1000" \
        cluster-recheck-interval="5min"
  • $ pcs property set pe-warn-series-max=1000 \
        pe-input-series-max=1000 \
        pe-error-series-max=1000 \
        cluster-recheck-interval=5min

Note the following:

  • Setting the pe-warn-series-max, pe-input-series-max, and pe-error-series-max parameters to 1000 instructs Pacemaker to keep a longer history of the inputs processed and the errors and warnings generated by its Policy Engine. This history is useful if you need to troubleshoot the cluster.
  • Pacemaker uses an event-driven approach to cluster state processing. The cluster-recheck-interval parameter (which defaults to 15 minutes) defines the interval at which certain Pacemaker actions occur. It is usually prudent to reduce this to a shorter interval, such as 5 or 3 minutes.

By default, STONITH is enabled in Pacemaker, but STONITH mechanisms (to shut down a node via IPMI or ssh) are not configured. In this case, Pacemaker will refuse to start any resources. For production clusters, it is recommended to configure appropriate STONITH mechanisms. For demo or testing purposes, STONITH can be disabled completely as follows:

  • $ crm configure property stonith-enabled=false
  • $ pcs property set stonith-enabled=false

After you make these changes, commit the updated configuration.
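
To confirm that the new values took effect, review the configuration with whichever toolset you used; for example:

$ crm configure show

or:

$ pcs property list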

Introduction

High availability, usually abbreviated to "HA", is a term used to describe systems and software frameworks that are designed to preserve application service availability even in the event of failure of a component of the system. The failed component might be software or hardware; the HA framework will attempt to respond to the failure such that the applications running within the framework continue to operate correctly.

While the number of discrete failure scenarios that might be catalogued is potentially very large, they generally fall into one of a very small number of categories:

  1. Failure of the application providing the service
  2. Failure of a software dependency upon which the application relies
  3. Failure of a hardware dependency upon which the application relies
  4. Failure of an external service or infrastructure component upon which the application or supporting framework relies

HA systems protect application availability by grouping sets of servers and software into cooperative units or clusters. HA clusters are typically groups of two or more servers, each running their own operating platform, that communicate with one another over a network connection. HA clusters will often have multi-ported, shared external storage, with each server in the cluster connected over redundant storage paths to the storage hardware.

A cluster software framework manages communication between the cluster participants (nodes). The framework will communicate the health of system hardware and application services between the nodes in the cluster and provide means to manage services and nodes, as well as react to changes in the cluster environment (e.g., server failure).

HA systems are characterized as typically having redundancy in the hardware configuration: two or more servers, each with two or more storage IO paths and often two or more network interfaces configured using bonding or link aggregation. Storage systems will often have similar redundancy characteristics, such as RAID data protection.

Measurements of availability are normally applied to the availability of the applications running on the HA cluster, rather than the hosting infrastructure. For example, loss of a physical server due to a component failure would trigger a failover or migration of the services that the server was providing to another node in the cluster. In this scenario, the outage duration would be the measure of time taken to migrate the applications to another node and restore the applications to running state. The service may be considered degraded until the failed component is repaired and restored, but the HA framework has avoided an ongoing outage.

On systems running an operating system based on Linux, the most commonly used HA cluster framework comprises two software applications used in combination: Pacemaker – to provide resource management – and Corosync – to provide cluster communications and low-level management, such as membership and quorum. Pacemaker can trace its genesis back to the original Linux HA project, called Heartbeat, while Corosync is derived from the OpenAIS project.

Pacemaker and Corosync are widely supported across the major Linux distributions, including Red Hat Enterprise Linux and SuSE Linux Enterprise Server. Red Hat Enterprise Linux version 6 used a very complex HA solution incorporating several other tools, and while this has been simplified since the release of RHEL 6.4, there is still some legacy software in the framework.

With the release of RHEL 7, the high-availability framework from Red Hat has been rationalized around Pacemaker and Corosync v2, simplifying the software environment. Red Hat also provides a command line tool called PCS (Pacemaker and Corosync Shell) that is available for both RHEL version 6 and version 7. PCS provides a consistent system management command interface for the high availability software and abstracts the underlying software implementation.

Note: Lustre does not absolutely need to be incorporated into an HA software framework such as Pacemaker, but doing so enables the operating platform to automatically make decisions about failover/migration of services without operator intervention. HA frameworks also help with general maintenance and management of application resources.

HA Framework Configuration for a Two-Node Cluster

Red Hat Enterprise Linux and CentOS

Red Hat Enterprise Linux version 6 has a complex history with regard to the development and provision of HA software. Prior to version 6.4, Red Hat's high availability software was complex and difficult to install and maintain. With the release of RHEL 6.4 and in all subsequent RHEL 6 updates, this has been consolidated around three principal packages: Pacemaker, Corosync version 1, and CMAN. The software stack was further simplified in RHEL 7 to just Pacemaker and Corosync version 2.

Red Hat EL 6 HA clusters use Pacemaker to provide cluster resource management (CRM), while CMAN is used to provide cluster membership and quorum services. Corosync provides communications but no other services. CMAN is unique to Red Hat Enterprise Linux and is part of an older framework. In RHEL 7, CMAN is no longer required and its functionality is entirely accommodated by Corosync version 2, but for any HA clusters running RHEL 6, Red Hat stipulates the use of CMAN in Pacemaker clusters.

The PCS application (Pacemaker and Corosync Shell) was also introduced in RHEL 6.4 and is available in current releases of both RHEL 6 and 7. PCS simplifies the installation and configuration of HA clusters in Red Hat.

Hardware and Server Infrastructure Prerequisites

This article will demonstrate how to configure a Lustre high-availability building block using two servers and a dedicated external storage array that is connected to both servers. The two-node building block designs for metadata servers and object storage servers provide a suitable basis for deployment of a production-ready, high-availability Lustre parallel file system cluster.

Figure 1 shows a blue-print for typical high-availability Lustre server building blocks, one for the metadata and management services, and one for object storage.

Each server depicted in Figure 1 requires three network interfaces:

  1. A dedicated cluster communication network between paired servers, used as a Corosync communications ring. This can be a cross-over / point-to-point connection, or can be made via a switch.
  2. A management network or public interface connection. This will be used by the HA cluster as an additional communications ring for Corosync.
  3. Public interface, used for connection to the high performance data network – this is the network from which Lustre services will normally be accessed by client computers

A variation on this architecture, not specifically covered in this guide, has a single Corosync communications ring made from two network interfaces that are configured into a bond on a private network. The bond is created per the operating system documented process, and then added to the Corosync configuration.

Software Prerequisites

RHEL / CentOS

In addition to the prerequisites previously described for Lustre, the operating system requires installation of the HA software suite. It may also be necessary to enable optional repositories. For RHEL systems, the subscription-manager command can be used to enable the software entitlements for the HA software packages. For example:

subscription-manager repos \
    --enable rhel-ha-for-rhel-7-server-rpms \
    --enable rhel-7-server-optional-rpms

or:

subscription-manager repos \
    --enable rhel-ha-for-rhel-6-server-rpms \
    --enable rhel-6-server-optional-rpms

This step is not required for CentOS. Refer to the documentation for the operating system distribution for more complete information on enabling subscription entitlements.

Install the HA software

RHEL / CentOS

  1. Login as the super-user (root) on each of the servers in the proposed cluster and install the HA framework software:

    yum -y install pcs pacemaker corosync fence-agents [cman]

    Note: The cman package is only required for RHEL/CentOS 6 servers.

  2. On each server, add a user account to be used for cluster management and set a password for that account. The convention is to create a user account with the name hacluster. The user should have been created as part of the package installation (the account is added when the Pacemaker packages are installed). PCS will make use of this account to facilitate cluster management: the hacluster account is used to authenticate the command line application, pcs, with the pcsd configuration daemon running on each cluster node (pcsd is used by the pcs application to manage distribution of commands and synchronize the cluster configuration between the nodes).

    The following is taken from the package postinstall script and shows the basic procedure for adding the hacluster account if it does not already exist:

    getent group haclient >/dev/null || groupadd -r haclient -g 189
    getent passwd hacluster >/dev/null || useradd -r -g haclient -u 189 -s /sbin/nologin -c "cluster user" hacluster
  3. Set a password for the hacluster account. This must be set, and there is no default. Make the password the same on each cluster node:

    passwd hacluster
  4. Modify or disable the firewall software on each server in the cluster. According to Red Hat, the following ports need to be enabled:
    • TCP: ports 2224, 3121, 21064
    • UDP: ports 5404, 5405

    In RHEL 7, the firewall software can be configured to permit cluster traffic as follows:

    firewall-cmd --permanent --add-service=high-availability
    firewall-cmd --add-service=high-availability

    Verify the firewall configuration:

    firewall-cmd --list-service
  5. Lustre also requires port 988 to be open for incoming connections, and ports 1021-1023 for outgoing connections (an example firewall rule is shown after this list).
  6. Alternatively, disable the firewall completely.

    For RHEL 7:

    systemctl stop firewalld
    systemctl disable firewalld
  7. And for RHEL 6:

    chkconfig iptables off
    chkconfig ip6tables off
    service iptables stop
    service ip6tables stop
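
If the firewall is to remain enabled, the Lustre port referred to in step 5 must also be opened. The following is a minimal sketch for RHEL 7 with firewalld, assuming the default Lustre/LNET TCP port 988:

firewall-cmd --permanent --add-port=988/tcp
firewall-cmd --add-port=988/tcp

The outgoing ports 1021-1023 are source ports used for connections initiated by the node and do not normally require additional inbound rules on a stateful firewall.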

Note: When working with hostnames in Pacemaker and Corosync, always use the fully qualified domain name to reference cluster nodes.

Configure the Core HA Framework – PCS Instructions

Configure the PCS Daemon

  1. Start the Pacemaker configuration daemon, pcsd, on all servers:
    • RHEL 7:

      systemctl start pcsd.service
      systemctl enable pcsd.service
    • RHEL 6:

      service pcsd start
      chkconfig pcsd on
  2. Verify that the service is running:
    • RHEL 7: systemctl status pcsd.service
    • RHEL 6: service pcsd status

    The following example is taken from a server running RHEL 7:

    [root@rh7z-mds1 ~]# systemctl start pcsd.service
    [root@rh7z-mds1 ~]# systemctl status pcsd.service
    • pcsd.service - PCS GUI and remote configuration interface
       Loaded: loaded (/usr/lib/systemd/system/pcsd.service; enabled; vendor preset: disabled)
       Active: active (running) since Wed 2016-04-13 01:30:52 EDT; 1min 11s ago
     Main PID: 29343 (pcsd)
       CGroup: /system.slice/pcsd.service
               ├─29343 /bin/sh /usr/lib/pcsd/pcsd start
               ├─29347 /bin/bash -c ulimit -S -c 0 >/dev/null 2>&1 ; /usr/bin/ruby -I/usr/lib/pcsd /u...
               └─29348 /usr/bin/ruby -I/usr/lib/pcsd /usr/lib/pcsd/ssl.rb

    Apr 13 01:30:50 rh7z-mds1 systemd[1]: Starting PCS GUI and remote configuration interface...
    Apr 13 01:30:52 rh7z-mds1 systemd[1]: Started PCS GUI and remote configuration interface.
  3. Set up PCS authentication by executing the following command on just one of the cluster nodes:

    pcs cluster auth <node 1> <node 2> [...] -u hacluster

    For example:

    [root@rh7z-mds1 ~]# pcs cluster auth \
    > rh7z-mds1.lfs.intl rh7z-mds2.lfs.intl \
    > -u hacluster
    Password:
    rh7z-mds2.lfs.intl: Authorized
    rh7z-mds1.lfs.intl: Authorized

Create the Cluster Framework

The pcs cluster setup command syntax is comprehensive, but not all of the functionality is available for RHEL 6 clusters. For example, the syntax for configuring the redundant ring protocol (RRP) for Corosync communications has only recently been added to RHEL 6.

Unless otherwise stated, the commands in this section are executed on only one node in the cluster.

The command line syntax is:

pcs cluster setup [ --start ] --name <cluster name> \
    <node 1 specification> [ <node 2 specification> ] \
    [ --transport {udpu|udp} ] \
    [ --rrpmode {active|passive} ] \
    [ --addr0 <address> ] \
    [ --addr1 <address> ] \
    [ --mcast0 <address> ] [ --mcastport0 <port> ] \
    [ --mcast1 <address> ] [ --mcastport1 <port> ] \
    [ --token <timeout> ] [ --join <timeout> ] \
    [ ... ]

The node specification is a comma-separated list of hostnames or IP addresses for the host interfaces that will be used for Corosync’s communications. The cluster name is an arbitrary string; a default name is used if the --name option is omitted.

It is possible to create a cluster configuration that comprises a single node. Additional nodes can be added to the cluster configuration at any time after the initial cluster has been created. This can be particularly useful when conducting a major operating system upgrade or server migration, where new servers need to be commissioned and it is necessary to minimize the duration of any outages.

For example, upgrading from RHEL 6 to RHEL 7 usually requires installing the new OS from a clean baseline: there is no "in-place" upgrade path. One way to work around this limitation is to upgrade the nodes one at a time, creating a new framework on the first upgraded node, stopping the resources on the old cluster and recreating them on the new cluster, then rebuilding the second node (and possibly any additional nodes).

The minimum requirement for cluster network communications is a single interface in the cluster configuration, but further interfaces can be added in order to increase the robustness of the HA cluster’s inter-node messaging. Communications are organized into rings, with each ring representing a separate network. Corosync can support multiple rings using a feature called the Redundant Ring Protocol (RRP).

There are two transport types supported by the pcs command: udpu (UDP unicast) and udp (used for multicast). The udp transport is recommended, as it is more efficient. udpu, which is the default if no transport is specified†, should only be selected for circumstances where multicast cannot be used.

Note: The default transport may differ, depending on the tools used to create the cluster configuration. According to the corosync.conf man page, the default transport is udp. However, the pcs man page states that the default transport for RHEL 7 is udpu and the default for RHEL 6 is udp.

When using udpu (unicast), the Corosync communication rings are determined by the node specification, which is a comma-separated list of hostnames or IP addresses associated with the ring interfaces. For example:

pcs cluster setup --name demo node1-A,node1-B node2-A,node2-B

When the udp (multicast) transport is chosen, the communications rings are defined by listing the networks upon which the Corosync multicast traffic will be carried, along with an optional list of the multicast addresses and ports that will be used. The rings are specified using the --addr0 and --addr1 flags, for example:

pcs cluster setup --name demo node1-A node2-A \
    --transport udp \
    --addr0 10.70.0.0 --addr1 192.168.227.0

Use network addresses rather than host IP addresses for defining the interfaces, as this will allow a common Corosync configuration to be used across all cluster nodes. If host IP addresses are used, additional manual configuration of Corosync will be required on the cluster nodes. Using network addresses will simplify setup and maintenance.

Note: Corosync cannot parse network addresses supplied in CIDR (Classless Inter-Domain Routing) notation, that is, a network address followed by a slash and prefix length. Always use the full dotted-quad notation for specifying networks, as in the examples above.

The multicast addresses default to 239.255.1.1 for mcast0 and 239.255.2.1 for mcast1. The default multicast port is 5405 for both multicast rings.

Corosync actually uses two multicast ports for communication in each ring. Ports are assigned in receive / send pairs, but only the receive port number is specified when configuring the cluster. The send port is one less than the receive port number (i.e. mcastport - 1). Make sure that there is a gap of at least 1 between assigned ports for a given multicast address in a subnet. Also, if there are several HA clusters with Corosync rings on the same subnet, each cluster will require a unique multicast port pair (different clusters can use the same multicast address, but not the same multicast ports).

For example, if there are six OSSs configured into three HA pairs, plus an MDS pair, then each pair of servers will require a unique multicast port for each ring, and there must be a gap of at least one between the port numbers. So, a range such as 49152, 49154, 49156 and 49158 might be suitable. A range such as 49152, 49153, 49154 and 49155 is not valid, because there are no gaps between the numbers to accommodate the send ports.

The redundant ring protocol (RRP) mode is specified by the --rrpmode flag. Valid options are: none, active, and passive. If only one interface is defined, then none is automatically selected. If multiple rings are defined, either active or passive must be used.

When set to active, Corosync will send all messages across all interfaces simultaneously. Throughput is not as fast, but overall latency is improved, especially when communicating over faulty or unreliable networks.

The passive setting tells Corosync to use one interface, with the remaining interfaces available as standbys. If the interface fails, one of the standby interfaces will be used instead. This is also the default mode when creating an RRP configuration with pcs.

In theory, the active mode provides better reliability across multiple interfaces, while passive mode may be preferred when the messaging rate is more important. However, the corosync.conf manual page makes the choice clear and straightforward: only passive mode is supported and it is the only mode that receives testing.

The --token flag specifies the timeout in milliseconds after which a token is declared lost. The default is 1000 (1000ms, or 1 second). The value represents the overall length of time before a token is declared lost; any retransmits occur within this window.

On a Lustre server cluster, the default timeout is generally too short to accommodate variation in response when servers are under heavy load. An otherwise healthy server that is busy can take longer to pass the token to the next server in the ring compared to when the server is idle; if the timeout is too short, the cluster might declare the token lost. If there are too many lost tokens from one node, the cluster framework will consider the node dead.

It is recommended that the value of the token parameter be increased significantly from the default. 20000ms is a reasonable, conservative value, but users will want to experiment to find the optimal setting. If the cluster seems to fail over too frequently under load, but without any other symptoms, increase the token value as a first step to see if it alleviates the problem.

PCS Configuration Examples

The following example uses the simplest invocation to create a cluster framework configuration comprising two nodes. This example does not specify a transport, so the default of udpu will be chosen by PCS for cluster communications on RHEL 7, and udp will be chosen for RHEL 6:

pcs cluster setup --name demo-MDS \
    rh7z-mds1.lfs.intl rh7z-mds2.lfs.intl

The next example again uses udpu but incorporates a second, redundant ring for cluster communications:

pcs cluster setup --name demo-MDS-1-2 \
    rh7z-mds1.lfs.intl,192.168.227.11 \
    rh7z-mds2.lfs.intl,192.168.227.12

The hostname specification is comma-separated, and the node interfaces are specified in ring priority order. The first interface in the list will join ring0, the second interface will join ring1. In the above example, the ring0 interfaces correspond to the hostnames rh7z-mds1.lfs.intl and rh7z-mds2.lfs.intl, while the ring1 interfaces are 192.168.227.11 and 192.168.227.12 for node 1 and node 2 respectively. One could also add the IP addresses for ring1 into the hosts table or DNS if there is a preference to refer to the interfaces by name rather than by address.

The next example demonstrates the syntax for creating a two-node cluster with two Corosync communications rings using multicast:

pcs cluster setup --name demo-MDS-1-2 \
    rh7z-mds1.lfs.intl rh7z-mds2.lfs.intl \
    --transport udp \
    --rrpmode passive \
    --token 20000 \
    --addr0 10.70.0.0 \
    --addr1 192.168.227.0 \
    --mcast0 239.255.1.1 --mcastport0 49152 \
    --mcast1 239.255.2.1 --mcastport1 49152

This example uses the preferred syntax and configuration for a two-node HA cluster. The names, IP addresses, etc. will be different for each individual installation, but the structure is consistent and is a good template to copy.

Note: The above example will create different results when run on RHEL 6 versus RHEL 7. This is because RHEL 6 uses an additional package called CMAN, which assumes some of the responsibilities that on RHEL 7 are managed entirely by Corosync. Because of this difference, RHEL 6 clusters may behave a little differently to RHEL 7 clusters, even though the commands used to configure each might be identical.

Note: If there are any unexpected or unexplained side-effects when running with RHEL 6 clusters, try simplifying the configuration. For example, try changing the transport from multicast to the simpler unicast configuration, and use the comma-separated syntax to define the node addresses for RRP, rather than using the --addr0 and --addr1 flags.

Changing the Default Security Key

Changing the default key used by Corosync for communications is optional, but will improve the overall security of the cluster installation. The different operating system distributions and releases have different procedures for managing the cluster framework authentication key, so the following information is provided for information only. Refer to the OS vendor’s documentation for up to date instructions.

The default key can be changed by running the corosync-keygen command. The key will be written to the file /etc/corosync/authkey. Run the command on a single host in the cluster, then copy the resulting key to each node. The file must be owned by the root user and given read-only permissions. Example output follows:

[root@rh7z-mds1 ~]# corosync-keygen
Corosync Cluster Engine Authentication key generator.
Gathering 1024 bits for key from /dev/random.
Press keys on your keyboard to generate entropy.
Writing corosync key to /etc/corosync/authkey.

[root@rh7z-mds1 ~]# ll /etc/corosync/authkey
-r-------- 1 root root 128 Apr 13 23:48 /etc/corosync/authkey

Note: If the key is not the same for every node in the cluster, then they will not be able to communicate with each other to form a cluster. For hosts running Corosync version 2, creating the key and copying to all the nodes should be sufficient. For hosts running RHEL 6 with the CMAN software, the cluster framework also needs to be made aware of the new key:

ccs -f /etc/cluster/cluster.conf \
    --setcman keyfile="/etc/corosync/authkey"
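
A minimal sketch of copying the key to the second node of the example cluster, assuming root SSH access between the servers and the hostnames used earlier in this guide:

scp -p /etc/corosync/authkey rh7z-mds2.lfs.intl:/etc/corosync/authkey
ssh rh7z-mds2.lfs.intl "chown root:root /etc/corosync/authkey; chmod 400 /etc/corosync/authkey"

The scp -p flag preserves the read-only permissions created by corosync-keygen.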

Starting and Stopping the cluster framework

To start the cluster framework, issue the following command from one of the cluster nodes:

pcs cluster start [ <node> [<node> ...] | --all ]

To start the cluster framework on the current node only, run the pcs cluster start command without any additional options. To start the cluster on all nodes, supply the --all flag, and to limit the startup to a specific set of nodes, list them individually on the command line.

To shut down part or all of the cluster framework, issue the command:

pcs cluster stop [ <node> [<node> ...] | --all ]

The parameters for the pcs cluster stop command are the same as the parameters for pcs cluster start.

Do not configure the cluster software to run automatically on system boot. If an error occurs during the operation of the cluster and a node is isolated and powered off or rebooted as a consequence, it is imperative that the node be repaired, reviewed and restored to a healthy state before committing it back to the cluster framework. Until the root cause of the fault has been isolated and corrected, adding a node back into the framework may be dangerous and could put services and data at risk.

For this reason, ensure that the corosync and pacemaker services (and cman on RHEL 6) are disabled in the sysvinit or systemd boot sequences:

RHEL 7:

systemctl disable corosync.service
systemctl disable pacemaker.service

RHEL 6:

chkconfig cman off
chkconfig corosync off
chkconfig pacemaker off

However, it is safe to keep the PCS helper daemon, pcsd, enabled.

Set Global Cluster Properties

When the cluster framework has been created and is running on at least one of the nodes, set the following global defaults for properties and resources.

The no-quorum-policy property defines how the cluster will behave when there is a loss of quorum. For two-node HA clusters, this property should be set to ignore, which tells the cluster to keep running. When there are more than two nodes, set the value of the property to stop.

### For 2 node cluster:
### no_quorum_policy=ignore
### For > 2 node cluster:
### no_quorum_policy=stop
pcs property set no-quorum-policy=ignore

The stonith-enabled property tells the cluster whether or not there are fencing agents configured on the cluster. If set to true (strongly recommended and essential for any production deployment), the cluster will try to fence any nodes that are running resources that cannot be stopped. The cluster will also refuse to start any resources unless there is at least one STONITH resource configured.

The stonith-enabled property should only ever be set to false when the cluster will be used for demonstration purposes.

### values: true (default) or false
pcs property set stonith-enabled=true

When symmetric-cluster is set to true, this indicates that all of the nodes in the cluster have equivalent configurations and are equally capable of running any of the defined resources. For a simple two-node cluster with shared storage, as is commonly used for Lustre services, symmetric-cluster should nearly always be set to true.

### values: true (default) or false
pcs property set symmetric-cluster=true

resource-stickiness is a resource property that defines how much a resource prefers to stay on the node where it is currently running. The higher the value, the more sticky the resource, and the less likely it is to migrate automatically to its most preferred location if it is running, healthy, on a non-preferred / non-default node in the cluster. In effect, resource-stickiness governs automatic failback behaviour.

If a resource is running on a non-preferred node, and the resource is healthy, it will not be migrated automatically back to its preferred node. If the stickiness is higher than the preference score of a resource, the resource will not move automatically while the machine it is running on remains healthy.

The default value is 0 (zero). It's common to set the value greater than 100 as an indicator that the resource should not be disrupted by migrating it automatically if the resource and the node it is running on are both healthy.

pcs resource defaults resource-stickiness=200
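
The properties and resource defaults configured in this section can be reviewed at any time:

pcs property list
pcs resource defaults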

Verify cluster configuration and status

To view overall cluster status:

pcs status [ <options> | --help]

For example:

[root@rh7z-mds1 ~]# pcs status
Cluster name: demo-MDS-1-2
WARNING: no stonith devices and stonith-enabled is not false
Last updated: Thu Apr 14 00:58:29 2016
Last change: Wed Apr 13 21:16:13 2016 by hacluster via crmd on rh7z-mds1.lfs.intl
Stack: corosync
Current DC: rh7z-mds1.lfs.intl (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
2 nodes and 0 resources configured

Online: [ rh7z-mds1.lfs.intl rh7z-mds2.lfs.intl ]

Full list of resources:

PCSD Status:
  rh7z-mds1.lfs.intl: Online
  rh7z-mds2.lfs.intl: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

To review the cluster configuration:

pcs cluster cib

The output will be in the CIB XML format.

The Corosync run-time configuration can also be reviewed:

  • RHEL 7 / Corosync v2: corosync-cmapctl
  • RHEL 6 / Corosync v1: corosync-objctl

This can be very useful when verifying specific changes to the cluster communications configuration, such as the RRP setup. For example:

[root@rh7z-mds1 ~]# corosync-cmapctl | grep interface
totem.interface.0.bindnetaddr (str) = 10.70.0.0
totem.interface.0.mcastaddr (str) = 239.255.1.1
totem.interface.0.mcastport (u16) = 49152
totem.interface.1.bindnetaddr (str) = 192.168.227.0
totem.interface.1.mcastaddr (str) = 239.255.2.1
totem.interface.1.mcastport (u16) = 49152

To check the status of the Corosync rings:

[root@rh7z-mds1 ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
        id      = 10.70.227.11
        status  = ring 0 active with no faults
RING ID 1
        id      = 192.168.227.11
        status  = ring 1 active with no faults

To get the cluster status from CMAN on RHEL 6 clusters:

[root@rh6-mds1 ~]# cman_tool status
Version: 6.2.0
Config Version: 14
Cluster Name: demo-MDS-1-2
Cluster Id: 28594
Cluster Member: Yes
Cluster Generation: 24
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Node votes: 1
Quorum: 1
Active subsystems: 9
Flags: 2node
Ports Bound: 0
Node name: rh6-mds1.lfs.intl
Node ID: 1
Multicast addresses: 239.255.1.1 239.255.2.1
Node addresses: 10.70.206.11 192.168.206.11

If the cluster appears to start, but corosync reports errors related to the totem protocol in the syslog, then there may be a conflict in the multicast address configuration with another cluster or service on the same subnet. A typical error in the syslog would look similar to the following output:

Apr 13 22:11:15 rh67-pe corosync[26370]: [TOTEM ] Received message has invalid digest... ignoring.
Apr 13 22:11:15 rh67-pe corosync[26370]: [TOTEM ] Invalid packet data

These errors indicate that the node has intercepted traffic intended for a node on a different cluster.

Also be careful in the definition of the network and multicast addresses. pcs will often create the configuration without complaint, and the cluster framework may even load without reporting any errors to the command shell. However, a misconfiguration may lead to a failure in the RRP that is not immediately obvious. Look for unexpected information in the Corosync database and the cluster CIB.

For example, if one of the cluster node addresses shows up as a loopback address, this indicates a problem with the addresses supplied to pcs cluster setup with the --addr0 or --addr1 flags.

Next Steps

Figure 1. Lustre Server High-Availability Building Blocks
