Thoughts on Design and Behavior of a Weak Membership Model

From Genunix

Authors

anawithacirclearoundit hamburg notacolon de)

Introduction

This page/document contains a collection of ideas about how a weak membership model in a cluster framework like OHAC might be designed and how it might behave. Many of these thoughts are taken from or inspired by the Project Colorado Requirements documents (http://opensolaris.org/os/project/colorado/Requirements/). Though it is hoped that this document may contribute to the success of Project Colorado, it is for discussion only and without any obligation to the implementation of Colorado.

Fundamental differences between strong membership and weak membership

Strong membership

  • guarantees that whenever a cluster gets partitioned (split-brain: some nodes cannot communicate via the cluster interconnect) only one partition of the cluster will survive and all other nodes will leave the cluster immediately
  • as a consequence, strong membership guarantees that only one instance of any failover resource (application, storage, network address) may run / be accessed at any time.
    We will call the requirement on this behavior the single instance requirement and applications with such a requirement single instance applications.

Weak membership in contrast

  • allows for more than one partition of a cluster to exist in some situations
  • and as a consequence the cluster framework may allow for a failover resource (application, storage, network address) to be started / accessed on/from several cluster nodes simultaneously
    We will call applications which are robust or simple enough to tolerate such situations multi instance tolerant applications.

Weak membership for single instance applications

Many applications have a single instance requirement as starting them more than once or accessing the same storage resource simultaneously and concurrently will have disastrous consequences.

Such applications require a cluster configuration in which, even under a weak membership model, it is at least unlikely that multiple instances of the application will ever run. Running such applications under a weak membership model may require operator intervention and decision making; for example, the operator may be asked to give certain guarantees such as "all other nodes are powered off".

When designing a weak membership cluster for single instance applications, the specific properties of the weak membership model need to be carefully inspected and understood properly. When in doubt, the weak membership model should not be used and strong membership should be preferred.

Weak membership for multi instance tolerant applications

The weak membership model is particularly suitable for applications which do not need to save persistent state information or for which the loss of such state information is not critical. Such applications may access data (mainly) read-only (like a web server for static content or a directory server) or, apart from configuration information, not access any data at all (like a load balancing or compute farm application).

For such applications, the cluster framework can be allowed to start an instance even if it cannot guarantee that it will be the only instance running.

Where is the data

Access to data for a multi instance tolerant application may either be provided

  • internal to the cluster hosting the application or
  • external to the cluster, for example via file oriented access like NFS or CIFS, via block oriented access like iSCSI or as a database service

Storage internal to the cluster

If a multi instance tolerant application does need to modify persistent state (information on storage), an originally single instance of the state information (data) may have to be split into multiple, diverging copies when a cluster gets partitioned.

When reforming the cluster, a conflict may arise as the possibly diverged copies of the data must somehow be transformed into a single instance. A simple approach to resolve this conflict is to choose one instance of the data, losing all changes applied to the other copies. More sophisticated ways to merge diverged data sets may be possible depending on the application.

Storage external to the cluster

When storage external to the cluster is used to store state information of a multi instance tolerant application, the application should ensure that simultaneous access is arbitrated amongst the instances of the application, even if they cannot communicate amongst each other but only with the storage device. Such arbitration may be implemented using file locking protocols or using a database access protocol.
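
As an illustration of arbitration via file locking, the following minimal sketch lets each application instance try to take an exclusive advisory lock on a lock file kept on the external storage before it touches the state information. It assumes that the storage is exposed as a file system whose advisory locks are honoured across all clients, which depends on the access protocol and server. The function names and the lock file are hypothetical.

```python
import errno
import fcntl
import os


def try_acquire(lock_path):
    """Try to take an exclusive advisory lock on lock_path.

    Returns the open file descriptor on success, or None if another
    holder already has the lock. Never blocks.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        os.close(fd)
        if exc.errno in (errno.EAGAIN, errno.EACCES):
            return None  # another instance holds the lock
        raise
    return fd


def release(fd):
    """Drop the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

An instance that gets None back would refrain from modifying the state information and retry later. Because the lock lives on the storage side, this works even when the instances cannot reach each other over the cluster interconnect.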

Contrasting multi instance tolerant applications versus scalable applications

A scalable application is, by design, started multiple times in a cluster. A multi instance tolerant application, in contrast, is not meant to run with more than one instance at a time, but can tolerate doing so.

The cluster gives a scalable application the means to access a single instance of shared data (e.g. a file system or database service) simultaneously from multiple application instances. Under a weak membership model, providing such a single instance of shared data may not be possible.

Design options for a weak membership model of OHAC

The basis of a weak membership model is dynamic cluster membership state information, maintained by the Cluster Membership Monitor (CMM). While for the strong membership model, uncertainty about another node's state is minimized (for instance by the quorum condition, failure fencing and failfast), with a weak membership model, uncertainty about the state of other nodes will last for longer periods in a cluster partitioning situation (split brain).

It might be advantageous to introduce more detailed state information to complement the node states currently maintained by CMM. For instance, a test might determine that a node is most likely not dead (e.g. S_NOTDEAD), which is more information than S_UNKNOWN provides. In the following sections, previously undefined state names will be used freely to illustrate the ideas.

Determining the node state during partitioning

While with strong membership and the quorum condition, the decision problem about a node's membership can be resolved quickly, additional means to inquire about node state may be needed with a weak membership model.

The Colorado HACI requirements document (http://opensolaris.org/os/project/colorado/Requirements/colorado-haci.pdf) introduces a connectivity check to guarantee that a node will transition to S_DEAD if a target IP address cannot be pinged.

Other options may allow a node to obtain information about other nodes' states with varying degrees of certainty:

Tests allowing information gain S_UNKNOWN -> S_NOTDEAD

  • Public address ping: If a public network address (e.g. IP & ARP address) known to be used only by one specific cluster node is determined to be still reachable, the node will most likely not be dead
  • Mailbox disks: For clusters using shared storage, devices like FC or iSCSI LUNs could be used as mailbox disks to which each node of the cluster guarantees to write a timestamp with a minimum frequency.
    If cluster time is guaranteed to diverge only within a well defined limit, a node can assume another node to be S_NOTDEAD if this node's mailbox disk timestamps are still current.
    The reverse is not guaranteed to be true, so outdated timestamps on mailbox disks do not mean another node is dead.
  • LOM agent: Cluster nodes could query out-of-band / lights out management facilities about a node's state. A LOM agent may be used to set a node to S_NOTDEAD if the LOM reports it to be powered on.
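
The mailbox disk test could be sketched as follows. This is only an illustration: the function name, the parameters and the use of plain seconds are assumptions, and reading the timestamp from the shared device is left out. The important property is that an outdated timestamp never yields anything stronger than S_UNKNOWN.

```python
def mailbox_state(last_stamp, now, write_interval, max_clock_skew):
    """Classify a peer node from its latest mailbox disk timestamp.

    last_stamp     -- timestamp the peer wrote last (seconds)
    now            -- local time of the checking node (seconds)
    write_interval -- longest gap between writes the peer guarantees
    max_clock_skew -- guaranteed limit of cluster time divergence
    """
    age = now - last_stamp
    # A current timestamp (allowing for clock skew in both directions)
    # shows that the peer was writing recently.
    if -max_clock_skew <= age <= write_interval + max_clock_skew:
        return "S_NOTDEAD"
    # An outdated timestamp proves nothing: the peer may merely have
    # lost access to the mailbox disk.
    return "S_UNKNOWN"
```

For example, with a guaranteed write interval of 10 seconds and a clock skew limit of 2 seconds, a timestamp that is 5 seconds old gives S_NOTDEAD, while one that is 100 seconds old leaves the node at S_UNKNOWN.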

Tests allowing information gain S_UNKNOWN -> S_DEAD

  • LOM agent: Cluster nodes could query out-of-band / lights out management facilities about a node's state. A LOM agent may be used to set a node to S_DEAD if the LOM reports it to be powered off.
  • Admin intervention agent: An administrator could manually declare a node dead

Other ideas

  • A simple Solaris agent could be used to report to other cluster nodes whether or not a cluster framework is running on a node. This way, cluster nodes could determine a node to not participate in the cluster (e.g. S_NOCLUSTER) though it is S_NOTDEAD.
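
Since several of the tests above may run independently, their results have to be combined into one state per node. The following sketch shows one possible way of doing so. The precedence rules are assumptions made for illustration only, in particular the conservative choice to treat contradictory results (one test reporting S_DEAD, another S_NOTDEAD) as S_NOTDEAD.

```python
from enum import Enum


class NodeState(Enum):
    S_UNKNOWN = "S_UNKNOWN"
    S_NOTDEAD = "S_NOTDEAD"
    S_NOCLUSTER = "S_NOCLUSTER"
    S_DEAD = "S_DEAD"


def combine(results):
    """Combine the results of independent tests about one node.

    results -- iterable of NodeState values, one per test
    """
    found = set(results)
    if NodeState.S_DEAD in found and NodeState.S_NOTDEAD in found:
        # Assumption: on contradictory evidence, err on the side of
        # "the node may still be alive".
        return NodeState.S_NOTDEAD
    if NodeState.S_DEAD in found:
        return NodeState.S_DEAD
    if NodeState.S_NOCLUSTER in found:
        # More specific than S_NOTDEAD: alive, but no cluster framework.
        return NodeState.S_NOCLUSTER
    if NodeState.S_NOTDEAD in found:
        return NodeState.S_NOTDEAD
    return NodeState.S_UNKNOWN
```

A node for which no test gives any information stays S_UNKNOWN, and a successful public address ping alone lifts it to S_NOTDEAD.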

Failover behavior during partitioning

Depending on whether or not an application is multi instance tolerant and weighing availability requirements against data integrity requirements, an administrator may wish to configure whether or not applications should be started within a cluster partition while a split brain exists.

Partitioning start behavior

For resources or resource groups, a property could be defined to choose the desired behavior. This property, which could be called partitioning start behavior, could have values similar to the following:

  • ALWAYS: Always (try to) bring up this resource in the partition of the current node, no matter what state is known about potential masters outside the partition.
  • INDOUBT: (try to) bring up this resource in the partition of the current node only if no potential master outside the partition is known to be running (= no potential master outside the partition is S_NOTDEAD).
  • SAFE: Only bring up this resource if the cluster framework can assure that no other potential master is running it (note: the nodelist is a resource group property, so only the state of those nodes would have to be checked)
    The check would have to ensure that all other potential masters are either members of the same partition as the node doing the check, are really down (S_DEAD, S_DOWN etc.) or not running the cluster framework software (S_NOCLUSTER).
    For nodes which the cluster framework cannot determine to be dead, the administrator could ensure them to be dead (e.g. by powering off) and use the admin intervention agent to inform the cluster framework about the state change.
  • SAFE_STONITH (could be the default as it is similar to current behavior with strong membership): Same as SAFE, but try to shut down any node which is not known to be down.
    This behavior is similar to that implemented by strong membership, because the quorum condition will ensure that any node not being a member of the surviving partition will shut itself down and/or be fenced off.
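
The four values could translate into a start decision roughly as sketched below. This is not an existing OHAC interface: the function, its arguments and the stonith callback are hypothetical, and the caller is assumed to pass only the states of potential masters outside the current partition.

```python
# States in which a node is certain not to run the resource.
DOWN_STATES = {"S_DEAD", "S_DOWN", "S_NOCLUSTER"}


def may_start(behavior, outside_states, stonith=None):
    """Decide whether to bring up a resource in the current partition.

    behavior       -- "ALWAYS", "INDOUBT", "SAFE" or "SAFE_STONITH"
    outside_states -- dict: potential master outside the partition -> state
    stonith        -- optional callable(node) -> True if node was shut down
    """
    states = dict(outside_states)  # do not modify the caller's mapping
    if behavior == "ALWAYS":
        return True
    if behavior == "INDOUBT":
        # Start unless some outside master is known to be running.
        return all(state != "S_NOTDEAD" for state in states.values())
    if behavior not in ("SAFE", "SAFE_STONITH"):
        raise ValueError("unknown partitioning start behavior: %r" % (behavior,))
    if behavior == "SAFE_STONITH" and stonith is not None:
        # Try to shut down every node which is not known to be down.
        for node, state in list(states.items()):
            if state not in DOWN_STATES and stonith(node):
                states[node] = "S_DEAD"
    # SAFE: every outside master must be known to be down.
    return all(state in DOWN_STATES for state in states.values())
```

With one outside master in state S_UNKNOWN, INDOUBT would start the resource and SAFE would not. SAFE_STONITH would start it only if the shutdown attempt succeeds.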

STONITH methods

STONITH (shoot the other node in the head) methods could comprise the following:

  • IOFENCE: Use volume manager techniques (SCSI reservations) to ensure that no other node can access shared storage needed by the resource.
    This might also be an implicit behavior of other partitioning failover modes if volume managers on shared storage are used.
  • LOM agent: Use out-of-band management facilities to shut off the other node

Defined node to run service during partitioning

Additionally or alternatively, one of the possible masters of a resource or device group could be declared the only one to host the service/application in a partitioning scenario.

One option is something like a “freeze mode” where only the node which was running the resource before partitioning continues hosting it.

Another option would be to define one of the possible masters to take over the service in a partitioning situation, but it would be problematic to decide when exactly this node should start the service as it cannot know when the previous master has finished shutting down the service if the previous master is not in the same partition.

Rejoining

When partitions rejoin after the reasons for split-brain are resolved, they might even be running the same resources and might have diverged copies of the cluster configuration.

Several conflicts will need to be resolved:

  • The cluster configuration will need to be merged
    • One particular case of a CCR conflict exists when multiple instances of the same failover resource are running. It needs to be determined which of those should continue running, which should be shut down and which should possibly be restarted.

A rejoin cannot be completed until those conflicts have been resolved (see also section 4.6 of the HACI requirements document). The cluster framework could assist the administrator in resolving the conflicts by:

  • creating a detailed report about existing conflicts
  • optionally complemented by information about steps to be taken or commands to be executed to resolve the conflicts
  • optionally complemented by automation of routine merge jobs like choosing from a set of diverged copies of previously identical data the copy to survive
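
As a small illustration of the report idea, the sketch below finds failover resources which are online in more than one of the rejoining partitions. The data layout (partition names mapped to sets of resource names) is an assumption made for the example.

```python
def duplicate_resources(partitions):
    """Find failover resources running in more than one partition.

    partitions -- dict: partition name -> set of online resource names
    Returns a dict: resource name -> sorted list of partitions running it.
    """
    running = {}
    for name, resources in partitions.items():
        for resource in resources:
            running.setdefault(resource, []).append(name)
    return {
        resource: sorted(names)
        for resource, names in running.items()
        if len(names) > 1
    }
```

For two partitions A and B where A runs "web" and "db" and B runs "web", the report would list "web" as running in both A and B. The decision which instance continues running remains with the administrator or a configured policy.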

Amnesia with weak membership

The implementation of the quorum condition (using persistent SCSI reservations or a quorum server) makes amnesia unlikely under the strong membership model. Under the weak membership model, amnesia is more likely because a node separated from any possibly running partition will always come up with its own, potentially outdated CCR copy.

Conflict resolution for rejoining a cluster with amnesia might need special consideration like reporting on the date/time a part of the configuration was last changed.
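
Such a report could be derived as in the following sketch, which compares the last-changed times of configuration parts across nodes. The data layout and the notion of a named "part" of the configuration are hypothetical. Note that the newest copy is not necessarily the correct one if the copies have diverged, so the result is meant as input for the administrator, not as an automatic decision.

```python
def stale_parts(copies):
    """Report configuration parts whose copies differ in age.

    copies -- dict: node -> dict: part name -> last-changed time
    Returns a dict: part name -> {"newest": [nodes], "older": [nodes]},
    containing only parts for which at least one node holds an older copy.
    """
    report = {}
    parts = {part for copy in copies.values() for part in copy}
    for part in sorted(parts):
        stamps = {node: copy[part] for node, copy in copies.items() if part in copy}
        newest = max(stamps.values())
        older = sorted(node for node, stamp in stamps.items() if stamp < newest)
        if older:
            report[part] = {
                "newest": sorted(n for n, s in stamps.items() if s == newest),
                "older": older,
            }
    return report
```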
