Cluster has preemption enabled-juniper-junos

discobot · November 22, 2018, 1:42pm

Cluster has preemption enabled-juniper-junos

Vendor: juniper

OS: junos

Description:
Preemption is generally a bad idea in clustering, although sometimes it is the default setting. indeni will trigger an issue if it’s on.

Remediation Steps:
It is generally best to have preemption disabled. Instead, once this device returns from a crash, you can conduct the failover manually.
||1. Generally, it is recommended to have preemption disabled. Instead, once this device returns from a crash, you can conduct the failover manually.
|2. If preemption is added to a redundancy group configuration, the device with the high priority in the group can initiate a failover to become a master.
|3. On the device command line interface execute “request chassis cluster failover node” or “request chassis cluster failover redundancy-group” commands to override the priority setting and preemption.
|4. Review the following article on Juniper TechLibrary for more information: Operational Commands: request chassis cluster failover node

How does this work?
This script logs into the Juniper JUNOS-based device using SSH and retrieves the output of the “show chassis cluster status” command. The output includes the status of all redundancy groups across the cluster.

Why is this important?
Preemption is a function in clustering which sets a primary member of the cluster to always strive to be the active member. The trouble with this is that if the active member that is set with preemption on has a critical failure and reboots, the cluster will fail over to the secondary and then immediately fail over back to the primary when it completes the reboot. This can result in another crash and the process would happen again and again in a loop.

Without Indeni how would you find this?
The administrator has to run the “show chassis cluster status” on the device to find whether preemption is enabled and correctly configured if one of the nodes is expected to be always primary node.

junos-show-chassis-cluster-status

name: junos-show-chassis-cluster-status
description: JUNOS collect clustering status
type: monitoring
monitoring_interval: 1 minute
requires:
    vendor: juniper
    os.name: junos
    product: firewall
    high-availability: true
comments:
    cluster-member-active:
        why: |
            Tracking the state of a cluster member is important. If a cluster member which used to be the active member of the cluster no longer is, it may be the result of an issue. In some cases, it is due to maintenance work (and so was anticipated), but in others it may be due to a failure in the firewall or another component in the network.
        how: |
            This script logs into the Juniper JUNOS-based device using SSH and retrieves the output of the "show chassis cluster status" command. The output includes the status of all redundancy groups across the cluster.
        can-with-snmp: true
        can-with-syslog: true
    cluster-state:
        why: |
            Tracking the state of a cluster is important. If a cluster which used to be healthy no longer is, it may be the result of an issue. In some cases, it is due to maintenance work (and so was anticipated), but in others it may be due to a failure in the members of the cluster or another component in the network.
        how: |
            This script logs into the Juniper JUNOS-based device using SSH and retrieves the output of the "show chassis cluster status" command. The output includes the status of all redundancy groups across the cluster.
        can-with-snmp: true
        can-with-syslog: true
    cluster-preemption-enabled:
        why: |
            Preemption is a function in clustering which sets a primary member of the cluster to always strive to be the active member. The trouble with this is that if the active member that is set with preemption on has a critical failure and reboots, the cluster will fail over to the secondary and then immediately fail over back to the primary when it completes the reboot. This can result in another crash and the process would happen again and again in a loop.
        how: |
            This script logs into the Juniper JUNOS-based device using SSH and retrieves the output of the "show chassis cluster status" command. The output includes the status of all redundancy groups across the cluster.
        can-with-snmp: false
        can-with-syslog: false
steps:
-   run:
        type: SSH
        file: show-chassis-cluster-status.remote.1.bash
    parse:
        type: AWK
        file: show-chassis-cluster-status.parser.1.awk

cross_vendor_cluster_preempt

package com.indeni.server.rules.library.core
import com.indeni.apidata.time.TimeSpan
import com.indeni.ruleengine.expressions.conditions.Equals
import com.indeni.ruleengine.expressions.core.{ConstantExpression, StatusTreeExpression}
import com.indeni.ruleengine.expressions.data.{SelectTagsExpression, SelectTimeSeriesExpression, TimeSeriesExpression}
import com.indeni.server.common.data.conditions.True
import com.indeni.server.rules.library.{ConditionalRemediationSteps, PerDeviceRule, RuleHelper}
import com.indeni.server.rules.{DeviceCategory, DeviceKey, RemediationStepCondition, RuleCategory, RuleContext, RuleMetadata}
import com.indeni.server.sensor.models.managementprocess.alerts.dto.AlertSeverity

case class ClusterPreemptionEnabledRule() extends PerDeviceRule with RuleHelper {

  override val metadata: RuleMetadata = RuleMetadata.builder("cross_vendor_cluster_preempt", "Cluster has preemption enabled",
    "Preemption is generally a bad idea in clustering, although sometimes it is the default setting. indeni will trigger an issue if it's on.",
    AlertSeverity.WARN, categories = Set(RuleCategory.HighAvailability), deviceCategory = DeviceCategory.ClusteredDevices).interval(TimeSpan.fromMinutes(5)).build()


  override def expressionTree(context: RuleContext): StatusTreeExpression = {
    val inUseValue = TimeSeriesExpression[Double]("cluster-preemption-enabled").last

    StatusTreeExpression(
      // Which objects to pull (normally, devices)
      SelectTagsExpression(context.metaDao, Set(DeviceKey), True),

      StatusTreeExpression(
        // The time-series we check the test condition against:
        SelectTimeSeriesExpression[Double](context.tsDao, Set("cluster-preemption-enabled"), denseOnly = false),

        // The condition which, if true, we have an issue. Checked against the time-series we've collected
        Equals(
          inUseValue,
          ConstantExpression[Option[Double]](Some(1.0)))
      ).withoutInfo().asCondition()

      // Details of the alert itself
    ).withRootInfo(
      getHeadline(),
      ConstantExpression("This cluster member has preemption enabled. This means that it will have priority over other cluster members. If this device reboots or crashes, it'll try to assume priority in the cluster when it finishes its boot process. This may result in it crashing again, and causing a preemption loop."),
      ConditionalRemediationSteps("It is generally best to have preemption disabled. Instead, once this device returns from a crash, you can conduct the failover manually.",
        RemediationStepCondition.VENDOR_PANOS ->
          """|Palo Alto Networks firewalls have a special way of handling preemption loops, review the following article:
             |<a target="_blank" href="https://live.paloaltonetworks.com/t5/Learning-Articles/Understanding-Preemption-with-the-Configured-Device-Priority-in/ta-p/53398">Understanding Preemption with the Configured Device Priority in HA Active/Passive Mode</a>.""".stripMargin,
        RemediationStepCondition.VENDOR_CISCO ->
          """|FHRP preemption and delays features are not required. The vPC will forward traffic as soon as the links become available. Once a device recovers from a crash or reboot, you can conduct the failover manually.
             |Cisco recommends:
             |1. Configuring the FHRP with the default settings and without preempt when using vPC.
             |2. Make the vPC primary switch the FHRP active switch. This is not intended to improve performance or stability. It does make one switch responsible for the control plane traffic. This is a little easier on the administrator while troubleshooting.""".stripMargin,
        RemediationStepCondition.VENDOR_JUNIPER ->
          """1. Generally, it is recommended to have preemption disabled. Instead, once this device returns from a crash, you can conduct the failover manually.
            |2. If preemption is added to a redundancy group configuration, the device with the high priority in the group can initiate a failover to become a master.
            |3. On the device command line interface execute "request chassis cluster failover node"  or  "request chassis cluster failover redundancy-group"  commands to override the priority setting and preemption.
            |4. Review the following article on Juniper TechLibrary for more information: <a target="_blank" href="https://www.juniper.net/documentation/en_US/junos/topics/reference/command-summary/request-chassis-cluster-failover-node.html">Operational Commands: request chassis cluster failover node</a>""".stripMargin
      )
    )
  }
}