Cluster down-paloaltonetworks-panos

error
high-availability
panos
paloaltonetworks

Vendor: paloaltonetworks

OS: panos

Description:
Indeni will alert if a cluster is down or any of the members are inoperable.

Remediation Steps:
Review the cause for one or more members being down or inoperable.
Log into the device over SSH and run “less mp-log ha-agent.log” for more information.

How does this work?
This script uses the Palo Alto Networks API to retrieve the status of the high availability function of the cluster and specifically retrieves the local member’s and peer’s states.
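
As a rough illustration of that retrieval, here is a minimal Python sketch. It builds the same operational-command URL the script requests and reads the local and peer states from a reply. The host name, API key, and sample XML are placeholders invented for this example (not captured device output), and the helper names `build_ha_url` and `member_states` are hypothetical. Only the element paths come from the script.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Operational command taken from the script's REMOTE::HTTP section.
HA_CMD = "<show><high-availability><all></all></high-availability></show>"

# Hypothetical, heavily trimmed reply; a real reply carries many more fields.
SAMPLE = (
    '<response status="success"><result><group>'
    "<local-info><state>active</state></local-info>"
    "<peer-info><state>passive</state></peer-info>"
    "</group></result></response>"
)


def build_ha_url(host, api_key):
    """Return the URL for the HA status query (no request is sent here)."""
    query = urlencode({"type": "op", "cmd": HA_CMD, "key": api_key})
    return f"https://{host}/api?{query}"


def member_states(xml_text):
    """Return (local_state, peer_state) from /response/result/group."""
    root = ET.fromstring(xml_text)  # root is the <response> element
    local = root.findtext("result/group/local-info/state", default="")
    peer = root.findtext("result/group/peer-info/state", default="")
    return local, peer


if __name__ == "__main__":
    print(build_ha_url("fw.example.com", "PLACEHOLDER-KEY"))
    print(member_states(SAMPLE))
```

The sketch deliberately stops short of sending the request, so it can be run without a firewall. In practice the URL would be fetched over HTTPS with a valid API key.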

Why is this important?
Tracking the state of a cluster is important. If a cluster which used to be healthy no longer is, it may be the result of an issue. In some cases, it is due to maintenance work (and so was anticipated), but in others it may be due to a failure in the members of the cluster or another component in the network.

Without Indeni how would you find this?
The status of high availability is visible in the web interface, as a widget on the main screen.

panos-show-high-availability-all-monitoring

#! META
name: panos-show-high-availability-all-monitoring
description: Track health of HA
type: monitoring
monitoring_interval: 5 minute
requires:
    vendor: paloaltonetworks
    os.name: panos
    "high-availability": true

#! COMMENTS
cluster-member-active:
    why: |
        Tracking the state of a cluster member is important. If a cluster member which used to be the active member of the cluster no longer is, it may be the result of an issue. In some cases, it is due to maintenance work (and so was anticipated), but in others it may be due to a failure in the firewall or another component in the network.
    how: |
        This script uses the Palo Alto Networks API to retrieve the status of the high availability function of the firewall and specifically retrieves the local member's state.
    without-indeni: |
        The status of high availability is visible in the web interface, as a widget on the main screen.
    can-with-snmp: true
    can-with-syslog: true
cluster-state:
    why: |
        Tracking the state of a cluster is important. If a cluster which used to be healthy no longer is, it may be the result of an issue. In some cases, it is due to maintenance work (and so was anticipated), but in others it may be due to a failure in the members of the cluster or another component in the network.
    how: |
        This script uses the Palo Alto Networks API to retrieve the status of the high availability function of the cluster and specifically retrieves the local member's and peer's states.
    without-indeni: |
        The status of high availability is visible in the web interface, as a widget on the main screen.
    can-with-snmp: true
    can-with-syslog: true
cluster-preemption-enabled:
    why: |
        Preemption is a clustering function that makes the primary member of the cluster always strive to be the active member. The trouble is that if the active member with preemption enabled has a critical failure and reboots, the cluster fails over to the secondary and then immediately fails back to the primary when it completes the reboot. This can result in another crash, and the process can then repeat in a loop. Palo Alto Networks firewalls have a means of dealing with this (https://live.paloaltonetworks.com/t5/Learning-Articles/Understanding-Preemption-with-the-Configured-Device-Priority-in/ta-p/53398) but it is generally a good idea not to have the preemption feature enabled.
    how: |
        This script uses the Palo Alto Networks API to retrieve the status of the high availability function of this cluster member and specifically the preemption setting.
    without-indeni: |
        Going into a preemption loop is difficult to detect. Normally an administrator will notice service disruption. Then through manual inspection the administrator will determine there is a preemption loop.
    can-with-snmp: true
    can-with-syslog: true
cluster-config-synced:
    why: |
        Normally two Palo Alto Networks firewalls in a cluster work together to ensure their configurations are synchronized. Sometimes, due to connectivity or other issues, the configuration sync may be lost. In the event of a failover, the secondary member will take over but will be running a different configuration from the primary (the original active member). This can result in service disruption.
    how: |
        This script uses the Palo Alto Networks API to retrieve the status of the high availability function of this cluster and specifically the status of the config synchronization.
    without-indeni: |
        The status of configuration sync is visible in the web interface, as a widget on the main screen.
    can-with-snmp: true
    can-with-syslog: true
device-is-passive:
    why: |
        This metric describes whether this device is the passive member of the cluster. For a passive device, port-down alerts should not be triggered.
    how: |
        This script uses the Palo Alto Networks API to retrieve the active/passive state of the device.
    without-indeni: |
        The active/passive status is visible in the web interface.
    can-with-snmp: true
    can-with-syslog: true
passive-link-state:
    why: |
        This metric describes whether passive-link-state is set to shutdown or auto. The value is 0 when it is set to shutdown and 1 otherwise. When it is set to shutdown, ports in a power-down state are expected behavior, so this metric can be used to avoid triggering alerts for them.
    how: |
        This script uses the Palo Alto Networks API to retrieve the passive-link-state setting of the device.
    without-indeni: |
        The passive-link-state setting can be found via the web interface or the CLI.
    can-with-snmp: true
    can-with-syslog: true

#! REMOTE::HTTP
url: /api?type=op&cmd=<show><high-availability><all></all></high-availability></show>&key=${api-key}
protocol: HTTPS

#! PARSER::XML
_vars:
    root: /response/result
_metrics:
    -
        _tags:
            "im.name":
                _constant: "cluster-member-active"
            "name":
                _constant: "Firewall Clustering"
            "live-config":
                _constant: "true"
            "display-name":
                _constant: "Cluster Member State (this)"
            "im.dstype.displayType":
                _constant: "state"
        _temp:
            state:
                _text: "${root}/group/local-info/state"
        _transform:
            _value.double: |
                {
                    if (temp("state") ~ /^(active|active-primary|active-secondary)/) {
                        print "1"
                    } else {
                        print "0"
                    }
                }
    -
        _tags:
            "im.name":
                _constant: "device-is-passive"
        _temp:
            state:
                _text: "${root}/group/local-info/state"
        _transform:
            _value.double: |
                {
                    if (temp("state") ~ /^(active|active-primary|active-secondary)/) {
                        print "0"
                    } else {
                        print "1"
                    }
                }
    -
        _tags:
            "im.name":
                _constant: "passive-link-state"
        _temp:
            "passivelinkstate":
                _count: "${root}/group/local-info/active-passive/passive-link-state[. = 'shutdown']"
        _transform:
            _value.double: |
                {
                    if (temp("passivelinkstate") > 0) {
                        print "0"
                    } else {
                        print "1"
                    }
                }
    -
        _tags:
            "im.name":
                _constant: "cluster-state"
            "name":
                _constant: "Firewall Clustering"
            "live-config":
                _constant: "true"
            "display-name":
                _constant: "Cluster State"
            "im.dstype.displayType":
                _constant: "state"
        _temp:
            localstate:
                _text: "${root}/group/local-info/state"
            peerstate:
                _text: "${root}/group/peer-info/state"
        _transform:
            _value.double: |
                {
                    if (temp("localstate") != "down" && temp("peerstate") != "down" && temp("peerstate") != "unknown" && temp("peerstate") != "suspended") {
                        print "1"
                    } else {
                        print "0"
                    }
                }
    -
        _tags:
            "im.name":
                _constant: "cluster-config-synced"
            "live-config":
                _constant: "true"
            "display-name":
                _constant: "Cluster Configuration Synced"
            "im.dstype.displayType":
                _constant: "boolean"
        _temp:
            runningsync:
                _text: "${root}/group/running-sync"
        _transform:
            _value.double: |
                {
                    if (temp("runningsync") == "synchronized") {
                        print "1"
                    } else {
                        print "0"
                    }
                }
    -
        _tags:
            "im.name":
                _constant: "cluster-preemption-enabled"
            "live-config":
                _constant: "true"
            "display-name":
                _constant: "Preemption Enabled"
            "im.dstype.displayType":
                _constant: "boolean"
        _temp:
            preemptive:
                _text: "${root}/group/local-info/preemptive"
        _transform:
            _value.double: |
                {
                    if (temp("preemptive") == "yes") {
                        print "1"
                    } else {
                        print "0"
                    }
                }
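
To make the parser's decisions easier to follow, here is a minimal Python sketch that mirrors the six transforms above on a sample reply. It is an illustration only, not how Indeni evaluates the script. The function name `ha_metrics` and the sample XML are assumptions made for this example, while the element paths, compared strings, and 0/1 values are copied from the parser section.

```python
import re
import xml.etree.ElementTree as ET

# Same pattern the parser uses to decide that the local member is active.
ACTIVE_RE = re.compile(r"^(active|active-primary|active-secondary)")

# Hypothetical, trimmed reply for a passive member (not captured device output).
SAMPLE_PASSIVE = (
    '<response status="success"><result><group>'
    "<running-sync>synchronized</running-sync>"
    "<local-info><state>passive</state><preemptive>no</preemptive>"
    "<active-passive><passive-link-state>shutdown</passive-link-state>"
    "</active-passive></local-info>"
    "<peer-info><state>active</state></peer-info>"
    "</group></result></response>"
)


def ha_metrics(xml_text):
    """Return the six metric values (1.0 or 0.0) keyed by im.name."""
    result = ET.fromstring(xml_text).find("result")

    def text(path):
        # A missing element yields "", like an empty temp value.
        return result.findtext(path, default="") if result is not None else ""

    local = text("group/local-info/state")
    peer = text("group/peer-info/state")
    link_path = "group/local-info/active-passive/passive-link-state"
    links = result.findall(link_path) if result is not None else []
    shutdown_count = sum(1 for e in links if e.text == "shutdown")

    active = ACTIVE_RE.match(local) is not None
    cluster_up = local != "down" and peer not in ("down", "unknown", "suspended")
    return {
        "cluster-member-active": 1.0 if active else 0.0,
        "device-is-passive": 0.0 if active else 1.0,
        "passive-link-state": 0.0 if shutdown_count > 0 else 1.0,
        "cluster-state": 1.0 if cluster_up else 0.0,
        "cluster-config-synced": 1.0 if text("group/running-sync") == "synchronized" else 0.0,
        "cluster-preemption-enabled": 1.0 if text("group/local-info/preemptive") == "yes" else 0.0,
    }


if __name__ == "__main__":
    for name, value in ha_metrics(SAMPLE_PASSIVE).items():
        print(name, value)
```

Note that, as in the parser, device-is-passive is 1 for any local state that does not start with "active", and a missing peer state does not by itself mark the cluster as down.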

cross_vendor_cluster_down_novsx

package com.indeni.server.rules.library.templatebased.crossvendor

import com.indeni.ruleengine.expressions.conditions.{Equals => RuleEquals, Not => RuleNot, Or => RuleOr}
import com.indeni.server.common.data.conditions.{Equals => DataEquals, Not => DataNot}
import com.indeni.server.rules.RuleContext
import com.indeni.server.rules.library._
import com.indeni.server.rules.library.templates.StateDownTemplateRule

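/**
  * Alerts when the "cluster-state" metric reports a cluster as down on devices
  * that are not tagged as VSX, and adds vendor-specific remediation steps.
  */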
case class cross_vendor_cluster_down_novsx() extends StateDownTemplateRule(
  ruleName = "cross_vendor_cluster_down_novsx",
  ruleFriendlyName = "Clustered Devices (Non-VS): Cluster down",
  ruleDescription = "Indeni will alert if a cluster is down or any of the members are inoperable.",
  metricName = "cluster-state",
  applicableMetricTag = "name",
  metaCondition = !DataEquals("vsx", "true"),
  alertItemsHeader = "Clustering Elements Affected",
  alertDescription = "One or more clustering elements in this device are down. This alert was added per the request of <a target=\"_blank\" href=\"http://il.linkedin.com/pub/gal-vitenberg/83/484/103\">Gal Vitenberg</a>.",
  baseRemediationText = "Review the cause for one or more members being down or inoperable.")(
  ConditionalRemediationSteps.VENDOR_CP -> "Review other alerts for a cause for the cluster failure.",
  ConditionalRemediationSteps.VENDOR_PANOS -> "Log into the device over SSH and run \"less mp-log ha-agent.log\" for more information.",
  ConditionalRemediationSteps.OS_NXOS ->
    """|
      |1. Verify the communication between the FHRP peers. A random, momentary loss of data communication between the peers is the most common problem that results in continuous FHRP state changes (ACT <-> STB), unless this error message occurs during the initial installation.
      |2. Check the CPU utilization by using the "show processes cpu" NX-OS command. FHRP state changes are often due to high CPU utilization.
      |3. Common causes to investigate for the loss of FHRP packets between the peers are physical layer problems and excessive network traffic caused by spanning tree issues or by individual VLANs.
      |
      |In the case of a vPC problem, validate the following:
      |1. Check that STP bridge assurance is not enabled on the vPC links. Bridge assurance should only be enabled on the vPC peer link.
      |2. Compare the vPC domain IDs of the two switches and ensure that they match. Execute the "show vpc brief" NX-OS command on both vPC peer switches and compare the output, which should match.
      |3. Verify that both the source and destination IP addresses used for the peer-keepalive messages are reachable from the VRF associated with the vPC peer-keepalive link.
      |Then, execute the "sh vpc peer-keepalive" NX-OS command and review the output from both switches.
      |4. Verify that the peer-keepalive link is up. Otherwise, the vPC peer link will not come up.
      |5. Review the vPC peer link configuration: execute the "sh vpc brief" NX-OS command and review the output. Also verify that the vPC peer link is configured as a Layer 2 port channel trunk that allows only vPC VLANs.
      |6. Ensure that type 1 consistency parameters match. If they do not match, the vPC is suspended. Type 2 items do not have to match on both Nexus switches for the vPC to be operational. Execute the "sh vpc consistency-parameters" command and review the output.
      |7. Verify that the vPC number that you assigned to the port channel that connects to the downstream device from the vPC peer device is identical on both vPC peer devices.
      |8. If you manually configured the system priority, verify that you assigned the same priority value on both vPC peer devices.
      |9. Verify that the primary vPC is the primary STP root and the secondary vPC is the secondary STP root.
      |10. Review the logs for relevant findings.
      |11. For more information, review the following vPC troubleshooting guide:
      |https://www.cisco.com/c/en/us/td/docs/switches/datacenter/nexus5000/sw/troubleshooting/guide/N5K_Troubleshooting_Guide/n5K_ts_vpc.html""".stripMargin
)
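
The rule's effect can be sketched in a few lines of Python. This is only an approximation based on what the rule declares: it assumes the template alerts when the cluster-state metric is 0, that the rule is skipped for devices tagged vsx=true, and that the PAN-OS step is appended to the base remediation text when the device's os.name tag is panos. The function name `cluster_down_alert` and the returned dictionary shape are hypothetical and are not part of the Indeni rule engine.

```python
BASE_REMEDIATION = "Review the cause for one or more members being down or inoperable."
PANOS_STEP = (
    'Log into the device over SSH and run "less mp-log ha-agent.log" '
    "for more information."
)


def cluster_down_alert(cluster_state, device_tags):
    """Return an alert dict, or None when no alert applies.

    cluster_state: latest "cluster-state" value (1 = up, 0 = down).
    device_tags: device tags such as {"os.name": "panos", "vsx": "false"}.
    """
    # Assumption: mirrors metaCondition = !DataEquals("vsx", "true").
    if device_tags.get("vsx") == "true":
        return None
    # Assumption: the state-down template fires when the metric is 0.
    if cluster_state != 0:
        return None
    steps = [BASE_REMEDIATION]
    # Assumption: the PAN-OS conditional step is selected by the os.name tag.
    if device_tags.get("os.name") == "panos":
        steps.append(PANOS_STEP)
    return {
        "headline": "Clustered Devices (Non-VS): Cluster down",
        "remediation": "\n".join(steps),
    }


if __name__ == "__main__":
    alert = cluster_down_alert(0.0, {"vendor": "paloaltonetworks", "os.name": "panos"})
    print(alert["headline"])
    print(alert["remediation"])
```

For a PAN-OS device this yields the two remediation lines shown in the Remediation Steps section of this post.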