Indeni will alert if a cluster is down or any of the members are inoperable.
Review the cause for one or more members being down or inoperable.
How does this work?
This alert logs into the F5 device through SSH to verify that each traffic group has an active member.
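The check described above can be sketched in a few lines. This is only an illustration, not Indeni's implementation: the function name `groups_without_active_member` and the input layout (traffic group, then member, then failover state) are assumptions made for the example, and the data is hypothetical rather than real `tmsh` output.

```python
from typing import Dict, List


def groups_without_active_member(groups: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the traffic groups in which no member reports the 'active' state."""
    missing = []
    for group, members in groups.items():
        # Compare case-insensitively so "Active" and "active" are treated alike.
        states = {state.lower() for state in members.values()}
        if "active" not in states:
            missing.append(group)
    return sorted(missing)


# Hypothetical, already-parsed data: traffic group -> member -> failover state.
example = {
    "traffic-group-1": {"bigip-a.example.com": "active", "bigip-b.example.com": "standby"},
    "traffic-group-2": {"bigip-a.example.com": "standby", "bigip-b.example.com": "standby"},
}

print(groups_without_active_member(example))  # prints ['traffic-group-2']
```

A non-empty result corresponds to the condition this alert reports: at least one traffic group has no active member.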
Why is this important?
Tracking the state of a cluster is important. If a cluster which used to be healthy no longer is, it may be the result of an issue. In some cases, it is due to maintenance work (and so was anticipated), but in others it may be due to a failure in the members of the cluster or another component in the network.
Without Indeni how would you find this?
Problems with cluster state are generally detected only when the affected units stop processing traffic. An administrator could verify that each traffic group has an active member by logging into the device through SSH, entering TMSH and executing the command “show cm”. This would bring up details of the cluster state.
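For administrators who would rather script the manual check, the following is a minimal sketch. It assumes the account lands in a bash shell on the device, so that `tmsh -q show cm` (the command the monitoring script itself runs) can be passed directly to `ssh`. The helper names, the host and the user are hypothetical.

```python
import subprocess
from typing import List

# The same command the monitoring script runs on the device.
SHOW_CM = "tmsh -q show cm"


def build_ssh_command(host: str, user: str) -> List[str]:
    """Build the ssh invocation that runs the cluster status command remotely."""
    return ["ssh", f"{user}@{host}", SHOW_CM]


def fetch_cluster_state(host: str, user: str, timeout: int = 30) -> str:
    """Run the command over SSH and return its raw output (assumes key-based login)."""
    result = subprocess.run(
        build_ssh_command(host, user),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if ssh or tmsh exits non-zero
    )
    return result.stdout
```

Calling `fetch_cluster_state("bigip.example.com", "admin")` would return the text an administrator sees after running the command by hand. Interpreting that output is left to the reader, since its layout is not shown here.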
name: f5-show-cm
description: Get cluster information
type: monitoring
monitoring_interval: 5 minutes
requires:
    vendor: f5
    high-availability: 'true'
    product: load-balancer
    shell: bash
comments:
    known-devices:
        why: |
            To make it easier to add devices to indeni, the cluster members are extracted.
        how: |
            This alert logs into the F5 device through SSH and extracts the known cluster members.
        can-with-snmp: false
        can-with-syslog: false
    cluster-state:
        why: |
            Tracking the state of a cluster is important. If a cluster which used to be healthy no longer is, it may be the result of an issue. In some cases, it is due to maintenance work (and so was anticipated), but in others it may be due to a failure in the members of the cluster or another component in the network.
        how: |
            This alert logs into the F5 device through SSH to verify that each traffic group has an active member.
        can-with-snmp: true
        can-with-syslog: false
    cluster-member-active:
        why: |
            Tracking the state of a cluster member is important. If a cluster member which used to be the active member of the cluster no longer is, it may be the result of an issue. In some cases, it is due to maintenance work (and so was anticipated), but in others it may be due to a failure in the firewall or another component in the network.
        how: |
            This alert logs into the F5 device through SSH and retrieves the local member's state.
        can-with-snmp: true
        can-with-syslog: false
    cluster-config-synced:
        why: |
            It is normally desirable for clusters to have their configuration synced. Otherwise, changes made on one node in a cluster might not be active in the event of a failover. This might cause disruption.
        how: |
            This alert logs into the F5 device through SSH and retrieves the current state of the configuration synchronization.
        can-with-snmp: true
        can-with-syslog: false
steps:
-   run:
        type: SSH
        command: tmsh -q show cm
    parse:
        type: AWK
        file: tmsh-show-cm.parser.1.awk
// Deprecation warning : Scala template-based rules are deprecated. Please use YAML format rules instead.
package com.indeni.server.rules.library.templatebased.crossvendor

import com.indeni.server.rules.RuleContext
import com.indeni.server.rules.library.templates.StateDownTemplateRule
import com.indeni.server.rules.RemediationStepCondition

/**
  *
  */
case class cross_vendor_cluster_down_vsx() extends StateDownTemplateRule(
  ruleName = "cross_vendor_cluster_down_vsx",
  ruleFriendlyName = "Clustered Devices (VS): Cluster down",
  ruleDescription = "Indeni will alert if a cluster is down or any of the members are inoperable.",
  metricName = "cluster-state",
  applicableMetricTag = "name",
  descriptionMetricTag = "vs.name",
  historyLength = 2,
  alertItemsHeader = "Clustering Elements Affected",
  alertDescription = "One or more clustering elements in this device are down.\n\nThis alert was added per the request of <a target=\"_blank\" href=\"http://il.linkedin.com/pub/gal-vitenberg/83/484/103\">Gal Vitenberg</a>.",
  baseRemediationText = "Review the cause for one or more members being down or inoperable.")(
  RemediationStepCondition.VENDOR_CP -> "Review other alerts for a cause for the cluster failure.",
  RemediationStepCondition.VENDOR_PANOS -> "Log into the device over SSH and run \"less mp-log ha-agent.log\" for more information.",
  RemediationStepCondition.VENDOR_CISCO ->
    """|
      |1. Verify the communication between the FHRP peers. A random, momentary loss of data communication between the peers is the most common problem that results in continuous FHRP state change (ACT<-> STB) unless this error message occurs during the initial installation.
      |2. Check the CPU utilization by using the "show process CPU" NX-OS command. HSRP state changes are often due to High CPU Utilization.
      |3. Common problems for the loss of FHRP packets between the peers to investigate are physical layer problems, excessive network traffic caused by spanning tree issues or excessive traffic caused by each Vlan.
      |
      |A vPC problem could cause the change to the state so check the next :
      |1. Check that STP bridge assurance is not enabled on the vPC links. Bridge assurance should only be enabled on the vPC peer link
      |2. Compare the vPC domain IDs of the two switches and ensure that they match. Execute the "show vpc brief" to compare the output that should match across the vPC peer switches.
      |3. Verify that both the source and destination IP addresses used for the peer-keepalive messages are reachable from the VRF associated with the vPC peer-keepalive link.
      |Then, execute the "sh vpc peer-keepalive" NX-OS command and review the output from both switches.
      |4. Verify that the peer-keepalive link is up. Otherwise, the vPC peer link will not come up.
      |5. Review the vPC peer link configuration, execute the "sh vpc brief" NX-OS command and review the output. Besides, verify that the vPC peer link is configured as a Layer 2 port channel trunk that allows only vPC VLANs.
      |6. Ensure that type 1 consistency parameters match. If they do not match, then vPC is suspended. Items that are type 2 do not have to match on both Nexus switches for the vPC to be operational. Execute the "sh vpc consistency-parameters" command and review the output
      |7. Verify that the vPC number that you assigned to the port channel that connects to the downstream device from the vPC peer device is identical on both vPC peer devices
      |8. If you manually configured the system priority, verify that you assigned the same priority value on both vPC peer devices
      |9. Verify that the primary vPC is the primary STP root and the secondary vPC is the secondary STP root.
      |10. Review the logs for relevant findings
      |11. For more information please review the next vPC troubleshooting guide:
      |https://www.cisco.com/c/en/us/td/docs/switches/datacenter/nexus5000/sw/troubleshooting/guide/N5K_Troubleshooting_Guide/n5K_ts_vpc.html""".stripMargin
)