Hardware element down-radware-alteon-os

error
health-checks
alteon-os
radware

Vendor: radware

OS: alteon-os

Description:
Alert if any hardware elements are not operating correctly.

Remediation Steps:
Troubleshoot the hardware element as soon as possible.

How does this work?
This script leverages the Radware Alteon’s API to detect that all available PSU slots are in use. Detection of an empty PSU slot will trigger an alert.
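
As a rough illustration of the check described above, the sketch below builds the polled URL and interprets a response. The URL path and the status value 7 (singlePowerSupplyConnected) come from the script further down. The host name, the helper names, and the JSON payload shape (a single object with an `hwPowerSupplyStatus` field) are assumptions made for this example, and authentication is left out entirely.

```python
import json

# Path taken from the script's REMOTE::HTTP section.
STATUS_PATH = "/config/hwPowerSupplyStatus"

# Value 7 is singlePowerSupplyConnected in the script's list of states.
SINGLE_PSU_CONNECTED = 7


def status_url(host: str) -> str:
    """Build the HTTPS URL to poll (the host is a placeholder)."""
    return f"https://{host}{STATUS_PATH}"


def read_status(payload: str) -> int:
    """Extract the status code. The payload shape is an assumption."""
    return int(json.loads(payload)["hwPowerSupplyStatus"])


def empty_slot_detected(payload: str) -> bool:
    """True when only one PSU is reported as connected (status 7)."""
    return read_status(payload) == SINGLE_PSU_CONNECTED
```

For example, `empty_slot_detected('{"hwPowerSupplyStatus": 7}')` is true, while a payload carrying status 4 (doublePowerSupplyOk) is not flagged.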

Why is this important?
It is difficult to identify when a power supply fails unless you are actively tracking the logs or have set up an SNMP trap. Even then, it is hard to determine whether the device comes back up within a reasonable amount of time. Identifying patterns in which the PSU fails is important as well.

Without Indeni how would you find this?
An administrator would need to SSH into the device CLI and run the command "info/sys/ps" to check whether a PSU is installed in each available slot.

radware-api-config-hwPowerSupplyStatus

#! META
name: radware-api-config-hwPowerSupplyStatus
description: Get the status of the power supplies.
type: monitoring
monitoring_interval: 5 minute 
requires:
    os.name: "alteon-os"
    vendor: "radware"

#! COMMENTS
power-supply-inserted:
    why: |
        Network devices that do not have redundant power supplies are at risk of a complete service outage should there be a loss of power to the device. It is best practice to install multiple PSUs with separate power sources whenever possible.
    how: |
        This script leverages the Radware Alteon's API to detect that all available PSU slots are in use. Detection of an empty PSU slot will trigger an alert. 
    without-indeni: |
        An administrator would need to SSH into the device CLI and run the command "info/sys/ps" to check whether a PSU is installed in each available slot.
    can-with-snmp: true
    can-with-syslog: false
hardware-element-status:
    why: |
        It is difficult to identify when a power supply fails unless you are actively tracking the logs or have set up an SNMP trap. Even then, it is hard to determine whether the device comes back up within a reasonable amount of time. Identifying patterns in which the PSU fails is important as well.
    how: |
        This script leverages the Radware Alteon's API to detect that all available PSU slots are in use. Detection of an empty PSU slot will trigger an alert. 
    without-indeni: |
        An administrator would need to SSH into the device CLI and run the command "info/sys/ps" to check whether a PSU is installed in each available slot.
    can-with-snmp: true
    can-with-syslog: false

#! REMOTE::HTTP
url: /config/hwPowerSupplyStatus
protocol: HTTPS

#! PARSER::JSON


# Power supply states:
# singlePowerSupplyOk(1)
# firstPowerSupplyFailed(2)
# secondPowerSupplyFailed(3)
# doublePowerSupplyOk(4)
# singlePowerSupplyConnected(7)

# Radware exposes only limited power supply data in this output.
# The status only tells us whether the power supplies are OK (values 1 and 4) or one of them has failed (values 2 and 3).
# Beyond that, little can be determined other than that a single PS is operational (value 7).
# Because of this, it is best not to show this in live-config: the script assumes the device has both PS active unless the output of the query indicates otherwise (values 2, 3, or 7).

_metrics:
    -
        _groups:
            "$.[?(@.hwPowerSupplyStatus == 1)]":
                _tags:
                    "im.name":
                        _constant: "hardware-element-status"
                    "im.dstype.displayType":
                        _constant: "state"
                    "name":
                        _constant: "Primary Power Supply"
                    "live-config":
                        _constant: "true"
                    "display-name":
                        _constant: "Power Supplies"
                _value.double:
                    _constant: "1.0"
    -
        _groups:
            "$.[?(@.hwPowerSupplyStatus == 2)]":
                _tags:
                    "im.name":
                        _constant: "hardware-element-status"
                    "im.dstype.displayType":
                        _constant: "state"
                    "name":
                        _constant: "Primary Power Supply"
                    "live-config":
                        _constant: "true"
                    "display-name":
                        _constant: "Power Supplies"
                _value.double:
                    _constant: "0.0"
    -
        _groups:
            "$.[?(@.hwPowerSupplyStatus == 2)]":
                _tags:
                    "im.name":
                        _constant: "hardware-element-status"
                    "im.dstype.displayType":
                        _constant: "state"
                    "name":
                        _constant: "Secondary Power Supply"
                    "live-config":
                        _constant: "true"
                    "display-name":
                        _constant: "Power Supplies"
                _value.double:
                    _constant: "1.0"
    -
        _groups:
            "$.[?(@.hwPowerSupplyStatus == 3)]":
                _tags:
                    "im.name":
                        _constant: "hardware-element-status"
                    "im.dstype.displayType":
                        _constant: "state"
                    "name":
                        _constant: "Primary Power Supply"
                    "live-config":
                        _constant: "true"
                    "display-name":
                        _constant: "Power Supplies"
                _value.double:
                    _constant: "1.0"
    -
        _groups:
            "$.[?(@.hwPowerSupplyStatus == 3)]":
                _tags:
                    "im.name":
                        _constant: "hardware-element-status"
                    "im.dstype.displayType":
                        _constant: "state"
                    "name":
                        _constant: "Secondary Power Supply"
                    "live-config":
                        _constant: "true"
                    "display-name":
                        _constant: "Power Supplies"
                _value.double:
                    _constant: "0.0"
    -
        _groups:
            "$.[?(@.hwPowerSupplyStatus == 4)]":
                _tags:
                    "im.name":
                        _constant: "hardware-element-status"
                    "im.dstype.displayType":
                        _constant: "state"
                    "name":
                        _constant: "Primary Power Supply"
                    "live-config":
                        _constant: "true"
                    "display-name":
                        _constant: "Power Supplies"
                _value.double:
                    _constant: "1.0"
    -
        _groups:
            "$.[?(@.hwPowerSupplyStatus == 4)]":
                _tags:
                    "im.name":
                        _constant: "hardware-element-status"
                    "im.dstype.displayType":
                        _constant: "state"
                    "name":
                        _constant: "Secondary Power Supply"
                    "live-config":
                        _constant: "true"
                    "display-name":
                        _constant: "Power Supplies"
                _value.double:
                    _constant: "1.0"
    -
        _groups:
            "$.[?(@.hwPowerSupplyStatus == 7)]":
                _tags:
                    "im.name":
                        _constant: "hardware-element-status"
                    "im.dstype.displayType":
                        _constant: "state"
                    "name":
                        _constant: "Primary Power Supply"
                    "live-config":
                        _constant: "true"
                    "display-name":
                        _constant: "Power Supplies"
                _value.double:
                    _constant: "1.0"
    -
        _tags:
            "im.name":
                _constant: "power-supply-inserted"
            "name":
                _constant: "Power Supply Inserted"
            "im.dstype.displayType":
                _constant: "state"
        _temp:
            powerSupplyStatus:
                _value: "hwPowerSupplyStatus"
        _transform:
            _value.double: |
                {
                    if (temp("powerSupplyStatus") == 7){
                        print "0.0"
                    } else {
                        print "1.0"
                    }
                }
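
The parser above is easier to follow as a lookup table. The sketch below restates, in Python and purely for illustration, which `hardware-element-status` and `power-supply-inserted` values each status code produces. The table contents mirror the `_groups` entries and the transform in the script. The function names are hypothetical, and treating 0.0 as "down" is an assumption about how the state-down rule reads the metric.

```python
# Status code -> (element name, value) pairs emitted as
# "hardware-element-status". 1.0 means up, 0.0 means down.
STATUS_TO_METRICS = {
    1: [("Primary Power Supply", 1.0)],                                   # singlePowerSupplyOk
    2: [("Primary Power Supply", 0.0), ("Secondary Power Supply", 1.0)],  # firstPowerSupplyFailed
    3: [("Primary Power Supply", 1.0), ("Secondary Power Supply", 0.0)],  # secondPowerSupplyFailed
    4: [("Primary Power Supply", 1.0), ("Secondary Power Supply", 1.0)],  # doublePowerSupplyOk
    7: [("Primary Power Supply", 1.0)],                                   # singlePowerSupplyConnected
}


def hardware_element_status(code: int) -> list:
    """Return the metrics for a status code; unknown codes emit nothing."""
    return list(STATUS_TO_METRICS.get(code, []))


def power_supply_inserted(code: int) -> float:
    """Mirror the transform: 0.0 only when a single PSU is connected."""
    return 0.0 if code == 7 else 1.0


def down_elements(code: int) -> list:
    """Names an alert would list, assuming 0.0 is treated as down."""
    return [name for name, value in hardware_element_status(code) if value == 0.0]
```

With this table, status 2 yields `["Primary Power Supply"]` as the down element and status 3 yields `["Secondary Power Supply"]`. Status 7 reports no failed element but sets `power-supply-inserted` to 0.0.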

cross_vendor_hardware_element_status

package com.indeni.server.rules.library.templatebased.crossvendor

import com.indeni.server.rules.RuleContext
import com.indeni.server.rules.library.{ConditionalRemediationSteps, StateDownTemplateRule}

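/**
  * Alerts when any "hardware-element-status" metric reports a down state,
  * listing the affected elements by their "name" tag.
  */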
case class cross_vendor_hardware_element_status(context: RuleContext) extends StateDownTemplateRule(context,
  ruleName = "cross_vendor_hardware_element_status",
  ruleFriendlyName = "All Devices: Hardware element down",
  ruleDescription = "Alert if any hardware elements are not operating correctly.",
  metricName = "hardware-element-status",
  applicableMetricTag = "name",
  alertItemsHeader = "Hardware Elements Affected",
  alertDescription = "The hardware elements listed below are not operating correctly.",
  baseRemediationText = "Troubleshoot the hardware element as soon as possible.")(
  ConditionalRemediationSteps.OS_NXOS ->
    """|While the port may be in up status, the link quality might be degraded and is not between the threshold levels. Check the following to troubleshoot this issue.
       |1. Run the "show interface transceiver detailed" NX-OS command to display information about the transceivers connected to a specific interface. The command output also provides the Cisco SFP Product ID (PID). NOTE: If 3rd party SFPs are in use, an Indeni alert may be raised because the current light signal differs from the recommended min/max thresholds defined by Cisco.
       |2. Use the "show interface transceiver calibrations" NX-OS command to display calibration information for the transceiver interfaces.
       |3. Consider enabling DOM (if supported). Digital Optical Monitoring, or DOM, is an industry-wide standard that lets an SFP expose real-time operating parameters such as Tx power and Rx power. More details can be found below: https://www.cisco.com/c/en/us/td/docs/interfaces_modules/transceiver_modules/compatibility/matrix/DOM_matrix.html
       |4. Cisco has published official specifications (Rx, Tx power level etc.) per transceiver category, which can be found at the following link:
       |https://www.cisco.com/c/en/us/products/interfaces-modules/transceiver-modules/index.""".stripMargin,
  ConditionalRemediationSteps.VENDOR_FORTINET ->
    """
      |1. Log in via SSH to the Fortinet firewall and run the FortiOS command "exec sensor list" to review the status of the hardware components and temperature
      |>>> thresholds. When the flag in the command output is set to 0, the component is working correctly; when the flag is set to 1, the component has a problem.
      |>>> The FortiOS command "execute sensor detail" will show extra information such as the low/high thresholds. More details can be found here:
      |>>> http://kb.fortinet.com/kb/viewContent.do?externalId=FD36793&sliceId=1
      |2. Consider running the Fortinet hardware diagnostics commands. While they do not detect all hardware malfunctions, tests for the most common hardware
      |>>> problems are performed. More details can be found here:
      |- http://kb.fortinet.com/kb/viewContent.do?externalId=FD39581&sliceId=1
      |- http://kb.fortinet.com/kb/documentLink.do?externalID=FD34745
      |3. It is recommended that any failed fan or power supply unit be replaced immediately.
      |4. Ensure the cooling system for the devices is installed to avoid overheating.
      |5. If the problem persists, contact Fortinet Technical support at https://support.fortinet.com/ for further assistance.""".stripMargin.replaceAll("\n>>>", "")
)
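
The Fortinet remediation text above relies on a small convention: source lines that begin with `>>>` after the margin are continuations, and `.stripMargin.replaceAll("\n>>>", "")` folds them back into the previous line. The sketch below imitates that behavior in Python for illustration only. `strip_margin` is a hand-written approximation of Scala's `stripMargin`, not a library function, and the sample text is made up.

```python
import re


def strip_margin(text: str) -> str:
    """Approximate Scala's stripMargin: on each line, drop leading
    spaces and tabs up to and including the first '|'."""
    return re.sub(r"(?m)^[ \t]*\|", "", text)


def join_continuations(text: str) -> str:
    """Strip the margin, then fold lines that start with '>>>' into the
    line before them, as the rule does with replaceAll."""
    return strip_margin(text).replace("\n>>>", "")


# Made-up sample in the same shape as the remediation steps.
raw = (
    "\n"
    "      |1. Review status and temperature\n"
    "      |>>> thresholds.\n"
    "      |2. Replace failed parts."
)
folded = join_continuations(raw)
```

The space after `>>>` is kept when the lines are joined, which is why each continuation line in the rule starts with `>>> ` and not `>>>` alone.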