Blade(s) down-paloaltonetworks-panos

Vendor: paloaltonetworks

OS: panos

Description:
Indeni will alert if one or more blades in a chassis are down.

Remediation Steps:
Review the cause for the blades being down.

How does this work?
This script logs into the Palo Alto Networks device using SSH and retrieves the output of the “show log system subtype equal hw direction equal backward csv-output equal yes opaque contains Slot receive_time in last-hour” command. The filters restrict the output to hardware logs from the past hour that mention a slot, with the newest entry first. The “csv-output” option is used because CSV is easier to parse than the regular output, which is separated by spaces and tabs. Only the most recent log entry is evaluated: if it reports “Slot # is up”, the blade state is set to 1; any other state (for example, “starting”) is set to 0.
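
The logic described above can be sketched in Python. This is only an illustration, not the production parser (which is the AWK script further down). The function name latest_slot_state and the shortened header row are assumptions made for the example; the data row is taken from the sample output in the script.

```python
import csv
import io
import re


def latest_slot_state(csv_output):
    """Return (slot_name, state) for the newest log row, or None if there is none.

    state is 1 when the newest entry says the slot is up, otherwise 0.
    """
    rows = list(csv.reader(io.StringIO(csv_output)))
    # Row 0 is the CSV header. With "direction equal backward" the newest
    # log entry is the first data row (row 1).
    if len(rows) < 2 or len(rows[1]) < 15:
        return None
    description = rows[1][14]  # 15th column, e.g. "Slot 1 (PA-5220) is up."
    match = re.match(r"Slot (\d+) \(.*\) is (\w+)", description)
    if match is None:
        return None
    slot_number, status = match.groups()
    return ("Slot " + slot_number, 1 if status == "up" else 0)


# Header shortened for brevity (assumption: only its presence matters here).
SAMPLE = (
    "Domain,Receive Time,Serial #,Type\n"
    "1,10/7/18 18:42,0,SYSTEM,hw,0,10/7/18 18:42,,slot-starting,,0,0,general,"
    "informational,Slot 1 (PA-5220) is starting.,6.60979E+18,"
    "0x8000000000000000,0,0,0,0,,PA-5220\n"
)

print(latest_slot_state(SAMPLE))  # ('Slot 1', 0)
```

Using a CSV reader rather than splitting on commas means a quoted description that contains a comma would still land in column 15.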

Why is this important?
Dataplane restarts can cause network outages, so it is important to detect and address this type of failure immediately. The cause can be hardware or software, and customers who receive this alert need to engage vendor support to investigate the root cause of the restart.

Without Indeni how would you find this?
An administrator could physically check the LED lights for alarm status, or review the system logs from the GUI or CLI.

panos-show-log-system-subtype-equal-hw

#! META
name: panos-show-log-system-subtype-equal-hw
description: Query system logs for any slot failure
type: monitoring
monitoring_interval: 1 minutes
requires:
    vendor: paloaltonetworks
    os.name: panos
    product: firewall

#! COMMENTS
blade-state:
    why: |
        Dataplane restarts can cause network outages, so it is important to detect and address this type of failure immediately. The cause can be hardware or software, and customers who receive this alert need to engage vendor support to investigate the root cause of the restart.
    how: |
        This script logs into the Palo Alto Networks device using SSH and retrieves the output of the "show log system subtype equal hw direction equal backward csv-output equal yes opaque contains Slot receive_time in last-hour" command. The filters restrict the output to hardware logs from the past hour that mention a slot, with the newest entry first. The "csv-output" option is used because CSV is easier to parse than the regular output, which is separated by spaces and tabs. Only the most recent log entry is evaluated: if it reports "Slot # is up", the blade state is set to 1; any other state (for example, "starting") is set to 0.
    without-indeni: |
        An administrator could physically check the LED lights for alarm status, or review the system logs from the GUI or CLI.
    can-with-snmp: false
    can-with-syslog: true

#! REMOTE::SSH
show log system subtype equal hw direction equal backward csv-output equal yes opaque contains Slot receive_time in last-hour

#! PARSER::AWK

BEGIN {
metric_tags["live-config"] = "true"
metric_tags["display-name"] = "Slot"
metric_tags["im.identity-tags"] = "name"
}

# SAMPLE Output
#Domain,Receive Time,Serial #,Type,Threat/Content Type,Config Version,Generate Time,Virtual System,Event ID,Object,fmt,id,module,Severity,Description,Sequence Number,Action Flags,dg_hier_level_1,dg_hier_level_2,dg_hier_level_3,dg_hier_level_4,Virtual System Name,Device Name
#1,10/7/18 18:43,0,SYSTEM,hw,0,10/7/18 18:43,,slot-up,,0,0,general,informational,Slot 1 (PA-5220) is up.,6.60979E+18,0x8000000000000000,0,0,0,0,,PA-5220
#1,10/7/18 18:43,0,SYSTEM,hw,0,10/7/18 18:43,,slot-up,,0,0,general,informational,Slot 1 (PA-5220) is up.,6.60979E+18,0x8000000000000000,0,0,0,0,,PA-5220
#1,10/7/18 18:42,0,SYSTEM,hw,0,10/7/18 18:42,,slot-up,,0,0,general,informational,Slot 1 (PA-5220) is up.,6.60979E+18,0x8000000000000000,0,0,0,0,,PA-5220
#1,10/7/18 18:42,0,SYSTEM,hw,0,10/7/18 18:42,,slot-starting,,0,0,general,informational,Slot 1 (PA-5220) is starting.,6.60979E+18,0x8000000000000000,0,0,0,0,,PA-5220
#1,10/7/18 18:42,0,SYSTEM,hw,0,10/7/18 18:42,,slot-starting,,0,0,general,informational,Slot 1 (PA-5220) is starting.,6.60979E+18,0x8000000000000000,0,0,0,0,,PA-5220
#1,10/7/18 18:40,0,SYSTEM,hw,0,10/7/18 18:40,,slot-starting,,0,0,general,informational,Slot 1 (PA-5220) is starting.,6.60979E+18,0x8000000000000000,0,0,0,0,,PA-5220
#1,10/7/18 18:39,0,SYSTEM,hw,0,10/7/18 18:39,,slot-starting,,0,0,general,informational,Slot 1 (PA-5220) is starting.,6.60979E+18,0x8000000000000000,0,0,0,0,,PA-5220
#1,10/7/18 18:39,0,SYSTEM,hw,0,10/7/18 18:38,,slot-starting,,0,0,general,informational,Slot 1 (PA-5220) is starting.,6.60979E+18,0x8000000000000000,0,0,0,0,,PA-5220
#1,10/7/18 18:28,0,SYSTEM,hw,0,10/7/18 18:28,,slot-up,,0,0,general,informational,Slot 1 (PA-5220) is up.,6.60979E+18,0x8000000000000000,0,0,0,0,,PA-5220
#1,10/7/18 18:27,0,SYSTEM,hw,0,10/7/18 18:27,,slot-up,,0,0,general,informational,Slot 1 (PA-5220) is up.,6.60979E+18,0x8000000000000000,0,0,0,0,,PA-5220
#1,10/7/18 18:27,0,SYSTEM,hw,0,10/7/18 18:27,,slot-up,,0,0,general,informational,Slot 1 (PA-5220) is up.,6.60979E+18,0x8000000000000000,0,0,0,0,,PA-5220
#1,10/7/18 18:27,0,SYSTEM,hw,0,10/7/18 18:27,,slot-starting,,0,0,general,informational,Slot 1 (PA-5220) is starting.,6.60979E+18,0x8000000000000000,0,0,0,0,,PA-5220
#1,10/7/18 18:27,0,SYSTEM,hw,0,10/7/18 18:27,,slot-starting,,0,0,general,informational,Slot 1 (PA-5220) is starting.,6.60979E+18,0x8000000000000000,0,0,0,0,,PA-5220
#1,10/7/18 18:25,0,SYSTEM,hw,0,10/7/18 18:25,,slot-starting,,0,0,general,informational,Slot 1 (PA-5220) is starting.,6.60979E+18,0x8000000000000000,0,0,0,0,,PA-5220
#1,10/7/18 18:24,0,SYSTEM,hw,0,10/7/18 18:23,,slot-starting,,0,0,general,informational,Slot 1 (PA-5220) is starting.,6.60979E+18,0x8000000000000000,0,0,0,0,,PA-5220
#1,10/7/18 18:24,0,SYSTEM,hw,0,10/7/18 18:23,,slot-starting,,0,0,general,informational,Slot 1 (PA-5220) is starting.,6.60979E+18,0x8000000000000000,0,0,0,0,,PA-5220

{
    # The first line contains the field names, so it can be ignored. Only the latest log message is needed to get the status of the slot, so the remaining lines are not processed.
    if ( NR == 2 )   {
        # The logs are listed newest first, so the latest log is on the 2nd line of the output (NR == 2). If there are no logs, the line will be empty.
        split($0, log_message, ",")
        # We are interested in the description field, which is #15 in the list, for example:
        # #1,10/7/18 18:24,0,SYSTEM,hw,0,10/7/18 18:23,,slot-starting,,0,0,general,informational,Slot 1 (PA-5220) is starting.,6.60979E+18,0x8000000000000000,0,0,0,0,,PA-5220
        # We need to get this part: "Slot 1 (PA-5220) is starting."
        # Quotes and periods are stripped so that the last word is a clean state name.
        gsub(/[\"\.]/, "" , log_message[15])
        split(log_message[15], error, " ")
        # error[2] holds the slot number, for example 1 from "Slot 1".
        # Use the match operator (~); comparing with == against a regex literal does not test error[2].
        if (error[2] ~ /^[0-9]+$/) {
            slot = "Slot " error[2]
            metric_tags["name"] = slot
            # error[5] holds the latest reported state, for example starting or up
            status = error[5]
            if (status == "up") {
                blade_state = 1
            } else {
                blade_state = 0
            }
            metric_tags["im.dstype.displaytype"] = "state"
            # The metric is written only when a slot number was parsed, so an empty line does not produce an unnamed metric.
            writeDoubleMetricWithLiveConfig("blade-state", metric_tags, "gauge", "300", blade_state, "Slot status", "state", "name")
        }
        exit
    }
}


chassis_blade_down

package com.indeni.server.rules.library.templatebased.crossvendor

import com.indeni.server.rules.RuleContext
import com.indeni.server.rules.library.ConditionalRemediationSteps
import com.indeni.server.rules.library.templates.StateDownTemplateRule

/**
  *
  */
case class chassis_blade_down() extends StateDownTemplateRule(
  ruleName = "chassis_blade_down",
  ruleFriendlyName = "Chassis Devices: Blade(s) down",
  ruleDescription = "Indeni will alert if one or more blades in a chassis are down.",
  metricName = "blade-state",
  applicableMetricTag = "name",
  alertItemsHeader = "Blades Affected",
  alertDescription = "One or more blades in this chassis are down.",
  baseRemediationText = "Review the cause for the blades being down.")(
  ConditionalRemediationSteps.VENDOR_CP -> "If the blade was not stopped intentionally (admin down), check to see it wasn't disconnected physically.",
  ConditionalRemediationSteps.OS_NXOS ->
    """|
      |Most of the module related failures (such as the module not coming up, the module getting reloaded, and so on) can be analyzed by looking at the logs stored on the switch. Use the following CLI commands to identify the problem:
      |•show system reset-reason module
      |•show version
      |•show logging
      |•show module internal exception-log
      |•show module internal event-history module
      |•show module internal event-history errors
      |•show platform internal event-history errors
      |•show platform internal event-history module
      |Further details can be found in the following Cisco troubleshooting guide:
      |https://www.cisco.com/en/US/products/ps5989/prod_troubleshooting_guide_chapter09186a008067a0ef.html""".stripMargin
)
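
The internals of StateDownTemplateRule are not shown in this post, so the following Python sketch only illustrates the behavior the rule parameters suggest: look at the latest "blade-state" value per "name" tag and list the blades whose value is 0. The function name evaluate_blade_state and the shape of the returned dictionary are hypothetical and are not part of the Indeni API.

```python
def evaluate_blade_state(latest_states):
    """Return a hypothetical alert dict if any blade is down, else None.

    latest_states maps a blade name (the "name" tag) to its latest
    blade-state value: 1 means up, 0 means anything else.
    """
    down = sorted(name for name, value in latest_states.items() if value == 0)
    if not down:
        return None
    return {
        "description": "One or more blades in this chassis are down.",
        "Blades Affected": down,
    }


print(evaluate_blade_state({"Slot 1": 0, "Slot 2": 1}))
```

With the parser above emitting 0 for a slot that is still "starting", a slot stuck in a restart loop would keep appearing in the affected list until its newest log entry reports "up".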