Critical process(es) down-paloaltonetworks-panos

Tags: error, health-checks, panos, paloaltonetworks

Vendor: paloaltonetworks

OS: panos

Description:
Many devices have critical processes, usually daemons, that must be up for certain functions to work. Indeni will alert if any of these goes down.

Remediation Steps:
Review the cause for the processes being down.

How does this work?
This script logs into the Palo Alto Networks firewall over SSH and retrieves the status of running processes. It then compares the list of running processes against a known list of critical processes and checks that they are all up. Those that are down are flagged as such.

Why is this important?
Each device has certain executable processes which are critical to its stable operation. Within Palo Alto Networks firewalls, these processes are responsible for the management layer (mgmtsrvr), certain services (like dhcp and snmp), VPN (like ikemgr and keymgr) and many other functions. A process being down may indicate a critical failure.

Without Indeni how would you find this?
An administrator would need to write a script to poll their firewalls for this data. The only other option is to pull the data manually, often during an outage.
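For illustration, such a manual check might capture the command output over SSH (e.g. `ssh admin@fw 'debug system process-info' > procs.txt`, where the hostname and credentials are placeholders) and compare it against a hand-maintained list of critical processes. A minimal sketch, with sample output inlined so it is self-contained:

```shell
# Hypothetical manual check. procs.txt stands in for the captured output of
# "debug system process-info" on a PAN-OS firewall.
cat > procs.txt <<'EOF'
Name                   PID      CPU%  FDs Open   Virt Mem     Res Mem      State
all_task               2518     4     9          1978716      1914604      S
mgmtsrvr               2600     2     120        2000000      1500000      S
EOF

# Report any critical process missing from the output.
for p in mgmtsrvr ikemgr routed; do
    awk -v proc="$p" '$1 == proc { found = 1 } END { exit !found }' procs.txt \
        || echo "CRITICAL: $p is not running"
done
```

Against the sample data above this flags ikemgr and routed, since only mgmtsrvr appears in the captured output.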

panos-debug-system-process-info

#! META
name: panos-debug-system-process-info
description: Grab list of processes
type: monitoring
monitoring_interval: 10 minutes
requires:
    vendor: paloaltonetworks
    os.name: panos

#! COMMENTS
process-state:
    why: |
        Each device has certain executable processes which are critical to its stable operation. Within Palo Alto Networks firewalls, these processes are responsible for the management layer (mgmtsrvr), certain services (like dhcp and snmp), VPN (like ikemgr and keymgr) and many other functions. A process being down may indicate a critical failure.
    how: |
        This script logs into the Palo Alto Networks firewall over SSH and retrieves the status of running processes. It then compares the list of running processes against a known list of critical processes and checks that they are all up. Those that are down are flagged as such.
    without-indeni: |
        An administrator would need to write a script to poll their firewalls for this data. The only other option is to pull the data manually, often during an outage.
    can-with-snmp: false
    can-with-syslog: false
process-cpu:
    skip-documentation: true
process-memory:
    skip-documentation: true

#! REMOTE::SSH
debug system process-info

#! PARSER::AWK
BEGIN {
    critprocesses["mgmtsrvr"] = "Management Plane server for configuring policy device settings etc."
    critprocesses["all_task"] = "N/A"
    critprocesses["authd"] = "Handles authentication functionality"
    critprocesses["brdagent"] = "Mother board agent hardware monitor"
    critprocesses["chasd"] = "N/A"
    critprocesses["comm"] = "N/A"
    critprocesses["crypto"] = "N/A"
    critprocesses["dagger"] = "N/A"
    critprocesses["devsrvr"] = "Device server"
    critprocesses["dha"] = "N/A"
    critprocesses["dhcp"] = "DHCP service"
    critprocesses["dnsproxy"] = "DNS Proxy"
    critprocesses["ehmon"] = "Hardware monitor"
    critprocesses["ha-sshd"] = "High-availability SSH connection"
    critprocesses["ha_agent"] = "High-availability agent"
    critprocesses["ikemgr"] = "IKE Manager for IPsec tunnels"
    critprocesses["keymgr"] = "Key Manager for IPsec tunnels"
    critprocesses["l3svc"] = "Captive portal server"
    critprocesses["logrcvr"] = "Log receiver"
    critprocesses["masterd"] = "Responsible for the health of other critical processes"
    critprocesses["monitor"] = "N/A"
    critprocesses["monitor-dp"] = "Dataplane monitoring process"
    critprocesses["mprelay"] = "Responsible for communication between the management plane and the data plane"
    critprocesses["rasmgr"] = "Remote access service - GlobalProtect"
    critprocesses["routed"] = "Dynamic routing daemon"
    critprocesses["satd"] = "N/A"
    critprocesses["snmpd"] = "SNMP trap sender and agent"
    critprocesses["sshd"] = "SSH server"
    critprocesses["sslmgr"] = "SSL certificate manager"
    critprocesses["sslvpn"] = "SSL VPN service"
    critprocesses["sysd"] = "N/A"
    critprocesses["sysdagent"] = "N/A"
    critprocesses["useridd"] = "User-ID agent daemon"
    critprocesses["varrcvr"] = "WildFire registration server"
    critprocesses["websrvr"] = "Web interface server"
    # Initialize the foundprocess array
    for (key in critprocesses) { foundprocess[key] = "0.0" }
}

#Name                   PID      CPU%  FDs Open   Virt Mem     Res Mem      State
#all_task               2518     4     9          1978716      1914604      S
# The last field of each process row is a single-letter state code (e.g. S, R, D).
# Match on that rather than on " S " alone, so a critical process in a non-sleeping
# state (such as R) is still counted as up. The header row ends in "State" and is
# therefore skipped.
$NF ~ /^[A-Z]$/ {
    if ($1 in critprocesses) {
        foundprocess[$1] = "1.0"
    }
    pid = $2
    cpu = $3
    virtmem = $5
    resmem = $6
    state = $NF
    processname = $1
    command = $1

    pstags["name"] = pid
    pstags["process-name"] = processname
    pstags["command"] = command

    writeDoubleMetricWithLiveConfig("process-cpu", pstags, "gauge", "600", cpu, "Processes (CPU)", "percentage", "name|process-name")

}
END {
    # Any critical process still marked "0.0" was never seen in the output and is down
    for (process in foundprocess) {
        critpstags["name"] = process
        critpstags["process-name"] = process
        critpstags["description"] = critprocesses[process]
        writeDoubleMetric("process-state", critpstags, "gauge", "600", foundprocess[process])
    }
}
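The parser's detection logic can be exercised outside Indeni with plain awk. The following is a reduced sketch: the critical-process table is truncated to two entries, and Indeni's writeDoubleMetric API is replaced with a plain print.

```shell
# Sample input mimicking "debug system process-info" output.
cat > sample.txt <<'EOF'
Name                   PID      CPU%  FDs Open   Virt Mem     Res Mem      State
all_task               2518     4     9          1978716      1914604      S
mgmtsrvr               2600     2     120        2000000      1500000      S
EOF

awk '
BEGIN {
    # Truncated critical-process table; the real script lists ~30 processes
    critprocesses["mgmtsrvr"] = "Management Plane server"
    critprocesses["ikemgr"]   = "IKE Manager for IPsec tunnels"
    for (key in critprocesses) foundprocess[key] = "0.0"
}
# Process rows end in a single-letter state code (S, R, D, ...)
$NF ~ /^[A-Z]$/ && ($1 in critprocesses) { foundprocess[$1] = "1.0" }
END {
    # 1.0 = up, 0.0 = down; stand-in for writeDoubleMetric("process-state", ...)
    for (p in foundprocess) print "process-state", p, foundprocess[p]
}' sample.txt
```

ikemgr never appears in the sample output, so it is emitted with 0.0, which is exactly the condition the cross-vendor rule alerts on.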

cross_vendor_critical_process_down_novsx

package com.indeni.server.rules.library.templatebased.crossvendor

import com.indeni.server.common.data.conditions.Equals
import com.indeni.server.rules.RuleContext
import com.indeni.server.rules.library._

case class cross_vendor_critical_process_down_novsx(context: RuleContext) extends StateDownTemplateRule(context,
  ruleName = "cross_vendor_critical_process_down_novsx",
  ruleFriendlyName = "All Devices: Critical process(es) down",
  ruleDescription = "Many devices have critical processes, usually daemons, that must be up for certain functions to work. Indeni will alert if any of these goes down.",
  metricName = "process-state",
  applicableMetricTag = "process-name",
  descriptionMetricTag = "description",
  alertItemsHeader = "Processes Affected",
  descriptionStringFormat = "${scope(\"description\")}",
  alertDescription = "One or more processes which are critical to the operation of this device are down.",
  baseRemediationText = "Review the cause for the processes being down.",
  metaCondition = !Equals("vsx", "true"))(
  ConditionalRemediationSteps.VENDOR_CP -> "Check if \"cpstop\" was run.",
  ConditionalRemediationSteps.OS_NXOS ->
    """|
      |1. Use the "show processes cpu" NX-OS command in order to show the CPU usage at the process level.
      |2. Use the "show process cpu detail <pid> " NX-OS command to find out the CPU usage for all threads that belong to a specific process ID (PID).
      |3. Use the "show system internal sysmgr service pid <pid> " NX-OS command in order to display additional details, such as restart time, crash status, and current state, on the process/service by PID.
      |4. Run the "show system internal processes cpu" NX-OS command which is equivalent to the top command in Linux and provides an ongoing look at processor activity in real time""".stripMargin,
  ConditionalRemediationSteps.VENDOR_FORTINET ->
    """
      |1. Login via SSH to the Fortinet firewall and run the FortiOS command "diagnose sys top [refresh_time_sec] [number_of_lines]"
        |>>> to get the process ID, state, CPU and memory utilization per process. Press <shift-P> to sort by CPU usage or <shift-M> to sort by memory usage.
      |2. Login via SSH to the Fortinet firewall and run the FortiOS command "diagnose sys top-summary '-h' " to get the command options and receive additional
        |>>> info per process. A sample command could be "diagnose sys top-summary '-s mem -i 60 -n 10' ". If the value in the FDS (File Descriptors)
        |>>> column keeps increasing, it might indicate a file descriptor leak.
      |3. Review the state of each process reported by the above commands. The normal states are S (Sleeping) and R (Running).
        |>>> The abnormal states are Z (Zombie) and D (Do not Disturb).
      |4. Try to restart the process that has a problem by running the command "diag sys kill 11 <process-Id>". The <process-Id> can be found with the aforementioned commands.
      |5. Check the logs for any reasons why the process stopped or cannot restart.
      |6. If the problem persists, contact Fortinet Technical Support at https://support.fortinet.com/ for further assistance.""".stripMargin.replaceAll("\n>>>", "")

)