Critical process(es) down (per VS)-juniper-junos


Vendor: juniper

OS: junos

Description:
Many devices have critical processes, usually daemons, that must be up for certain functions to work. Indeni will alert if any of these go down.

Remediation Steps:
Review the cause for the processes being down.

How does this work?
This script monitors those critical processes and identifies their states by running the “show system processes extensive” command over an SSH connection.

Why is this important?
Many processes are critical to the device functionality and health. It is important to monitor these processes to make sure they are running in normal states.

Without Indeni how would you find this?
An administrator could log in and manually run the command “show system processes extensive” to get the same information.
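
The manual check described above can be approximated offline. The following is a minimal sketch, not Indeni's implementation: it assumes the output of "show system processes extensive" has been saved to a file, and the function name `missing_critical` and the file name `processes.txt` are hypothetical. The daemon list is the one the parser script in this post treats as critical.

```shell
#!/bin/sh
# Hypothetical helper: reads saved "show system processes extensive" output
# on stdin and prints each critical daemon that has no row in the process table.
missing_critical() {
    awk '
        BEGIN {
            # Daemon names taken from the parser script in this post.
            n = split("mgd rpd chassisd jsrpd pkid", names, " ")
            for (i = 1; i <= n; i++) seen[names[i]] = 0
        }
        # Header row: remember which column holds the command name.
        /PID/ && /COMMAND/ {
            for (i = 1; i <= NF; i++) if ($i == "COMMAND") col = i
            next
        }
        # Process rows: mark a critical daemon as present when its name appears.
        col && ($col in seen) { seen[$col] = 1 }
        END {
            for (i = 1; i <= n; i++)
                if (!seen[names[i]]) print names[i]
        }
    '
}
```

For example, `missing_critical < processes.txt` prints one missing daemon name per line and prints nothing when all five daemons are present.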

junos-show-system-processes-extensive

#! META
name: junos-show-system-processes-extensive
description: Retrieve system process information
type: monitoring
monitoring_interval: 10 minute
requires:
    vendor: juniper
    os.name: junos
    product: firewall
    high-availability:
        neq: true

#! COMMENTS
process-state:
    why: |
        Many processes are critical to the device functionality and health. It is important to monitor these processes to make sure they are running in normal states.
    how: |
        This script monitors those critical processes and identifies their states by running the "show system processes extensive" command over an SSH connection.
    without-indeni: |
        An administrator could log in and manually run the command "show system processes extensive" to get the same information.
    can-with-snmp: false
    can-with-syslog: false
    vendor-provided-management: |
        This is only accessible from the command line interface.

process-cpu:
    why: |
        Each process should have adequate CPU resources allocated.
    how: |
        List the CPU usage for all processes by running the "show system processes extensive" command via SSH connection.
    without-indeni: |
        An administrator would need to log in manually when the issue is occurring in order to deduce a root cause.
    can-with-snmp: false
    can-with-syslog: false
    vendor-provided-management: |
        This is only accessible from the command line interface.

process-memory:
    why: |
        Each process should have adequate memory resources allocated.
    how: |
        List the memory usage for all processes by running the "show system processes extensive" command via SSH connection.
    without-indeni: |
        An administrator would need to log in manually when the issue is occurring in order to deduce a root cause.
    can-with-snmp: false
    can-with-syslog: false
    vendor-provided-management: |
        This is only accessible from the command line interface.

#! REMOTE::SSH
show system processes extensive

#! PARSER::AWK

# Converts a size with a K, M, or G suffix into bytes; a value without a suffix is treated as kilobytes.
function parseBytes(_value) {
    _multiplier = 1
    if (_value ~ /M$/) {
        _multiplier = 1024 * 1024
    } else if (_value ~ /G$/) {
        _multiplier = 1024 * 1024 * 1024
    } else {
        _multiplier = 1024
    }

    _valueNumber = _value
    sub(/(M|G|K)/, "", _valueNumber)

    return _valueNumber * _multiplier
}

BEGIN { 
    # The management daemon (MGD) serves a central role in the user-interface component of JUNOS
    criticalprocess["mgd"] = "false"
    processdescription["mgd"] = "This process serves a central role in the user-interface component."
    # The routing protocol daemon (rpd) runs the routing protocols, including on a machine running logical routers
    criticalprocess["rpd"] = "false"
    processdescription["rpd"] = "This process runs the routing protocols, including on a machine running logical routers."
    # The chassis daemon (chassisd) supports all chassis, alarm, and environmental processes.
    criticalprocess["chassisd"] = "false"
    processdescription["chassisd"] = "This process supports all chassis, alarm, and environmental processes." 
    # This process is responsible for exchanging messages and doing failover between devices. 
    criticalprocess["jsrpd"] = "false"
    processdescription["jsrpd"] = "This process is responsible for exchanging messages and doing failover between devices." 
    # Junos OS runs PKId for certificate validation.
    criticalprocess["pkid"] = "false"
    processdescription["pkid"] = "This process verifies certificates." 
    
}

#Mem: 168M Active, 61M Inact, 224M Wired, 12M Cache, 61M Buf, 13M Free
/^Mem:/ {
    totalMem = 0
    for (i = 1; i <= NF; i++) {
        if ($i ~ /[0-9]+(M|G|K)/) {
            totalMem = totalMem + parseBytes($i)
        }
    }
}

#   PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
/(PID|USERNAME|COMMAND)/ {
    getColumns(trim($0), "[ \t]+", columns)
}

# 1449 root        4  76    0   214M 45968K select 0 269:40 99.17% flowd_octeon
/root.*%/ {
    wcpuCol = getColId(columns, "WCPU")
    pidCol = getColId(columns, "PID")
    resCol = getColId(columns, "RES")
    processnameCol = getColId(columns, "COMMAND")

    cpu = $wcpuCol
    sub(/%/, "", cpu)

    memory = parseBytes($resCol) / totalMem * 100
    pid = $pidCol
    processname = $processnameCol
    command = $processnameCol
    for (i = processnameCol + 1; i <= NF; i++) {
        command = command " " $i
    }

    if (processname in criticalprocess && criticalprocess[processname] == "false") {
        criticalprocess[processname] = "true"
    }

    if (command !~ /^(flowd_octeon|idle.*)$/) {
        pstags["name"] = pid
        pstags["process-name"] = processname
        gsub(/,/, "", command)
        pstags["command"] = command

        writeDoubleMetric("process-cpu", pstags, "gauge", "60", cpu)
        writeDoubleMetric("process-memory", pstags, "gauge", "60", memory)
    }
} 

END {
    for (processname in criticalprocess) {
        if (criticalprocess[processname] == "false") {
            status = 0
        } else {
            status = 1
        }
        criticalpstags["process-name"] = processname
        criticalpstags["command"] = processname
        criticalpstags["description"] = processdescription[processname] 

        writeDoubleMetric("process-state", criticalpstags, "gauge", "60", status)
    }
}
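
To illustrate the process-memory calculation in the parser above, here is a standalone sketch in plain awk that does not use the Indeni helper functions (`getColumns`, `getColId`, `writeDoubleMetric`). The names `res_share` and `to_bytes` are hypothetical. Unlike the parser, this sketch does not restrict itself to rows owned by root and does not exclude `flowd_octeon` or idle rows; it only shows how RES is turned into a percentage of the summed "Mem:" line.

```shell
#!/bin/sh
# Hypothetical helper: for each process row in saved command output on stdin,
# prints "PID COMMAND PERCENT", where PERCENT is RES relative to the sum of
# all sizes listed on the "Mem:" line.
res_share() {
    awk '
        # Convert a value such as 45968K, 214M or 1G to bytes.
        # A value with no suffix is treated as kilobytes, as the parser does.
        function to_bytes(v,    mult) {
            mult = 1024
            if (v ~ /M$/) mult = 1024 * 1024
            else if (v ~ /G$/) mult = 1024 * 1024 * 1024
            sub(/[KMG]$/, "", v)
            return v * mult
        }
        # Sum every size on the "Mem:" line to get the total.
        /^Mem:/ {
            total = 0
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9]+[KMG]$/) total += to_bytes($i)
            next
        }
        # Header row: locate the columns by name instead of by fixed position.
        /PID/ && /COMMAND/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "PID") pid_col = i
                if ($i == "RES") res_col = i
                if ($i == "COMMAND") cmd_col = i
            }
            next
        }
        # Process rows start with a numeric PID.
        cmd_col && total > 0 && $pid_col ~ /^[0-9]+$/ {
            printf "%s %s %.2f\n", $pid_col, $cmd_col, to_bytes($res_col) / total * 100
        }
    '
}
```

With the two sample lines quoted in the parser comments, the "Mem:" sizes sum to 539M and RES is 45968K, so the sketch prints `1449 flowd_octeon 8.33`.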

cross_vendor_critical_process_down_vsx

package com.indeni.server.rules.library.templatebased.crossvendor

import com.indeni.server.rules.RuleContext
import com.indeni.server.rules.library.{ConditionalRemediationSteps, StateDownTemplateRule}
import com.indeni.apidata.time.TimeSpan

/**
  * Alerts when a critical process (tracked per VS through the "process-state" metric) is down.
  */
case class cross_vendor_critical_process_down_vsx() extends StateDownTemplateRule(
  ruleName = "cross_vendor_critical_process_down_vsx",
  ruleFriendlyName = "All Devices: Critical process(es) down (per VS)",
  ruleDescription = "Many devices have critical processes, usually daemons, that must be up for certain functions to work. indeni will alert if any of these goes down.",
  metricName = "process-state",
  applicableMetricTag = "process-name",
  descriptionMetricTag = "vs.name",
  alertItemsHeader = "Processes Affected",
  alertDescription = "One or more processes that are critical to the operation of this device are down.",
  baseRemediationText = "Review the cause for the processes being down.")(
  ConditionalRemediationSteps.VENDOR_CP -> "Check if \"cpstop\" was run.",
  ConditionalRemediationSteps.OS_NXOS ->
    """|
      |1. Use the "show processes cpu" NX-OS command in order to show the CPU usage at the process level.
      |2. Use the "show process cpu detail <pid>" NX-OS command to find out the CPU usage for all threads that belong to a specific process ID (PID).
      |3. Use the "show system internal sysmgr service pid <pid>" NX-OS command in order to display additional details, such as restart time, crash status, and current state, on the process/service by PID.
      |4. Run the "show system internal processes cpu" NX-OS command which is equivalent to the top command in Linux and provides an ongoing look at processor activity in real time.""".stripMargin
)