Critical process(es) down-f5-all

Vendor: f5

OS: all

Description:
Many devices have critical processes, usually daemons, that must be up for certain functions to work. Indeni will alert if any of these goes down.

Remediation Steps:
Review the cause for the processes being down.

How does this work?
This script logs into the F5 unit via iControl REST, retrieves the status of the HA processes, and verifies that all enabled services are up.
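
The check described above can be sketched in a few lines of Python. This is only an illustration, not Indeni's parser: it assumes each HA entry has already been flattened into a dictionary holding the fields the script selects (key, enabled, failure), and that enabled and failure carry "yes"/"no" strings. The function name and the sample data are hypothetical.

```python
from typing import Dict, Iterable, List


def down_processes(records: Iterable[Dict[str, str]]) -> List[str]:
    """Return the keys of enabled HA entries that report a failure."""
    down = []
    for rec in records:
        # Disabled entries are skipped: only enabled services must be up.
        if rec.get("enabled") != "yes":
            continue
        # Assumption: "no" in the failure field means the process is healthy.
        if rec.get("failure") != "no":
            down.append(rec.get("key", "unknown"))
    return down


# Hypothetical sample data, not captured from a real device.
sample = [
    {"key": "snmpd", "enabled": "yes", "failure": "yes"},
    {"key": "mcpd", "enabled": "yes", "failure": "no"},
    {"key": "bigd", "enabled": "no", "failure": "yes"},
]
print(down_processes(sample))  # ['snmpd']
```

Treating any value other than "no" as a failure is a deliberately conservative choice in this sketch: an unexpected or missing value is reported instead of silently passing.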

Why is this important?
Each device runs certain processes that are critical to its stable operation. On F5 units, these processes are responsible for the management layer. One example is the watchdog service, which ensures that the system reboots in the event of a lockup. A process being down may indicate a critical failure.

Without Indeni how would you find this?
If, for instance, “snmpd” stopped working, an administrator could check the status of the HA services by logging into the device over SSH, entering TMSH, and running the command “show sys ha-status all-properties”. This lists all HA processes and their status, including “snmpd”.

f5-rest-mgmt-tm-sys-ha-status

name: f5-rest-mgmt-tm-sys-ha-status
description: Determine status of critical processes
type: monitoring
monitoring_interval: 5 minutes
requires:
    vendor: f5
    product: load-balancer
    rest-api: 'true'
comments:
    process-state:
        why: |
            Each device runs certain processes that are critical to its stable operation. On F5 units, these processes are responsible for the management layer. One example is the watchdog service, which ensures that the system reboots in the event of a lockup. A process being down may indicate a critical failure.
        how: |
            This script logs into the F5 unit via iControl REST, retrieves the status of the HA processes, and verifies that all enabled services are up.
        without-indeni: |
            If, for instance, "snmpd" stopped working, an administrator could check the status of the HA services by logging into the device over SSH, entering TMSH, and running the command "show sys ha-status all-properties". This lists all HA processes and their status, including "snmpd".
        can-with-snmp: false
        can-with-syslog: false
steps:
-   run:
        type: HTTP
        command: /mgmt/tm/sys/ha-status/stats?options=all-properties&$select=key,respProcess,enabled,failure,haFeature
    parse:
        type: JSON
        file: rest-mgmt-tm-sys-ha-status-stats.parser.1.json.yaml
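
The parse step above turns the JSON reply into per-process records. As a rough sketch of what such flattening could look like, the Python below walks a nested stats tree and collects the leaf fields. It assumes the common iControl REST stats layout, where "entries" objects nest through "nestedStats" and each leaf field carries its value under "description". The exact nesting depth for ha-status is not taken from the source, and the function name and sample reply are hypothetical.

```python
import json
from typing import Any, Dict, Iterator


def iter_records(node: Any) -> Iterator[Dict[str, str]]:
    """Yield flat {field: value} dicts from a nested stats tree."""
    if not isinstance(node, dict):
        return
    # Assumption: an entry either wraps its fields in "nestedStats" or not.
    inner = node.get("nestedStats", node)
    entries = inner.get("entries") if isinstance(inner, dict) else None
    if not isinstance(entries, dict):
        return
    # Leaf fields are assumed to look like {"description": "<value>"}.
    leaves = {
        name: value["description"]
        for name, value in entries.items()
        if isinstance(value, dict) and "description" in value
    }
    if leaves:
        yield leaves
    # Anything that is not a leaf field may hold further nested entries.
    for value in entries.values():
        if isinstance(value, dict) and "description" not in value:
            yield from iter_records(value)


# Hypothetical reply, shaped like a typical stats response.
sample = json.loads("""
{"entries": {"https://localhost/mgmt/tm/sys/ha-status/0/stats": {"nestedStats": {"entries": {
    "key": {"description": "snmpd"},
    "enabled": {"description": "yes"},
    "failure": {"description": "no"}
}}}}}
""")
records = list(iter_records(sample))
print(records)
```

Walking the tree recursively avoids hard-coding a nesting depth, which is the part of the layout this sketch is least sure about.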

cross_vendor_critical_process_down_novsx

// Deprecation warning : Scala template-based rules are deprecated. Please use YAML format rules instead.

package com.indeni.server.rules.library.templatebased.crossvendor

import com.indeni.server.common.data.conditions.Equals
import com.indeni.server.rules.RuleContext
import com.indeni.server.rules.library.templates.StateDownTemplateRule
import com.indeni.server.rules.RemediationStepCondition

case class cross_vendor_critical_process_down_novsx() extends StateDownTemplateRule(
  ruleName = "cross_vendor_critical_process_down_novsx",
  ruleFriendlyName = "All Devices: Critical process(es) down",
  ruleDescription = "Many devices have critical processes, usually daemons, that must be up for certain functions to work. Indeni will alert if any of these goes down.",
  metricName = "process-state",
  applicableMetricTag = "process-name",
  descriptionMetricTag = "description",
  alertItemsHeader = "Processes Affected",
  descriptionStringFormat = "${scope(\"description\")}",
  alertDescription = "One or more processes that are critical to the operation of this device are down.",
  baseRemediationText = "Review the cause for the processes being down.",
  metaCondition = !Equals("vsx", "true"))(
  RemediationStepCondition.VENDOR_CP -> "Check if \"cpstop\" was run.",
  RemediationStepCondition.VENDOR_CISCO ->
    """|
      |1. Use the "show processes cpu" NX-OS command in order to show the CPU usage at the process level.
      |2. Use the "show process cpu detail <pid> " NX-OS command to find out the CPU usage for all threads that belong to a specific process ID (PID).
      |3. Use the "show system internal sysmgr service pid <pid> " NX-OS command in order to display additional details, such as restart time, crash status, and current state, on the process/service by PID.
      |4. Run the "show system internal processes cpu" NX-OS command, which is equivalent to the top command in Linux and provides an ongoing look at processor activity in real time.""".stripMargin,
  RemediationStepCondition.VENDOR_FORTINET ->
    """
      |1. Log in via SSH to the Fortinet firewall and run the FortiOS command "diagnose sys top [refresh_time_sec] [number_of_lines]"
        |>>> to get the process ID, state, and CPU and memory utilization per process. Press <shift-P> to sort by CPU usage or <shift-M> to sort by memory usage.
      |2. Log in via SSH to the Fortinet firewall and run the FortiOS command "diagnose sys top-summary '-h' " to get the command options and receive additional
        |>>> info per process. A sample command could be "diagnose sys top-summary '-s mem -i 60 -n 10' ". If the value in the FDS (File Descriptors)
        |>>> column keeps increasing, it might indicate a memory leak.
      |3. Review the state of each process provided by the above commands. The normal states are S (Sleeping) and R (Running).
        |>>> A process may be in D (Do not Disturb) temporarily; a process that stays in D, or is in Z (Zombie), is abnormal.
      |4. Try to restart the problematic process by running the command "diag sys kill 11 <process-Id>". The <process-Id> can be found with the commands above.
      |5. Check the logs for any reasons why the process stops or can't restart.
      |6. If the problem persists, contact Fortinet Technical support at https://support.fortinet.com/ for further assistance.""".stripMargin.replaceAll("\n>>>", "")

)