Virtual systems restarted (uptime low)-fortinet-FortiOS

health-checks
critical
fortios
fortinet

Vendor: fortinet

OS: FortiOS

Description:
Indeni will alert when a virtual system has restarted.

Remediation Steps:
Determine why the virtual system(s) was restarted.

How does this work?
Indeni uses the built-in Fortinet “get system performance status” command to retrieve the current device uptime.
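
As a rough illustration of that step, the sketch below converts the "Uptime:" line of the command output into milliseconds, mirroring the arithmetic of the AWK parser further down. It is written in Python for readability only. The function name parse_uptime_ms, the regular expression, and the tolerance for singular units such as "1 day" are assumptions made for this example and are not part of the Indeni script.

```python
import re

# Hypothetical helper for illustration; not part of the Indeni script.
# Assumes a line shaped like: "Uptime: 3 days,  6 hours,  10 minutes"
UPTIME_RE = re.compile(
    r"^Uptime:\s*(\d+)\s+days?,\s*(\d+)\s+hours?,\s*(\d+)\s+minutes?"
)

def parse_uptime_ms(output):
    """Return the uptime in milliseconds, or None if no Uptime line is found."""
    for line in output.splitlines():
        match = UPTIME_RE.match(line.strip())
        if match:
            days, hours, minutes = (int(group) for group in match.groups())
            seconds = days * 86400 + hours * 3600 + minutes * 60
            return seconds * 1000
    return None

sample = "Uptime: 3 days,  6 hours,  10 minutes"
print(parse_uptime_ms(sample))
```

The command reports uptime with one-minute resolution, so the resulting value always lands on a whole minute.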

Why is this important?
Capture the uptime of the device. If the uptime is lower than the previous sample, the device must have reloaded.
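
The reasoning above can be sketched in a few lines of Python. This is a minimal illustration, assuming the fixed 60-minute threshold configured in the rule shown later on this page. Both function names are hypothetical, and the comparison against the previous sample is included only to mirror the wording of the paragraph above.

```python
# 60 minutes in milliseconds, matching threshold = TimeSpan.fromMinutes(60) in the rule.
THRESHOLD_MS = 60 * 60 * 1000

def restarted_recently(uptime_ms, threshold_ms=THRESHOLD_MS):
    """True when the reported uptime is strictly below the threshold."""
    return uptime_ms < threshold_ms

def restarted_since(previous_ms, current_ms):
    """True when the uptime went backwards between two consecutive samples."""
    return current_ms < previous_ms

print(restarted_recently(10 * 60 * 1000))     # device reports 10 minutes of uptime
print(restarted_since(281_400_000, 120_000))  # uptime dropped between samples
```

A fixed threshold needs no stored state, while the sample-to-sample comparison only catches a restart if the uptime is still lower than the previous reading when the next sample arrives.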

Without Indeni how would you find this?
An administrator could log in and manually run the command via the CLI, check the system resources widget in the GUI, enable SNMP, or use Fortinet FortiAnalyzer.

fortios-get-system-performance-status

#! META
name: fortios-get-system-performance-status
description: Performance metrics based on "get system performance status" command on Fortinet firewall
type: monitoring
monitoring_interval: 1 minute
includes_resource_data: true
requires:
    vendor: "fortinet"
    os.name: "FortiOS"
    product: "firewall"
    vdom_enabled: "false"

#! COMMENTS
memory-usage:
    why: |
        If the firewall memory becomes fully utilized, performance may be impacted and traffic may be dropped, and in extreme cases the firewall could crash. It is critical to monitor the memory usage and handle the issue prior to resource exhaustion.
    how: |
        Indeni uses the built-in Fortinet "get system performance status" command to retrieve the device memory utilization.
    without-indeni: |
        An administrator could log in and manually run the command via the CLI, check the system resources widget in the GUI, enable SNMP, configure a syslog server to receive a log message every 5 minutes containing the utilization, or use Fortinet FortiAnalyzer.
    can-with-snmp: true
    can-with-syslog: true

cpu-usage:
    why: |
        If the firewall CPU becomes fully utilized, performance may be impacted and traffic may be dropped, and in extreme cases the firewall could crash. It is critical to monitor the CPU usage and handle the issue prior to resource exhaustion.
    how: |
        Indeni uses the built-in Fortinet "get system performance status" command to retrieve the device CPU utilization.
    without-indeni: |
        An administrator could log in and manually run the command via the CLI, check the system resources widget in the GUI, enable SNMP, configure a syslog server to receive a log message every 5 minutes containing the utilization, or use Fortinet FortiAnalyzer.
    can-with-snmp: true
    can-with-syslog: true

uptime-milliseconds:
    why: |
        Capture the uptime of the device. If the uptime is lower than the previous sample, the device must have reloaded.
    how: |
        Indeni uses the built-in Fortinet "get system performance status" command to retrieve the current device uptime.
    without-indeni: |
        An administrator could log in and manually run the command via the CLI, check the system resources widget in the GUI, enable SNMP, or use Fortinet FortiAnalyzer.
    can-with-snmp: true
    can-with-syslog: false

memory-free-kbytes:
    skip-documentation: true
memory-total-kbytes:
    skip-documentation: true
memory-used-kbytes:
    skip-documentation: true

#! REMOTE::SSH
get system performance status

#! PARSER::AWK

function writeCpuUsageMetric(id, cpuIdleAmount, cpuIsAverage) {
    sub(/%/, "", cpuIdleAmount)

    tags_cpu["cpu-id"] = id
    tags_cpu["cpu-is-avg"] = cpuIsAverage
    tags_cpu["resource-metric"] = "true"
    writeDoubleMetricWithLiveConfig("cpu-usage", tags_cpu, "gauge", 0, 100 - cpuIdleAmount, "CPU Usage", "percentage", "cpu-id")
}

# v5.4
#Memory states: 66% used
/^Memory states:/ {
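    # Strip the "%" sign instead of taking a fixed two-character substring,
    # so that values such as 5% and 100% are parsed correctly.
    memory_usage = $3
    sub(/%/, "", memory_usage)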

    # the following "RAM" tag value does NOT surface in the UI. It's here just to satisfy the
    # requirements of the rule -- for some reason, we need to have this tag _with_ a value for things
    # to function properly.

    tags_memory["name"] = "RAM"
    tags_memory["resource-metric"] = "true"
    writeDoubleMetricWithLiveConfig("memory-usage", tags_memory, "gauge", 0, memory_usage, "Memory Usage", "percentage", "")
}

# v5.6
#Memory: 1019996k total, 354312k used (34%), 665684k free (66%), 1616k buffers
/^Memory:/ {
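    # $6 looks like "(34%),". Keep only the digits so that one- and three-digit
    # percentages are parsed correctly as well.
    percent_memory_usage = $6
    gsub(/[^0-9]/, "", percent_memory_usage)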
    free = substr($7, 1, length($7) - 1)
    total = substr($2, 1, length($2) - 1)
    used = substr($4, 1, length($4) - 1)

    tags_memory["name"] = "Memory: Free"
    writeDoubleMetricWithLiveConfig("memory-free-kbytes", tags_memory, "gauge", "60", free, "Memory Usage", "kilobytes", "name")

    tags_memory["name"] = "Memory: Total"
    writeDoubleMetricWithLiveConfig("memory-total-kbytes", tags_memory, "gauge", "60", total, "Memory Usage", "kilobytes", "name")

    tags_memory["name"] = "Memory: Used"
    writeDoubleMetricWithLiveConfig("memory-used-kbytes", tags_memory, "gauge", "60", used, "Memory Usage", "kilobytes", "name")

    tags_memory["name"] = "Memory Usage"
    tags_memory["resource-metric"] = "true"
    writeDoubleMetricWithLiveConfig("memory-usage", tags_memory, "gauge", 0, percent_memory_usage, "Memory Usage", "percentage", "name")
}

# This section handles the "per core" metrics for CPU usage
# v5.4
#CPU0 states: 2% user 4% system 0% nice 94% idle
# v5.6
#CPU1 states: 6% user 8% system 0% nice 86% idle 0% iowait 0% irq 0% softirq
/^CPU[0-9]+ states:/ {
    writeCpuUsageMetric("Per Core - " $1, $9, "false")
}

# "CPU states:" shows the average CPU usage across all CPU cores
# v5.4
#CPU states: 8% user 10% system 0% nice 82% idle
# v5.6
#CPU states: 4% user 6% system 0% nice 90% idle 0% iowait 0% irq 0% softirq
/^CPU states:/ {
    writeCpuUsageMetric("Average - CPU", $9, "true")
}

#Uptime: 3 days,  6 hours,  10 minutes
/^Uptime:/ {
    days = $2
    hours = $4
    minutes = $6
    uptime_in_seconds = days * 86400 + hours * 3600 + minutes * 60
    # Display the uptime in Overview - Live Config (the metric value is written in milliseconds)
    writeDoubleMetricWithLiveConfig("uptime-milliseconds", null, "gauge", 0, (uptime_in_seconds*1000), "Device Uptime", "duration", "")
}



cross_vendor_uptime_low_vsx

package com.indeni.server.rules.library.templatebased.crossvendor

import com.indeni.apidata.time.TimeSpan
import com.indeni.apidata.time.TimeSpan.TimePeriod
import com.indeni.server.rules.RuleContext
import com.indeni.server.rules.library.{ConditionalRemediationSteps, ThresholdDirection, TimeThresholdOnDoubleMetricWithItemsTemplateRule}
import com.indeni.server.sensor.models.managementprocess.alerts.dto.AlertSeverity

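/**
  * Alerts when the "uptime-milliseconds" metric of a virtual system (tagged with "vs.name")
  * is below 60 minutes, which indicates that the virtual system has restarted recently.
  */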
case class cross_vendor_uptime_low_vsx(context: RuleContext) extends TimeThresholdOnDoubleMetricWithItemsTemplateRule(
      context,
      ruleName = "cross_vendor_uptime_low_vsx",
      ruleFriendlyName = "All Devices (VSX): Virtual systems restarted (uptime low)",
      ruleDescription = "Indeni will alert when a virtual system has restarted.",
      severity = AlertSeverity.CRITICAL,
      metricName = "uptime-milliseconds",
      threshold = TimeSpan.fromMinutes(60),
      metricUnits = TimePeriod.MILLISECOND,
      thresholdDirection = ThresholdDirection.BELOW,
      applicableMetricTag = "vs.name",
      alertItemsHeader = "Affected Virtual Systems",
      alertItemDescriptionFormat = "The current uptime is %.0f seconds which seems to indicate the virtual system has restarted.",
      alertItemDescriptionUnits = TimePeriod.SECOND,
      alertDescription = "Some virtual systems on this device have restarted recently. Review the list below.",
      baseRemediationText = "Determine why the virtual system(s) was restarted."
    )(
      ConditionalRemediationSteps.OS_NXOS ->
        """|
       |1. Use the "show version" or "show system reset-reason" NX-OS commands to display the reason for the reload.
       |2. Use the "show cores" command to determine if a core file was recorded during the unexpected reboot.
       |3. Run the "show process log" command to display the processes and whether a core was created.
       |4. Use the "show logging" command to review the events that happened close to the time of the reboot.
    """.stripMargin
    )