Some VSes have high CPU usage-checkpoint-gaia

Vendor: checkpoint

OS: gaia

Description:
Indeni will alert when a virtual system's average CPU utilization exceeds the rule's threshold (70% in the rule definition below).

Remediation Steps:
Determine the cause for the high CPU usage of the listed cores. This may indicate that more cores need to be added.
Review the following article for further information on high CPU utilization on Check Point firewalls: https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solutionid=sk98348

How does this work?
Indeni issues a combination of Linux and Check Point commands to discover the processes and threads associated with a given VS, and then adds up the CPU usage, per CPU core, for each VS. Indeni reports both the average and the per-core usage, and alerts the user if the average usage is above a certain threshold.
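
As a worked example of that arithmetic, the sketch below takes the per-thread readings for VS 0 that appear as sample data in the parser comments (core ID and %CPU), sums them per core, and averages the total over the cores in use. The function name `vs_cpu_summary` is made up for illustration; the production logic is the AWK parser in the script further down.

```shell
# Sum per-thread %CPU readings per core, then average over the cores in use.
# Input lines: "<core_id> <cpu_percent>", one line per thread of a single VS.
vs_cpu_summary() {
    awk '
        NF == 2 {
            if (!($1 in per_core)) cores++   # first time this core is seen
            per_core[$1] += $2               # usage accumulated per core
            total += $2                      # usage accumulated for the VS
        }
        END {
            for (core in per_core)
                printf("core %s: %.1f\n", core, per_core[core])
            if (cores > 0)
                printf("average: %.1f\n", total / cores)
        }
    ' | LC_ALL=C sort   # awk array order is unspecified, so sort the lines
}

printf '0 0.2\n1 45.8\n0 0.3\n1 40.1\n' | vs_cpu_summary
```

With these four readings, core 0 carries 0.5%, core 1 carries 85.9%, and the average over the two cores is 43.2%. Because of the final sort, the average line is printed first.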

Why is this important?
High CPU usage could cause traffic to be dropped and may result in notable performance issues.

Without Indeni how would you find this?
An administrator could log in and manually issue the commands and add up the various results to check CPU usage.
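
A rough sketch of those manual steps for a single VS, using only commands that appear in the collection script below (`cpwd_admin list -ctx` and `ps -T`). The helper name `show_vs_threads` is hypothetical. The assumption that the PID sits in the third column of the `cpwd_admin` output is taken from that script, and the extra numeric check on that column is an addition here. The fwk*_dev processes are not listed by `cpwd_admin`, so they still have to be looked up separately with `ps`.

```shell
# List every thread (PID, core, command, %CPU) of the processes that
# cpwd_admin reports for one VSID. Intended to be run on the VSX gateway.
show_vs_threads() {
    vsid=$1
    cpwd_admin list -ctx "$vsid" |
        # Skip the header row and keep rows whose third column is a number
        # (assumed to be the PID, as in the collection script).
        awk '$0 !~ /^APP +CTX/ && $3 ~ /^[0-9]+$/ { print $3 }' |
        while read -r pid; do
            ps -T -p "$pid" -o pid,psr,comm,%cpu
        done
}

# Usage on a gateway (VSID 1 is only an example):
#   vsx stat -l          # find the VSIDs and names
#   show_vs_threads 1    # then add up the %CPU column per PSR (core) value
```

The output is one `ps` table per process, which still has to be totalled by hand per core.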

chkp-gaia-vs-cpu-vsx

#! META
name: chkp-gaia-vs-cpu-vsx
description: Records the CPU usage for virtual systems
type: monitoring
monitoring_interval: 2 minutes
includes_resource_data: true
requires:
    vendor: "checkpoint"
    os.name: "gaia"
    vsx: "true"
    role-firewall: "true"

#! COMMENTS
vs-cpu-usage:
    why: |
        High CPU usage could cause traffic to be dropped and may result in notable performance issues.
    how: |
        Indeni issues a combination of Linux and Check Point commands to discover the processes and threads associated with a given VS, and then adds up the CPU usage, per CPU core, for each VS. Indeni reports both the average and the per-core usage, and alerts the user if the average usage is above a certain threshold.
    without-indeni: |
        An administrator could log in and manually issue the commands and add up the various results to check CPU usage.
    can-with-snmp: false
    can-with-syslog: false
    vendor-provided-management: |
        Detailed CPU utilization data is not available for virtual systems, except via CLI. It is possible to also get this in SmartView Monitor but it is off by default.

live-config-only-vs-cpu-usage:
    skip-documentation: true

#! REMOTE::SSH
${nice-path} -n 15 ps -e -o pid,comm | grep fwk[[:digit:]]*_dev | awk '{ fwk_pid = $1; fwk_dev = $2; print("START----"); print(fwk_dev); system("ps -T -p " fwk_pid " -o pid,psr,comm,%cpu"); print("------END") }' | awk '/START----/{ print($0) }; /^fwk/{ fwk_vsid = $0; sub(/fwk/, "", fwk_vsid); sub(/_dev/, "", fwk_vsid); print("VSID: " fwk_vsid) }; /------END/{ print($0) }; /^[[:space:]]*[[:digit:]]/{ print($2 " " $4) }' ; ${nice-path} -n 15 vsx stat -l | awk '/^Name:/{ print $0 }; /^VSID:/{ print $0; vsid = $2; system("cpwd_admin list -ctx " vsid) }' | awk '/^VSID:/{ print("START----"); print($0); next; }; /^Name:/{ print($0); print("------END"); next }; { if($0 !~ /^APP +CTX/) print $3 ; next }' | while read pid; do if echo $pid | egrep "VSID|Name|START|END"; then : ; else ${nice-path} -n 15 ps -T -p $pid -o pid,psr,comm,%cpu | awk '/^[[:space:]]*[[:digit:]]/{ print($2 " " $4) }'; fi; done;


#! PARSER::AWK

# General Notes:
# The overall goal is to find both the average CPU usage AND per-core usage for each VS. There is no simple command to
# get this info on Check Point.
#
# The first problem is that CPU usage can be distributed across multiple cores. Moreover, a given process (in a VS) may
# run internal threads on different cores. So, we need to somehow discover all of the processes associated with a given
# VS (since there can be multiple VS instances on a given host). Once we know that, we need to look at the CPU usage for
# each thread for each process: different process threads can run on different cores. In this way, we can tally the
# total CPU usage on each core for a given process, and ultimately, a given VS, and then report this usage. If usage is
# over a certain threshold, then indeni will alert.
#
# Command-line Notes:
# The goal of the command is to dump CPU usage data for every thread in every process associated with a given VS. The
# AWK script takes it from there.
#
# The overall structure: two 'big' command lines separated by ;. The reason there are two: essentially, there are two
# different ways to find the processes associated with a given VS. The first is to look at the fwk*_dev processes reported
# by *nix 'ps'. fwk seems to stand for Fire Wall Kernel. Each is numbered per VS instance. The second way to view VS
# processes is using Check Point's 'cpwd_admin list -ctx <vsid>' command. For whatever reason, fwk*_dev processes are not
# reported here, but everything else is :) So, we need to look in both places to get all of the processes, and then
# collate the results. So: two command lines separated by ';':
#
# First part of SSH command up to (not including) call to 'vsx stat -l':
# greps ps output for fwk*_dev to get the pid for each fwk_dev process
# for each of these processes
#    using ps, lists all of the threads for the process, including the core ID and CPU usage for each thread
#
# Second part of command line starting with call to 'vsx stat -l':
# greps 'vsx stat -l' for the VSID and VS Name
# for each VSID
#    uses cpwd_admin list -ctx to get all of the processes associated with a given VSID
#    for each VS process
#       using ps, lists all of the threads for the process, including the core ID and CPU usage for each thread


# Example of one block of the SSH command's output, as consumed by the rule below:
#START----
#VSID: 0
#0 0.2
#1 45.8
#0 0.3
#1 40.1
#Name: VS_0
#------END
/^VSID:/ {
    vs_id = $2
    vs_name = ""

    # Loop through the data, summing the CPU usage values.
    get_line_code = getline  # TODO: if the first line we get is END, then we should log an exception
    while ($0 != "------END") {
        if (get_line_code < 1)  # A value of 0 means no more lines (EOF), value < 0 is an Error. Either way, exit the loop.
            break               # Note that this should never happen (but it did :), but this protects the code from an
                                # infinite loop in case there's no '------END' line at the end of the output.
        if ($0 ~ /^Name: /) {
            vsid_to_vs_name[vs_id] = $2
        } else {
            if (NF == 2) {    # TODO: if NF != 2, then we should log an exception
                core_id = $1
                thread_usage = $2

                vs_core = vs_id "," core_id
                if (!(vs_core in cores_used_by_vs)) {
                    cores_used_by_vs[vs_core] = ""
                    num_cores_per_vs[vs_id]++
                }

                total_per_vs[vs_id] += thread_usage
                total_per_vs_per_core[vs_core] += thread_usage
            }
        }

        get_line_code = getline
    }
    vs_name = ""
}

END {
    for (vs_id in total_per_vs) {
        if (vs_id in vsid_to_vs_name) {   # TODO: if vs_id is not in map, we should log an exception
            total_vs_usage = total_per_vs[vs_id]
            vs_avg_usage = total_vs_usage / num_cores_per_vs[vs_id]

            avg_tags["vs.name"] = vsid_to_vs_name[vs_id]
            avg_tags["vs.id"] = vs_id
            avg_tags["resource-metric"] = "true"
            writeDoubleMetricWithLiveConfig("vs-cpu-usage", avg_tags, "gauge", 0, vs_avg_usage, "Average CPU Usage per VS", "percentage", "vs.name|vs.id")
        }
    }

    for (vs_and_core in total_per_vs_per_core) {
        array_length = split(vs_and_core, vs_id_to_core, ",")
        if (array_length > 1) {   # TODO: if array_length is not > 1, we should log an exception
            vs_id = vs_id_to_core[1]
            if (vs_id in vsid_to_vs_name) {   # TODO: if vs_id is not in map, we should log an exception
                core_id = vs_id_to_core[2]
                core_usage = total_per_vs_per_core[vs_and_core]

                per_core_tags["vs.id"] = vs_id
                per_core_tags["cpu.core.id"] = core_id
                # The following metric is solely for reporting in the live-config. Can't use the same metric as above,
                # since it could easily cause a false-positive alert for per-core usage.
                writeDoubleMetricWithLiveConfig("live-config-only-vs-cpu-usage", per_core_tags, "gauge", 0, core_usage, "CPU Usage per Core per VS", "percentage", "vs.id|cpu.core.id")
            }
        }
    }
}
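
To see how the block handling behaves without a device, here is a condensed sketch of the same parsing idea run against made-up output. It is not the production parser: it uses pattern rules instead of the getline loop and prints plain text instead of calling `writeDoubleMetricWithLiveConfig`. The function name `parse_vs_blocks` and the sample values are assumptions for illustration.

```shell
# Condensed stand-in for the parser: prints "<vsid> <name> <average %CPU>".
parse_vs_blocks() {
    awk '
        /^VSID:/     { vs_id = $2; in_block = 1; next }
        /^------END/ { in_block = 0; next }
        in_block && /^Name: / { name[vs_id] = $2; next }
        in_block && NF == 2 && $1 ~ /^[0-9]+$/ {
            key = vs_id "," $1
            if (!(key in seen)) { seen[key] = 1; cores[vs_id]++ }
            total[vs_id] += $2
        }
        END {
            # As in the real parser, a VSID that never got a Name is skipped.
            for (id in total)
                if (id in name)
                    printf("%s %s %.1f\n", id, name[id], total[id] / cores[id])
        }
    ' | LC_ALL=C sort
}

# First block: fwk-style (no Name). Second: cpwd_admin-style, same VSID.
# Third: a VSID with thread data but no Name line.
printf '%s\n' \
    'START----' 'VSID: 0' '0 0.2' '1 45.8' '------END' \
    'START----' 'VSID: 0' '0 0.3' '1 40.1' 'Name: VS_0' '------END' \
    'START----' 'VSID: 1' '2 10.0' '------END' |
    parse_vs_blocks
```

The two blocks for VSID 0 accumulate into the same totals because everything is keyed by VSID, giving `0 VS_0 43.2`. VSID 1 produces no line because its name is unknown.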

check_point_vs_cpu

package com.indeni.server.rules.library.checkpoint

import com.indeni.server.rules.RuleContext
import com.indeni.server.rules.library.NumericThresholdOnDoubleMetricWithItemsTemplateRule

case class check_point_vs_cpu() extends NumericThresholdOnDoubleMetricWithItemsTemplateRule(
    ruleName = "check_point_vs_cpu",
    ruleFriendlyName = "Check Point (VSX): Some VSes have high CPU usage",
    ruleDescription = "indeni will alert when a virtual system's CPU utilization is too high.",
    metricName = "vs-cpu-usage",
    threshold = 70.0,
    applicableMetricTag = "vs.name",
    alertItemsHeader = "Affected Virtual Systems",
    alertItemDescriptionFormat = "The current CPU usage is %.0f%%",
    alertDescription = "Some VSes have high CPU utilization. This could mean slowdown of traffic or packet loss.",
    baseRemediationText = "Determine the cause for the high CPU usage of the listed cores. This may indicate that more cores need to be added.\nReview the following article for further information on high CPU utilization on Check Point firewalls. https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solutionid=sk98348"
)()
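
A rough illustration of what the rule's parameters amount to, assuming the template alerts on values strictly above `threshold` (the description says "above a certain threshold", and the template's exact comparison is not shown here). This is a shell/awk sketch, not how the Indeni rule engine evaluates metrics; `vs_over_threshold` and the sample values are hypothetical.

```shell
# Input lines: "<vs.name> <vs-cpu-usage>". Prints one alert item per VS whose
# usage exceeds the threshold, worded like alertItemDescriptionFormat.
vs_over_threshold() {
    awk -v threshold="${1:-70.0}" '
        NF == 2 && $2 + 0 > threshold + 0 {
            printf("%s: The current CPU usage is %.0f%%\n", $1, $2)
        }
    '
}

printf 'VS_0 43.2\nVS_1 85.9\nVS_2 70.0\n' | vs_over_threshold 70.0
```

Only VS_1 is reported here: VS_0 is below the threshold and VS_2 sits exactly on it, which under the strict comparison assumed above does not alert.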