High CPU usage per core(s)-juniper-junos

discobot · November 21, 2018, 12:45pm

Vendor: juniper

OS: junos

Description:
High CPU usage is a symptom of a system which is unable to handle the required load or a symptom of a specific issue with the system and the applications and services running on it. Indeni will monitor the CPU usage of each core separately and alert if any of the cores' CPU usage crosses the threshold.

Remediation Steps:
Determine the cause for the high CPU usage of the listed cores.
The Juniper SRX device may start dropping packets if CPU utilization reaches 100%. To determine the root cause of high CPU usage:
1. Check the CPU status in the routing engine by running "show chassis routing-engine" in the command-line interface.
2. Identify the processes consuming the most CPU cycles by running the command "show system processes extensive".

How does this work?
This script runs the “show chassis routing-engine node X” command over an SSH connection to retrieve the routing engine CPU usage.

Why is this important?
Control and data plane CPU utilization is important to track to ensure smooth operation. A high CPU utilization of the control plane may impact the management interface, while a high CPU utilization in the data plane may impact traffic handling.

Without Indeni how would you find this?
An administrator needs to log in to the device and run the “show chassis routing-engine node X” command to retrieve CPU utilization information at both the control and data plane levels.
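As a rough illustration of what reading CPU usage from that command's XML output could look like, here is a minimal sketch. The sample XML, the element names, and the "100 minus idle" derivation are assumptions made for the example; they are not taken from the Indeni parser files, and the object and function names are hypothetical.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory

object RoutingEngineCpu {
  // Hypothetical sample shaped like "show chassis routing-engine | display xml" output.
  // The element names are assumptions, not copied from the actual parser.
  val sampleXml: String =
    """<route-engine-information>
      |  <route-engine>
      |    <cpu-user>12</cpu-user>
      |    <cpu-system>8</cpu-system>
      |    <cpu-idle>80</cpu-idle>
      |  </route-engine>
      |</route-engine-information>""".stripMargin

  /** Returns one usage value (100 minus idle) for every <cpu-idle> element found. */
  def cpuUsagePercent(xml: String): Seq[Double] = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    val doc = builder.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))
    val idleNodes = doc.getElementsByTagName("cpu-idle")
    (0 until idleNodes.getLength).map { i =>
      100.0 - idleNodes.item(i).getTextContent.trim.toDouble
    }
  }
}
```

Calling `RoutingEngineCpu.cpuUsagePercent(RoutingEngineCpu.sampleXml)` on the sample yields a single value of 20.0, since the one routing engine reports 80% idle.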

junos-show-chassis-routing-engine-cluster

name: junos-show-chassis-routing-engine-cluster
description: Retrieve CPU and memory statistics for the Routing Engine.
type: monitoring
includes_resource_data: true
monitoring_interval: 1 minute
requires:
  vendor: juniper
  os.name: junos
  product: firewall
  high-availability: true
comments:
  cpu-usage:
    why: |
      Control and data plane CPU utilization is important to track to ensure smooth operation.
      A high CPU utilization of the control plane may impact the management interface, while a high CPU utilization in the data plane may impact traffic handling.
    how: |
      This script runs the "show chassis routing-engine node X" command over an SSH connection to retrieve the routing engine CPU usage.
    can-with-snmp: false
    can-with-syslog: false
  memory-usage:
    why: |
      The various memory components are important to ensure smooth operation. They include control and data plane memory usage.
    how: |
      This script runs the "show chassis routing-engine node X" command over an SSH connection to retrieve the memory usage for both the control and data planes.
    can-with-snmp: false
    can-with-syslog: false
  memory-total-kbytes:
    why: |
      Tracking total memory on the system is critical to evaluate and assess current memory utilization.
    how: |
      This script runs the "show chassis routing-engine node X" command over an SSH connection to retrieve the total memory for both the control and data planes.
    can-with-snmp: false
    can-with-syslog: false
  memory-free-kbytes:
    why: |
      Tracking free memory on the system is critical to evaluate memory utilization and identify possible memory leaks.
    how: |
      This script runs the "show chassis routing-engine node X" command over an SSH connection to retrieve the free memory for both the control and data planes.
    can-with-snmp: false
    can-with-syslog: false

steps:
  -   run:
        type: SSH
        command: show chassis hardware node local | display xml
      parse:
        type: XML
        file: show-chassis-routing-engine-cluster.parser.1.xml.yaml
  -   run:
        type: SSH
        command: show chassis routing-engine node ${node} | display xml
      parse:
        type: XML
        file: show-chassis-routing-engine-cluster.parser.2.xml.yaml
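The second step's command contains a `${node}` placeholder, which presumably is filled from a value produced by the first step's parser. How the collector performs that substitution is not shown here, so the following is only a sketch of the idea; `CommandTemplate`, `render`, and the example value `"0"` are hypothetical.

```scala
import scala.util.matching.Regex

object CommandTemplate {
  // Matches placeholders of the form ${name}.
  private val Placeholder: Regex = """\$\{([A-Za-z0-9_-]+)\}""".r

  /** Replaces each ${name} with its value from vars; unknown names are left untouched. */
  def render(template: String, vars: Map[String, String]): String =
    Placeholder.replaceAllIn(template, m =>
      // quoteReplacement keeps "$" and "\" in the substituted text literal.
      Regex.quoteReplacement(vars.getOrElse(m.group(1), m.matched)))
}
```

For example, `CommandTemplate.render("show chassis routing-engine node ${node} | display xml", Map("node" -> "0"))` produces `show chassis routing-engine node 0 | display xml`.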

high_per_core_cpu_use_by_device

package com.indeni.server.rules.library.core
import com.indeni.apidata.time.TimeSpan
import com.indeni.ruleengine.expressions.conditions.{And, ConditionHelper, GreaterThanOrEqual}
import com.indeni.ruleengine.expressions.core.{StatusTreeExpression, _}
import com.indeni.ruleengine.expressions.data.{SelectTagsExpression, SelectTimeSeriesExpression, TimeSeriesExpression}
import com.indeni.ruleengine.expressions.math.MinExpression
import com.indeni.ruleengine.expressions.scope.ScopeValueExpression
import com.indeni.server.common.ParameterValue
import com.indeni.server.common.data.conditions.True
import com.indeni.server.params.ParameterDefinition
import com.indeni.server.params.ParameterDefinition.UIType
import com.indeni.server.rules._
import com.indeni.server.rules.config.expressions.DynamicParameterExpression
import com.indeni.server.rules.library.{ConditionalRemediationSteps, PerDeviceRule, RuleHelper}
import com.indeni.server.sensor.models.managementprocess.alerts.dto.AlertSeverity


case class HighPerCoreCpuUsageRule() extends PerDeviceRule with RuleHelper {

  private val highThresholdParameterName = "high_threshold_of_cpu_usage"
  private val highThresholdParameter = new ParameterDefinition(highThresholdParameterName,
    "",
    "High Threshold of CPU Usage",
    "The CPU usage threshold that, once crossed, triggers an issue.",
    UIType.DOUBLE,
    new ParameterValue((70.0).asInstanceOf[Object])
  )

  private val minimumNumberOfCoresParameterName = "higher_than_threshold_cores"
  private val minimumNumberOfCoresParameter = new ParameterDefinition(minimumNumberOfCoresParameterName,
    "",
    "Number of Cores",
    "The number of CPU cores with usage above the value set in " +
      "\"" + highThresholdParameterName + "\"" +
      " before an issue is triggered.",
    UIType.INTEGER,
    new ParameterValue((1).asInstanceOf[Object])
  )

  private val timeThresholdParameterName = "time_threshold"
  private val timeThresholdParameter = new ParameterDefinition(timeThresholdParameterName,
    "",
    "Time Threshold",
    "The CPU cores need to remain above the value set in " +
      "\"" + highThresholdParameterName + "\"" +
      " for this amount of time before an issue is triggered.",
    UIType.TIMESPAN,
    TimeSpan.fromMinutes(10))


  override def metadata: RuleMetadata = RuleMetadata.builder("high_per_core_cpu_use_by_device", "High CPU usage per core(s)",
    "High CPU usage is a symptom of a system which is unable to handle " +
      "the required load or a symptom of a specific issue with the system " +
      "and the applications and services running on it. Indeni will monitor the CPU usage " +
      "of each core separately and alert if any of the cores' CPU usage crosses the threshold.",
    AlertSeverity.ERROR,
    Set(RuleCategory.HealthChecks),
    deviceCategory = DeviceCategory.AllDevices).
    configParameters(highThresholdParameter, minimumNumberOfCoresParameter, timeThresholdParameter).
    build()

  override def expressionTree(context: RuleContext): StatusTreeExpression = {
    val inUseValue = MinExpression(TimeSeriesExpression[Double]("cpu-usage"))

    StatusTreeExpression(
      // Which objects to pull (normally, devices)
      SelectTagsExpression(context.metaDao, Set(DeviceKey), True),

      // What constitutes an issue
      StatusTreeExpression(

        // The additional tags we care about (we'll be including this in alert data)
        SelectTagsExpression(context.tsDao, Set("cpu-id", "cpu-is-avg"), True),

        StatusTreeExpression(
          // The time-series we check the test condition against:
          SelectTimeSeriesExpression[Double](context.tsDao, Set("cpu-usage"), historyLength = getParameterTimeSpanForRule(timeThresholdParameter), denseOnly = true),

          // The condition which, if true, we have an issue. Checked against the time-series we've collected
          And(
            ScopeValueExpression("cpu-is-avg").visible().isIn(Set("true")).not,
            GreaterThanOrEqual(
              inUseValue,
              getParameterDouble(highThresholdParameter)))

          // The Alert Item to add for this specific item
        ).withSecondaryInfo(
          scopableStringFormatExpression("${scope(\"cpu-id\")}"),
          scopableStringFormatExpression("Current CPU utilization is: %.0f%%", inUseValue),
          title = "Cores with High CPU Usage"
        ).asCondition()
      ).withoutInfo().asCondition(minimumIssueCount = DynamicParameterExpression.withConstantDefault(minimumNumberOfCoresParameter.getName, minimumNumberOfCoresParameter.getDefaultValue.asInteger().intValue()))
    ).withRootInfo(
      getHeadline(),
      ConstantExpression("Some CPU cores are under high usage."),
      ConditionalRemediationSteps("Determine the cause for the high CPU usage of the listed cores.",
        RemediationStepCondition.VENDOR_CP -> "An extremely helpful article on high CPU utilization on Check Point firewalls is available here: <a target=\"_blank\" href=\"https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solutionid=sk98348\">Best Practices - Security Gateway Performance</a>.",
        RemediationStepCondition.VENDOR_PANOS ->
          """|For MP (management plane) CPU, look at <a target="_blank" href="https://live.paloaltonetworks.com/t5/Featured-Articles/Tips-amp-Tricks-Reducing-Management-Plane-Load/ta-p/64681">Tips & Tricks: Reducing Management Plane Load</a>.
             |For DP (data plane), read <a target="_blank" href="https://live.paloaltonetworks.com/t5/Featured-Articles/How-to-Troubleshoot-High-Dataplane-CPU/ta-p/73000">How to Troubleshoot High Dataplane CPU</a>.""".stripMargin,
        RemediationStepCondition.VENDOR_JUNIPER ->
          """|The Juniper SRX device may start dropping packets if CPU utilization reaches 100%. In order to determine the root cause of high CPU usage:
             |1. Check the CPU status in the routing engine by running "show chassis routing-engine" in the command-line interface.
             |2. Identify the processes consuming the most CPU cycles by running the command "show system processes extensive". Consider restarting or ending processes if too many events are being handled (e.g. sampling, traceoptions, syslog, snmp).
             |3. Check CPU utilization in the forwarding engine by running "show chassis forwarding". If CPU is high it may be indicative of the device reaching capacity.
             |See <a target="_blank" href="https://kb.juniper.net/InfoCenter/index?page=content&id=KB20989">Juniper Tech Library</a> on CPU load thresholds or contact Juniper technical support for further troubleshooting.""".stripMargin,
        RemediationStepCondition.VENDOR_FORTINET ->
          """
             |1. Login via https to the Fortinet firewall and go to menu "System > Dashboard > Status" and look at the system resources widget to review the current CPU utilization graph.
             |2. Login via ssh to the Fortinet firewall and run the FortiOS command “get system performance status”. The first line of output shows the CPU usage by category. The other lines of the output, such as average network usage, average session setup rate, viruses caught, and IPS attacks blocked can also help to determine why system resource usage is high. For example, if network usage is high it will result in high traffic processing on the FortiGate; or if the session setup rate is very low or zero the proxy may be overloaded and not able to do its job.
             |3. Login via ssh to the Fortinet firewall and run the FortiOS command “get system performance top”. This command shows all the top processes running on the FortiGate unit and their CPU usage. If a process is using most of the CPU cycles, investigate it to determine if it’s normal activity. If the top few entries are using most of the CPU, note which processes they are and investigate those features to try and reduce their CPU load. Some common examples of processes you will see include:  ipsengine, scanunitd (antivirus), iked and sshd.
             |4. For more information review: https://docs.fortinet.com/uploaded/files/2924/troubleshooting-54.pdf
             |5. If the problem persists, contact Fortinet Technical support at https://support.fortinet.com/ for further assistance.""".stripMargin,
        RemediationStepCondition.VENDOR_BLUECOAT ->
        """
            |1. Login via https to the ProxySG and go to Statistics > System > Resources > CPU. Review the current CPU utilization graph.
            |2. Login via ssh to the ProxySG and run the command "show cpu" which provides information about current CPU usage.
            |3. Try to troubleshoot the cause using the CPU monitor feature: https://x.x.x.x:8082/Diagnostics/CPU_Monitor/statistics
            |4. Check the ICAP service maximum number of connections. For more information review the following Bluecoat guide:
            |- https://support.symantec.com/en_US/article.TECH242911.html
            |5. If the problem persists, contact Symantec Technical support at https://support.symantec.com for further assistance.
        """.stripMargin,
        RemediationStepCondition.OS_CISCO_ASA ->
          """
            |1. Issue the “show cpu usage” command, and verify the cpu utilization.
            |2. Identify the processes that consume most of the CPU resources with the next command “show processes cpu-usage sorted non-zero”.
            |3. Run the “show memory detail” command, and verify that the memory used by the ASA is normal utilization.
            |4. Verify that the connection count in “show xlate count” is low.
            |5. Check for input and output errors with the “show interface” command, since such errors affect CPU utilization and the dispatch process.
            |6. Issue the “show traffic” command to identify any traffic spike.
            |7. Use the “show local-host” command to see whether the network is experiencing a denial-of-service attack.
            |8. For more information refer to the following official Cisco troubleshooting <a target="_blank" href="https://www.cisco.com/c/en/us/support/docs/security/asa-5500-x-series-next-generation-firewalls/113185-asaperformance.html">guide</a>.
            |9. If the problem is not identified, review the Cisco bug repository for known bugs and finally contact the Cisco Technical Assistance Center (TAC).
          """.stripMargin,
      )
    )
  }
}
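Stripped of the rule-engine types, the condition above appears to boil down to this: a core counts as problematic when the minimum of its "cpu-usage" samples over the time threshold window is at or above the high threshold, series tagged "cpu-is-avg" = "true" are ignored, and an issue is raised once enough cores qualify. The sketch below restates that logic in plain Scala. `PerCoreCpuCheck`, `CoreSeries`, and the function names are hypothetical; only the defaults (70.0 and 1) come from the rule's parameters.

```scala
object PerCoreCpuCheck {
  /** One core's samples within the time window; isAvg marks the aggregate series the rule skips. */
  final case class CoreSeries(cpuId: String, isAvg: Boolean, samples: Seq[Double])

  /** IDs of non-average cores whose lowest sample in the window is still at or above the threshold. */
  def coresOverThreshold(series: Seq[CoreSeries], threshold: Double): Seq[String] =
    series.collect {
      case CoreSeries(id, false, samples) if samples.nonEmpty && samples.min >= threshold => id
    }

  /** True when at least minCores cores stayed at or above the threshold for the whole window. */
  def shouldAlert(series: Seq[CoreSeries], threshold: Double = 70.0, minCores: Int = 1): Boolean =
    coresOverThreshold(series, threshold).size >= minCores
}
```

Using the minimum rather than the latest sample means a single dip below the threshold inside the window is enough to keep a core off the list, which is what makes the time threshold act as a "sustained for this long" requirement.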