Firewall cluster heartbeat interface problem-fortinet-FortiOS

Vendor: fortinet

OS: FortiOS

Description:
If heartbeat communication fails, every cluster member assumes that it is the primary unit, resulting in multiple devices on the network with the same IP address and MAC address (a condition referred to as split brain). Traffic is disrupted until heartbeat communication is re-established.

Remediation Steps:

  1. Check that the heartbeat interfaces of each cluster unit are connected. Check the cables and interface LEDs. Use the Unit Operation dashboard widget, the system network interface list, or the cluster members list to verify that each interface that should be connected, including the heartbeat interfaces, actually is connected. If a link is down, re-verify the physical connection and replace network cables or switches as required.
  2. Log in via SSH to the Fortinet firewall and run the FortiOS command “get system ha status”. This command reports the cluster health and operating status, some cluster configuration details, and how long the cluster has been operating. It also shows how the primary unit was selected, the configuration synchronization status, usage statistics for each cluster unit, the heartbeat status, and the relative priorities of the cluster units.
  3. Configure and connect redundant heartbeat interfaces so that if one heartbeat interface fails or becomes disconnected, HA heartbeat traffic can continue over the backup heartbeat interface.
  4. Isolate heartbeat interfaces from user networks. Heartbeat packets contain sensitive cluster configuration information and can consume a considerable amount of network bandwidth.
  5. More information can be found in the “FortiOS™ Handbook - High Availability” and at the following link: https://docs.fortinet.com/uploaded/files/3997/fortigate-ha-56.pdf
  6. Contact Fortinet Technical Support at https://support.fortinet.com/ for further assistance.

How does this work?
The script runs the FortiOS command “get system ha status” to retrieve HA heartbeat link status information.

Why is this important?
This metric shows the status of each cluster unit’s heartbeat interfaces. If all of the heartbeat interfaces are down, the network impact is severe. More details can be found here: https://help.fortinet.com/fos50hlp/54/Content/FortiOS/fortigate-high-availability-52/HA_operating.htm

Without Indeni how would you find this?
An administrator can run the FortiOS command “get system ha status” over an SSH connection to retrieve the same information.
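
The parser that the script actually uses (get_system_ha_status.parser.1.awk) is not reproduced on this page. As a rough illustration of the kind of extraction involved, the Scala sketch below pulls the per-interface up/down state out of the heartbeat section of the command output. The sample output, the serial number, the assumed line shapes, and every name in the code (HaHeartbeatSketch, HbLink, parseHeartbeatLinks, downLinks) are assumptions made for illustration; the real output differs between FortiOS versions and should be checked on the device.

```scala
object HaHeartbeatSketch {
  // One heartbeat interface as reported for one cluster unit.
  final case class HbLink(unit: String, interface: String, up: Boolean)

  // Assumed line shapes (illustrative, not taken from the real parser):
  //   "HBDEV stats:"                                top-level section header
  //   "    FGVM0000000001(updated 1 seconds ago):"  cluster unit
  //   "        port3: physical/1000auto, up, ..."   interface state
  private val SectionLine = """^\S.*$""".r
  private val UnitLine    = """^\s+(\w+)\(updated .*\):\s*$""".r
  private val LinkLine    = """^\s+([\w.-]+): [^,]+, (up|down)\b.*$""".r

  // Hypothetical sample output, trimmed to the lines the sketch cares about.
  val sample: String =
    """HA Health Status: OK
      |HBDEV stats:
      |    FGVM0000000001(updated 1 seconds ago):
      |        port3: physical/1000auto, up, rx-bytes/packets/dropped/errors=100/2/0/0, tx=100/2/0/0
      |        port4: physical/1000auto, down, rx-bytes/packets/dropped/errors=0/0/0/0, tx=0/0/0/0
      |MONDEV stats:
      |    FGVM0000000001(updated 1 seconds ago):
      |        port1: physical/1000auto, up, rx-bytes/packets/dropped/errors=100/2/0/0, tx=100/2/0/0
      |""".stripMargin

  // Collects the heartbeat links listed under "HBDEV stats:" and ignores
  // every other section (for example the monitored interfaces).
  def parseHeartbeatLinks(output: String): List[HbLink] = {
    var inHbdev = false
    var unit = ""
    val links = List.newBuilder[HbLink]
    output.linesIterator.foreach {
      case l if l.startsWith("HBDEV stats:") => inHbdev = true
      case SectionLine()                     => inHbdev = false // any other unindented line starts a new section
      case UnitLine(serial) if inHbdev       => unit = serial
      case LinkLine(name, state) if inHbdev  => links += HbLink(unit, name, state == "up")
      case _                                 => ()
    }
    links.result()
  }

  // Heartbeat links that are reported as down.
  def downLinks(links: List[HbLink]): List[HbLink] = links.filterNot(_.up)
}
```

Calling `HaHeartbeatSketch.parseHeartbeatLinks(HaHeartbeatSketch.sample)` on the hypothetical sample yields two heartbeat links for the one unit, with port3 up and port4 down; the interface listed under the monitored-interface section is ignored.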

fortios-get-system-ha-status

name: fortios-get-system-ha-status
description: FortiGate Cluster High Availability
type: monitoring
monitoring_interval: 5 minutes
requires:
    vendor: fortinet
    os.name: FortiOS
    product: firewall
    high-availability: true
comments:
    ha-health-status:
        why: |
            Indicates whether all cluster units are operating normally (OK) or whether a problem was detected with the cluster. For example, a message similar to ERROR <serial-number> is lost @ <date> <time> appears if one of the subordinate units leaves the cluster. More details can be found here: https://help.fortinet.com/fos50hlp/54/Content/FortiOS/fortigate-high-availability-52/HA_operating.htm
        how: |
            The script runs the FortiOS command "get system ha status" to retrieve High Availability status information.
        can-with-snmp: true
        can-with-syslog: true
    ha-health-mode:
        why: |
            This metric collects and displays the HA mode of the cluster, for example, HA A-P or HA A-A. This metric should be the same for all the members of the cluster. More details can be found here: https://help.fortinet.com/fos50hlp/54/Content/FortiOS/fortigate-high-availability-52/HA_operating.htm
        how: |
            The script runs the FortiOS command "get system ha status" to retrieve HA mode information.
        can-with-snmp: true
        can-with-syslog: false
    ha-group-id:
        why: |
            This metric captures the configured group ID of the cluster, which should be the same for all members of the cluster. A misconfigured group ID may cause HA problems or dropped traffic. More details can be found here: https://help.fortinet.com/fos50hlp/54/Content/FortiOS/fortigate-high-availability-52/HA_operating.htm
        how: |
            The script runs the FortiOS command "get system ha status" to retrieve group ID information.
        can-with-snmp: true
        can-with-syslog: false
    ha-debug-status:
        why: |
            This metric captures the HA debug status of the cluster. It is recommended to keep debugging disabled on all members of the cluster to avoid additional hardware resource consumption. More details can be found here: https://help.fortinet.com/fos50hlp/54/Content/FortiOS/fortigate-high-availability-52/HA_operating.htm
        how: |
            The script runs the FortiOS command "get system ha status" to retrieve debug status information.
        can-with-snmp: false
        can-with-syslog: false
    ha-cluster-uptime:
        why: |
            This metric captures the number of days, hours, minutes, and seconds that the cluster has been operating. An unexpectedly low uptime should be investigated. More details can be found here: https://help.fortinet.com/fos50hlp/54/Content/FortiOS/fortigate-high-availability-52/HA_operating.htm
        how: |
            The script runs the FortiOS command "get system ha status" to retrieve uptime information.
        can-with-snmp: true
        can-with-syslog: false
    ha-session-pickup:
        why: |
            This metric captures the status of session pickup (enabled or disabled). When session-pickup is enabled, the FGCP synchronizes the primary unit's TCP session table to all cluster units. As soon as a new TCP session is added to the primary unit session table, that session is synchronized to all cluster units. This synchronization happens as quickly as possible to keep the session tables synchronized. If the primary unit fails, the new primary unit uses its synchronized session table to resume all TCP sessions that were being processed by the former primary unit with only minimal interruption. Under ideal conditions all TCP sessions should be resumed. This is not guaranteed though and under less than ideal conditions some TCP sessions may need to be restarted. More details can be found here: https://help.fortinet.com/fos50hlp/54/Content/FortiOS/fortigate-high-availability-52/HA_operating.htm
        how: |
            The script runs the FortiOS command "get system ha status" to retrieve HA session pickup status information.
        can-with-snmp: false
        can-with-syslog: false
    ha-session-override:
        why: |
            This metric captures the status of the override option for the current cluster unit (enable or disable). When override is disabled a cluster may not always renegotiate when an event occurs that affects primary unit selection. For example, when override is disabled a cluster will not renegotiate when you change a cluster unit device priority or when you add a new cluster unit to a cluster. This is true even if the unit added to the cluster has a higher device priority than any other unit in the cluster. Also, when override is disabled a cluster does not negotiate if the new unit added to the cluster has a failed or disconnected monitored interface. More details can be found here: https://help.fortinet.com/fos50hlp/54/Content/FortiOS/fortigate-high-availability-52/HA_operating.htm
        how: |
            The script runs the FortiOS command "get system ha status" to retrieve HA override status information.
        can-with-snmp: false
        can-with-syslog: false
    ha-config-sync-status:
        why: |
            This metric shows whether the configurations of the cluster units are synchronized. A configuration that is not synchronized can cause a service outage during a switchover and should be investigated. More details can be found here: https://help.fortinet.com/fos50hlp/54/Content/FortiOS/fortigate-high-availability-52/HA_operating.htm
        how: |
            The script runs the FortiOS command "get system ha status" to retrieve HA configuration sync status information.
        can-with-snmp: false
        can-with-syslog: false
    ha-heartbeat-link-status:
        why: |
            This metric shows the status of each cluster unit's heartbeat interfaces. If all of the heartbeat interfaces are down, the network impact is severe. More details can be found here: https://help.fortinet.com/fos50hlp/54/Content/FortiOS/fortigate-high-availability-52/HA_operating.htm
        how: |
            The script runs the FortiOS command "get system ha status" to retrieve HA heartbeat link status information.
        can-with-snmp: true
        can-with-syslog: true
    ha-heartbeat-total-bytes:
        why: |
            This metric captures how much data the heartbeat interfaces have processed. If none of the heartbeat interfaces is sending or receiving packets, this must be investigated, since it may have a severe network impact. More details can be found here: https://help.fortinet.com/fos50hlp/54/Content/FortiOS/fortigate-high-availability-52/HA_operating.htm
        how: |
            The script runs the FortiOS command "get system ha status" to retrieve the total number of bytes received via the heartbeat interfaces.
        can-with-snmp: false
        can-with-syslog: false
    mon-interface-link-status:
        why: |
            This setting shows the status of each of the monitored interfaces of the cluster. If a monitored interface on the primary unit fails, the cluster renegotiates to select a new primary unit using the process for Primary unit selection. Because the cluster unit with the failed monitored interface has the lowest monitor priority, a different cluster unit becomes the primary unit. The new primary unit should have fewer link failures. More details can be found here: https://help.fortinet.com/fos50hlp/54/Content/FortiOS/fortigate-high-availability-52/HA_operating.htm
        how: |
            The script runs the FortiOS command "get system ha status" to retrieve HA monitored interface status information.
        can-with-snmp: true
        can-with-syslog: true
    mon-interface-total-bytes:
        why: |
            This metric shows how much data the interfaces monitored by the cluster have processed. If none of the monitored interfaces is sending or receiving packets, this must be investigated, since it may have a severe network impact. More details can be found here: https://help.fortinet.com/fos50hlp/54/Content/FortiOS/fortigate-high-availability-52/HA_operating.htm
        how: |
            The script runs the FortiOS command "get system ha status" to retrieve the total number of bytes received via the monitored interfaces.
        can-with-snmp: false
        can-with-syslog: false
    ha-heartbeat-int-oper-status-over1:
        why: |
            The HA heartbeat keeps cluster units communicating with each other. The vendor strongly recommends configuring a minimum of two heartbeat interfaces (two links) per Fortinet firewall. If all the heartbeat interfaces are down, a severe network outage results. This metric checks whether each firewall has at least two operational HA heartbeat interfaces. More details can be found here: https://help.fortinet.com/fos50hlp/54/Content/FortiOS/fortigate-high-availability-52/HA_failoverHeartbeat.htm
        how: |
            The script runs the FortiOS command "get system ha status" to retrieve HA heartbeat interfaces information.
        can-with-snmp: false
        can-with-syslog: false
    model:
        why: |
            Two or more devices which operate as part of a single cluster must be running on the same hardware.
        how: |
            This script logs into the devices to retrieve the hardware model of the device. Indeni then compares the result to the same script run on other members of the same cluster.
        can-with-snmp: false
        can-with-syslog: false
steps:
-   run:
        type: SSH
        command: get system ha status
    parse:
        type: AWK
        file: get_system_ha_status.parser.1.awk
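
The ha-heartbeat-int-oper-status-over1 entry above describes a check for at least two operational heartbeat interfaces per firewall. The Scala sketch below shows one way such a check could be expressed once the heartbeat link states are available. It is not the logic of the AWK parser, which is not shown on this page; HeartbeatRedundancySketch, HbLink, MinOperationalLinks, and hasRedundantHeartbeat are hypothetical names, and the threshold of two is taken from the description.

```scala
object HeartbeatRedundancySketch {
  // One heartbeat interface as reported for one cluster unit (hypothetical type).
  final case class HbLink(unit: String, interface: String, up: Boolean)

  // Minimum number of operational heartbeat links per unit ("at least two").
  val MinOperationalLinks = 2

  // Maps each cluster unit to true when it has at least MinOperationalLinks
  // heartbeat links that are up. A unit with no heartbeat links in the input
  // does not appear in the result.
  def hasRedundantHeartbeat(links: Seq[HbLink]): Map[String, Boolean] =
    links.groupBy(_.unit).map { case (unit, unitLinks) =>
      unit -> (unitLinks.count(_.up) >= MinOperationalLinks)
    }
}
```

With two links up on one unit and only one link up on another, the first unit maps to true and the second to false, so a single failed cable is reported before the last heartbeat path is lost.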

FortinetHaHeartbeatLinkStatusRule

// Deprecation warning : Scala template-based rules are deprecated. Please use YAML format rules instead.

package com.indeni.server.rules.library.templatebased.fortinet

import com.indeni.server.rules.library.templates.StateDownTemplateRule

case class FortinetHaHeartbeatLinkStatusRule() extends StateDownTemplateRule(
  ruleName = "FortinetHaHeartbeatLinkStatusRule",
  ruleFriendlyName = "Fortinet Devices: Firewall cluster heartbeat interface problem",
  ruleDescription = "If heartbeat communication fails, every cluster member assumes that it is the primary unit, resulting in multiple devices on the network with the same IP address and MAC address (a condition referred to as split brain). Traffic is disrupted until heartbeat communication is re-established.",
  metricName = "ha-heartbeat-link-status",
  alertDescription = "Heartbeat communication has failed.",
  baseRemediationText = """1. Check that the heartbeat interfaces of each cluster unit are connected. Check the cables and interface LEDs. Use the Unit Operation dashboard widget, the system network interface list, or the cluster members list to verify that each interface that should be connected, including the heartbeat interfaces, actually is connected. If a link is down, re-verify the physical connection and replace network cables or switches as required.
                          |2. Log in via SSH to the Fortinet firewall and run the FortiOS command “get system ha status”. This command reports the cluster health and operating status, some cluster configuration details, and how long the cluster has been operating. It also shows how the primary unit was selected, the configuration synchronization status, usage statistics for each cluster unit, the heartbeat status, and the relative priorities of the cluster units.
                          |3. Configure and connect redundant heartbeat interfaces so that if one heartbeat interface fails or becomes disconnected, HA heartbeat traffic can continue over the backup heartbeat interface.
                          |4. Isolate heartbeat interfaces from user networks. Heartbeat packets contain sensitive cluster configuration information and can consume a considerable amount of network bandwidth.
                          |5. More information can be found in the “FortiOS™ Handbook - High Availability” and at the following link: https://docs.fortinet.com/uploaded/files/3997/fortigate-ha-56.pdf
                          |6. Contact Fortinet Technical Support at https://support.fortinet.com/ for further assistance.""".stripMargin
)()