Configuration changed on standby member-paloaltonetworks-panos

warn
high-availability
panos
paloaltonetworks

#1

Vendor: paloaltonetworks

OS: panos

Description:
Making configuration changes on the standby member of a cluster is generally not recommended. Indeni will trigger an issue if this happens.

Remediation Steps:
Make the configuration changes to the active member of the cluster.

How does this work?
This alert logs into the Palo Alto Networks firewall through SSH and retrieves the difference between the committed configuration and the saved configuration. If a change is found, an alert is issued.

Why is this important?
After changing the configuration of a device it is always important to remember to commit the changes. In the case of Palo Alto Networks, without committing the changes they will not take effect. A common issue is when an administrator makes certain changes, does not commit them, and walks away. Another administrator will log on later, make their own changes and commit them. In the process, they will be committing the other administrator’s changes, potentially causing issues.

Without Indeni how would you find this?
The web interface on a Palo Alto Networks firewall provides an indication of whether or not there is a change which requires committing. Failing to notice that, a user would run into the problem described above.

panos-show-config-diff

#! META
name: panos-show-config-diff
description: check to see if there's a difference in configuration (between the saved config and the committed config) 
type: monitoring
monitoring_interval: 15 minutes
requires:
    vendor: paloaltonetworks
    os.name: panos
    product: firewall

#! COMMENTS
config-unsaved:
    why: |
        After changing the configuration of a device it is always important to remember to commit the changes. In the case of Palo Alto Networks, without committing the changes they will not take effect. A common issue is when an administrator makes certain changes, does not commit them, and walks away. Another administrator will log on later, make their own changes and commit them. In the process, they will be committing the other administrator's changes, potentially causing issues.
    how: |
        This alert logs into the Palo Alto Networks firewall through SSH and retrieves the difference between the committed configuration and the saved configuration. If a change is found, an alert is issued.
    without-indeni: |
        The web interface on a Palo Alto Networks firewall provides an indication of whether or not there is a change which requires committing. Failing to notice that, a user would run into the problem described above.
    can-with-snmp: false
    can-with-syslog: false

#! REMOTE::SSH
show config diff

#! PARSER::AWK
BEGIN {
	unsaved=0
}

# If there's a "@@" or added ("+") or removed ("-") line, then there's a difference.
/(@@|\+|-)/ {
	unsaved=1
}

END {
	writeDoubleMetricWithLiveConfig("config-unsaved",null,"gauge",300,unsaved, "Configuration Unsaved?", "boolean", "")
}
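
For readers who want to try the detection logic outside of Indeni, here is a minimal Python sketch of the same idea. The function name and the sample diff text are hypothetical, not taken from a real device. Unlike the AWK pattern above, which matches these characters anywhere in a line, this sketch only looks at the start of each line; that is an assumption about the diff layout, not something the script confirms.

```python
def config_unsaved(diff_output: str) -> int:
    """Return 1 if the text looks like it contains a diff, else 0.

    Assumption: the output resembles a unified diff, where hunk headers
    start with "@@" and changed lines start with "+" or "-".
    """
    for line in diff_output.splitlines():
        if line.startswith(("@@", "+", "-")):
            return 1
    return 0


# Hypothetical sample output, for illustration only.
SAMPLE_DIFF = """--- running
+++ candidate
@@ -10,3 +10,3 @@
-  hostname fw-a;
+  hostname fw-b;
"""

if __name__ == "__main__":
    print(config_unsaved(SAMPLE_DIFF))
    print(config_unsaved(""))
```

An empty output yields 0, which corresponds to the "nothing to commit" case.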

panos-show-high-availability-all-monitoring

#! META
name: panos-show-high-availability-all-monitoring
description: Track health of HA
type: monitoring
monitoring_interval: 5 minute
requires:
    vendor: paloaltonetworks
    os.name: panos
    "high-availability": true

#! COMMENTS
cluster-member-active:
    why: |
        Tracking the state of a cluster member is important. If a cluster member which used to be the active member of the cluster no longer is, it may be the result of an issue. In some cases, it is due to maintenance work (and so was anticipated), but in others it may be due to a failure in the firewall or another component in the network.
    how: |
        This script uses the Palo Alto Networks API to retrieve the status of the high availability function of the firewall and specifically retrieves the local member's state.
    without-indeni: |
        The status of high availability is visible in the web interface, as a widget on the main screen.
    can-with-snmp: true
    can-with-syslog: true
cluster-state:
    why: |
        Tracking the state of a cluster is important. If a cluster which used to be healthy no longer is, it may be the result of an issue. In some cases, it is due to maintenance work (and so was anticipated), but in others it may be due to a failure in the members of the cluster or another component in the network.
    how: |
        This script uses the Palo Alto Networks API to retrieve the status of the high availability function of the cluster and specifically retrieves the local member's and peer's states.
    without-indeni: |
        The status of high availability is visible in the web interface, as a widget on the main screen.
    can-with-snmp: true
    can-with-syslog: true
cluster-preemption-enabled:
    why: |
        Preemption is a function in clustering which sets a primary member of the cluster to always strive to be the active member. The trouble with this is that if the active member that is set with preemption on has a critical failure and reboots, the cluster will fail over to the secondary and then immediately fail over back to the primary when it completes the reboot. This can result in another crash and the process would happen again and again in a loop. The Palo Alto Networks firewalls have a means of dealing with this ( https://live.paloaltonetworks.com/t5/Learning-Articles/Understanding-Preemption-with-the-Configured-Device-Priority-in/ta-p/53398 ) but it is generally a good idea not to have the preemption feature enabled.
    how: |
        This script uses the Palo Alto Networks API to retrieve the status of the high availability function of this cluster member and specifically the preemption setting.
    without-indeni: |
        Going into a preemption loop is difficult to detect. Normally an administrator will notice service disruption. Then through manual inspection the administrator will determine there is a preemption loop.
    can-with-snmp: true
    can-with-syslog: true
cluster-config-synced:
    why: |
        Normally two Palo Alto Networks firewalls in a cluster work together to ensure their configurations are synchronized. Sometimes, due to connectivity or other issues, the configuration sync may be lost. In the event of a fail over, the secondary member will take over but will be running with a different configuration compared to the primary (the original active member). This can result in service disruption.
    how: |
        This script uses the Palo Alto Networks API to retrieve the status of the high availability function of this cluster and specifically the status of the config synchronization.
    without-indeni: |
        The status of configuration sync is visible in the web interface, as a widget on the main screen.
    can-with-snmp: true
    can-with-syslog: true
device-is-passive:
    why: |
        This metric describes whether this device is the passive device. Port-down alerts should not be triggered for a passive device.
    how: |
        This script uses the Palo Alto Networks API to retrieve the active/passive state of the device.
    without-indeni: |
        The active/passive status is visible in the web interface.
    can-with-snmp: true
    can-with-syslog: true
passive-link-state:
    why: |
        This metric describes whether passive-link-state is set to shutdown or auto. When it is set to shutdown, ports on the passive device are expected to be in a power-down state, so this metric can be used to suppress port-down alerts for them.
    how: |
        This script uses the Palo Alto Networks API to retrieve the passive-link-state setting of the device.
    without-indeni: |
        The passive-link-state setting can be found via the web interface or the CLI.
    can-with-snmp: true
    can-with-syslog: true

#! REMOTE::HTTP
url: /api?type=op&cmd=<show><high-availability><all></all></high-availability></show>&key=${api-key}
protocol: HTTPS

#! PARSER::XML
_vars:
    root: /response/result
_metrics:
    -
        _tags:
            "im.name":
                _constant: "cluster-member-active"
            "name":
                _constant: "Firewall Clustering"
            "live-config":
                _constant: "true"
            "display-name":
                _constant: "Cluster Member State (this)"
            "im.dstype.displayType":
                _constant: "state"
        _temp:
            state:
                _text: "${root}/group/local-info/state"
        _transform:
            _value.double: |
                {
                    if (temp("state") ~ /^(active|active-primary|active-secondary)/) {
                        print "1"
                    } else {
                        print "0"
                    }
                }
    -
        _tags:
            "im.name":
                _constant: "device-is-passive"
        _temp:
            state:
                _text: "${root}/group/local-info/state"
        _transform:
            _value.double: |
                {
                    if (temp("state") ~ /^(active|active-primary|active-secondary)/) {
                        print "0"
                    } else {
                        print "1"
                    }
                }
    -
        _tags:
            "im.name":
                _constant: "passive-link-state"
        _temp:
            "passivelinkstate":
                _count: "${root}/group/local-info/active-passive/passive-link-state[. = 'shutdown']"
        _transform:
            _value.double: |
                {
                    if (temp("passivelinkstate") > 0) {
                        print "0"
                    } else {
                        print "1"
                    }
                }
    -
        _tags:
            "im.name":
                _constant: "cluster-state"
            "name":
                _constant: "Firewall Clustering"
            "live-config":
                _constant: "true"
            "display-name":
                _constant: "Cluster State"
            "im.dstype.displayType":
                _constant: "state"
        _temp:
            localstate:
                _text: "${root}/group/local-info/state"
            peerstate:
                _text: "${root}/group/peer-info/state"
        _transform:
            _value.double: |
                {
                    if (temp("localstate") != "down" && temp("peerstate") != "down" && temp("peerstate") != "unknown" && temp("peerstate") != "suspended") {
                        print "1"
                    } else {
                        print "0"
                    }
                }
    -
        _tags:
            "im.name":
                _constant: "cluster-config-synced"
            "live-config":
                _constant: "true"
            "display-name":
                _constant: "Cluster Configuration Synced"
            "im.dstype.displayType":
                _constant: "boolean"
        _temp:
            runningsync:
                _text: "${root}/group/running-sync"
        _transform:
            _value.double: |
                {
                    if (temp("runningsync") == "synchronized") {
                        print "1"
                    } else {
                        print "0"
                    }
                }
    -
        _tags:
            "im.name":
                _constant: "cluster-preemption-enabled"
            "live-config":
                _constant: "true"
            "display-name":
                _constant: "Preemption Enabled"
            "im.dstype.displayType":
                _constant: "boolean"
        _temp:
            preemptive:
                _text: "${root}/group/local-info/preemptive"
        _transform:
            _value.double: |
                {
                    if (temp("preemptive") == "yes") {
                        print "1"
                    } else {
                        print "0"
                    }
                }
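
As a rough illustration of what the XML parser above computes, the following Python sketch applies the same element paths and comparisons to a sample response. The sample XML is hypothetical and heavily trimmed; only the element paths are taken from the script. The function name `ha_metrics` is likewise an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed response; a real device returns many more fields.
SAMPLE_XML = """<response status="success"><result><group>
  <local-info>
    <state>passive</state>
    <preemptive>no</preemptive>
    <active-passive><passive-link-state>shutdown</passive-link-state></active-passive>
  </local-info>
  <peer-info><state>active</state></peer-info>
  <running-sync>synchronized</running-sync>
</group></result></response>"""


def ha_metrics(xml_text: str) -> dict:
    """Derive the six metrics from /response/result/group."""
    group = ET.fromstring(xml_text).find("result/group")

    def text(path: str) -> str:
        node = group.find(path) if group is not None else None
        return (node.text or "").strip() if node is not None else ""

    local, peer = text("local-info/state"), text("peer-info/state")
    # active, active-primary and active-secondary all start with "active".
    active = 1.0 if local.startswith("active") else 0.0
    link = text("local-info/active-passive/passive-link-state")
    healthy = local != "down" and peer not in ("down", "unknown", "suspended")
    return {
        "cluster-member-active": active,
        "device-is-passive": 1.0 - active,
        "passive-link-state": 0.0 if link == "shutdown" else 1.0,
        "cluster-state": 1.0 if healthy else 0.0,
        "cluster-config-synced": 1.0 if text("running-sync") == "synchronized" else 0.0,
        "cluster-preemption-enabled": 1.0 if text("local-info/preemptive") == "yes" else 0.0,
    }


if __name__ == "__main__":
    for name, value in ha_metrics(SAMPLE_XML).items():
        print(name, value)
```

Note that `passive-link-state` reports 0 when the setting is `shutdown`, mirroring the transform in the script.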

cross_vendor_config_change_on_standby

package com.indeni.server.rules.library

import com.indeni.apidata.time.TimeSpan
import com.indeni.ruleengine.expressions.conditions.{And, Equals, GreaterThanOrEqual}
import com.indeni.ruleengine.expressions.core._
import com.indeni.ruleengine.expressions.data.{SelectTagsExpression, SelectTimeSeriesExpression, TimeSeriesExpression}
import com.indeni.server.common.data.conditions.True
import com.indeni.server.rules.library.core.PerDeviceRule
import com.indeni.server.rules.{RuleContext, _}
import com.indeni.server.sensor.models.managementprocess.alerts.dto.AlertSeverity


case class ConfigChangeOnStandbyMemberRule() extends PerDeviceRule with RuleHelper {

  override val metadata: RuleMetadata = RuleMetadata.builder("cross_vendor_config_change_on_standby", "Clustered Devices: Configuration changed on standby member",
    "Generally, making configuration changes to the standby member of a device is not recommended. indeni will trigger an issue if this happens.",
    AlertSeverity.WARN).interval(TimeSpan.fromMinutes(5)).build()

  override def expressionTree(context: RuleContext): StatusTreeExpression = {
    val configUnsavedValue = TimeSeriesExpression[Double]("config-unsaved").last
    val memberStateValue = TimeSeriesExpression[Double]("cluster-member-active").last
    val configSyncValue = TimeSeriesExpression[Double]("cluster-config-synced").last

    StatusTreeExpression(
      // Which objects to pull (normally, devices)
      SelectTagsExpression(context.metaDao, Set(DeviceKey), True),

      StatusTreeExpression(
        // The time-series we check the test condition against:
        SelectTimeSeriesExpression[Double](context.tsDao, Set("config-unsaved", "cluster-member-active", "cluster-config-synced"), denseOnly = false),

        // The condition which, if true, we have an issue. Checked against the time-series we've collected
        And(
          Equals(configUnsavedValue, ConstantExpression(Some(1.0))),
          Equals(memberStateValue, ConstantExpression(Some(0.0))),
          GreaterThanOrEqual(configSyncValue, ConstantExpression(Some(0.0))))
      ).withoutInfo().asCondition()

      // Details of the alert itself
    ).withRootInfo(
      getHeadline(),
      ConstantExpression("The configuration has been changed on this device, but it's not the active member of the cluster. Best practices recommend making changes to the active member of a cluster and then syncing to the standby."),
      ConditionalRemediationSteps("Make the configuration changes to the active member of the cluster.",
        ConditionalRemediationSteps.OS_NXOS ->
          """1. Save the configuration by executing the "copy running-config startup-config" command. Note: Network admin role is required to execute this command.
            |2. Check that there are no unsaved configuration changes by running the "show running-config diff" command on the switches.
            |3. Consider creating snapshots of the configuration by utilizing the Checkpoint and Rollback NX-OS features. The NX-OS checkpoint and rollback feature are extremely useful, and a life saver in some cases, when a new configuration change to a production system has caused unwanted effects or was incorrectly made/planned and we need to immediately return to an original/stable configuration.
            |4. For more information review the following article: <a target="_blank" href="http://www.firewall.cx/cisco-technical-knowledgebase/cisco-data-center/1202-cisco-nexus-checkpoint-rollback-feature.html">Guide to Nexus checkpoint & rollback feature</a>""".stripMargin,
        ConditionalRemediationSteps.VENDOR_JUNIPER ->
          """|1. The chassis cluster synchronization feature automatically synchronizes the configuration from the primary node to the secondary node when the secondary joins the primary as a cluster.
             |2. Review the following article on Juniper tech support site: <a target="_blank" href="https://www.juniper.net/documentation/en_US/junos/topics/concept/chassis-cluster-backup-config-sync.html">Understanding Automatic Chassis Cluster Synchronization Between Primary and Secondary Nodes</a>""".stripMargin
      )
    )
  }
}
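
The condition in the rule above can be restated as a small Python sketch. This is only a paraphrase of the `And(...)` expression, not Indeni's rule engine; the function name is hypothetical, and treating a missing metric as "no alert" is an assumption, since the listing does not show how missing time series are handled.

```python
from typing import Optional


def standby_change_detected(
    config_unsaved: Optional[float],
    cluster_member_active: Optional[float],
    cluster_config_synced: Optional[float],
) -> bool:
    """Mirror the rule's And(...) over the last value of each metric."""
    values = (config_unsaved, cluster_member_active, cluster_config_synced)
    if any(v is None for v in values):
        # Assumption: a metric with no data does not raise the alert.
        return False
    return (
        config_unsaved == 1.0            # uncommitted changes exist
        and cluster_member_active == 0.0  # this member is not active
        and cluster_config_synced >= 0.0  # true for any reported 0/1 value
    )


if __name__ == "__main__":
    print(standby_change_detected(1.0, 0.0, 1.0))
    print(standby_change_detected(1.0, 1.0, 1.0))
```

Because `cluster-config-synced` is only ever 0 or 1 in the script, the third comparison appears to act as a check that the metric is present rather than as a test of the sync status.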


#2

This is somewhat irrelevant on a Palo Alto HA deployment. This should be deprecated for Palo Alto in my opinion.

Changes made on the secondary sync to the primary just as they do in the other direction, with a few exceptions, such as when multiple people are making changes and the primary holds a config lock. While not typical, there is no reason you couldn't work that way successfully. In practice, most changes would be made from Panorama anyway.

Quoted from the admin guide:
“To avoid configuration conflicts, always make configuration changes on the active (active/passive) or active-primary (active/active) peer and wait for the changes to sync to the peer before making any additional configuration changes.
Only committed configurations synchronize between HA peers. Any configuration in the commit queue at the time of an HA sync will not be synchronized.”

As a firewall admin, I would appreciate Indeni showing the config diff, which would add value to this alert should it be kept. Another thought: this alert clears after the commit, so it would only fire if a poll of the script ran between the change and the commit. A config change could therefore occur on the secondary without Indeni ever knowing.
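
To make the polling point concrete, here is a small Python sketch. It assumes polls happen at fixed multiples of the script's 15 minute interval, starting at a hypothetical minute 0, and that a poll landing exactly at the change time sees the change; real scheduling may differ.

```python
def poll_sees_uncommitted_change(
    change_min: int,
    commit_min: int,
    interval_min: int = 15,
    first_poll_min: int = 0,
) -> bool:
    """True if some poll falls in the window [change_min, commit_min)."""
    if commit_min <= change_min:
        return False
    # Number of intervals until the first poll at or after the change
    # (ceiling division, clamped so we never go before the first poll).
    k = max(0, -(-(change_min - first_poll_min) // interval_min))
    next_poll = first_poll_min + k * interval_min
    return next_poll < commit_min


if __name__ == "__main__":
    # Change at minute 2, commit at minute 10: the next poll is at 15.
    print(poll_sees_uncommitted_change(2, 10))
    # Change at minute 2, commit at minute 20: the poll at 15 sees it.
    print(poll_sees_uncommitted_change(2, 20))
```

Under these assumptions, any change that is committed before the next poll goes unnoticed.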

Also, several items must be configured locally on a standby member and are not synchronized with the primary. These items would trigger this alert until they are committed.
https://docs.paloaltonetworks.com/pan-os/7-1/pan-os-admin/high-availability/reference-ha-synchronization