Some rule tuning


Hi everyone,


We ran an alert tuning exercise for one of our customers today. Listening in, I picked up a few potential rule changes I'd like to float here and get some feedback on. The idea is to discuss changes to the rules themselves that will impact all customers, hence the desire for feedback. If some of these don't apply to all customers and are just a matter of tuning that customer's specific system, that's fine and we can ignore those here. We just want to avoid tuning the same thing again and again across customers.


Keep in mind - CRITICAL and ERROR alerts are immediately emailed to the user, whereas INFO and WARNING are not. CRITICAL is generally defined as "Device is down, or a current service disruption", ERROR means "You want to handle this over the next few hours", WARNING means "You'd want to clean this up as part of a project" and INFO means "Just for your knowledge, up to you what to do with it."

Log related alerts (chkp_fw_log_increase_rate_high, cross_vendor_log_servers_not_communicating, panw_logs_discarded) - the customer said that they only care if logs aren't being sent for more than a few hours. The reason is that this alert can happen often during maintenance of other systems.
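A minimal sketch of what "only alert after a few hours" could look like as a persistence check (Python pseudocode, not the actual rule engine; the function name and the 3-hour threshold are illustrative):

```python
from datetime import datetime, timedelta

def should_alert(failure_start, now, min_duration=timedelta(hours=3)):
    """Alert only if the log-forwarding failure has persisted longer
    than min_duration (illustrative threshold, per the customer's ask)."""
    if failure_start is None:          # logs are flowing normally
        return False
    return now - failure_start >= min_duration

t0 = datetime(2017, 1, 1, 12, 0)
# A failure that began 30 minutes ago stays quiet...
print(should_alert(t0, t0 + timedelta(minutes=30)))   # False
# ...but one persisting for four hours fires.
print(should_alert(t0, t0 + timedelta(hours=4)))      # True
```

This same debounce pattern would also cover short disruptions caused by maintenance of other systems.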


SNMPv2c/v1 used (cross_vendor_snmp_v2) - this should be more of a warning than an error. In reality, many organizations still use SNMPv2.

License expired alert (cross_vendor_license_has_expired), Contract(s) have expired (cross_vendor_contract_expiration) and Certificate expired (cross_vendor_certificate_expiration) - if the license already expired it's probably not of interest. Reduce to warning severity. The ones that are about to expire - keep at ERROR. (this may require duplicating alerts - for "about to expire" vs "expired")
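If we do split into "about to expire" vs "expired" rules, the severity mapping could look like this (sketch only; the 30-day lead time is an assumption, not something the rules define today):

```python
def expiry_severity(days_until_expiry):
    """Hypothetical split: expiring soon stays ERROR (emailed),
    already-expired drops to WARNING (project-level cleanup)."""
    if days_until_expiry < 0:
        return "WARNING"   # already expired: probably known or unused
    if days_until_expiry <= 30:
        return "ERROR"     # about to expire: act within hours/days
    return None            # no alert yet

print(expiry_severity(-10))  # WARNING
print(expiry_severity(7))    # ERROR
```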

High disk space utilization (high_mountpoint_space_use_by_device) - current threshold is at 80%. Many disks hover at 90% even, with an auto-cleaning mechanism (like PANW). Change the alert to be ERROR at 95%, and duplicate to another rule where it's warning as of 80%.
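The proposed two-tier threshold, as a sketch (the 80%/95% values are the ones suggested above; how the engine actually expresses duplicated rules may differ):

```python
def disk_severity(pct_used, warn_at=80, error_at=95):
    """Two-tier mapping: WARNING from 80%, ERROR from 95%."""
    if pct_used >= error_at:
        return "ERROR"
    if pct_used >= warn_at:
        return "WARNING"
    return None

print(disk_severity(90))  # WARNING (auto-cleaning platforms hover here)
print(disk_severity(97))  # ERROR
```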

NIC going down and triggering a cluster failover (especially true for CHKP) - right now we issue multiple alerts: Network port down (cross_vendor_network_port_down), Required interface(s) down (clusterxl_insufficient_nics_novsx) and Cluster member no longer active (cross_vendor_cluster_member_no_longer_active). How do we reduce to one within the current capabilities of the rule engine?

PANW: Application package not up to date (panw_app_lag_check) and Update schedule set to download only (panw_update_action_download_only) - this may need to be reduced to WARN. In some environments this may be a serious error, but in most it is probably on purpose. WDYT?

DNS lookup failure(s) (all_devices_dns_failure) - when do people actually care about this? Should this be WARNING (not emailed) instead of ERROR?


Concurrent connection limit nearing (concurrent_connection_limit_novsx and concurrent_connection_limit_vsx) - currently set at 80%. The customer said he doesn't care unless it gets closer to 100%. We can set a WARN at 80% and CRITICAL at 95%.

CHKP Only: Certificate authority not accessible (check_point_ca_not_accessible) - due to maintenance work that can happen often, the customer would like to see this only if it's happening for an hour or two straight.

PANW Only: Work Queue Entries dataplane pool utilization high (panw_dataplane_work_queue_entires.scala) - should this be CRITICAL? @Brad


Communication between management server and specific devices not working (cross_vendor_connection_from_mgmt_to_device) - the customer considered this critical, but it may occur as part of normal maintenance. Should we wait 1hr before alerting?


Interface nearing maximum Rx throughput (cross_vendor_interface_rx_utilization) and Interface nearing maximum Tx throughput (cross_vendor_interface_tx_utilization) - the customer doesn't consider this an ERROR because it happens often in some remote offices. They would want to be WARNed and handle it over time.

-----------------------


Would love some feedback too from you guys:

Regarding the update alert: I REALLY like the idea that this is alerted on. However, I think I mentioned in the past, to a previous SE on a demo I had, that this is too sensitive because of the frequency of failure... as a group of devices, if one is not updated it is very likely a warning indicator but not critical. It should at least be tunable to some extent when you know Palo Alto is having problems with their servers in your particular region, which was the case for me. That said, could we instead have an auto-remediation script that kicks off to retry the update and says "hey, this didn't work as scheduled, indeni fixed it"? There would be a few minor cautions with this, but we could view how they have the update configured and request it be re-run with the same criteria.


Thoughts on that?


Yoni- regarding "PANW Only: Work Queue Entries dataplane pool utilization high": I'd need to learn more about this message as I am not familiar with it. Current action/check and customer feedback on it? Thanks!!


Very interesting post.... Here are my thoughts (purely Check Point based as that's what we use Indeni for)

Log related alerts - we get a lot of these either when we push policy to a busy firewall or if the management server is being rebooted - maybe a delay to not alert before 30 minutes has passed? If it's still growing, then alert....

SNMPv2/v1 - just turn off the alerts for any devices that still use this... Everyone should be using v3 these days, otherwise all your data and community strings are in clear text!


Licenses - I would leave as-is - when you add a new device you want to know if something is expired. It is better for the user to actually acknowledge that they know about the issue and want to ignore it than to not be aware of it at all.


High disk space - would be great to be user-adjustable - 80% is a bit low in most cases, maybe raise to 85%, but still a CRITICAL at 90% for me


DNS lookups - unless continually failing, these alerts are just noise for me - can be reduced/removed unless not working for over 24 hours


Concurrent connections - I like to know well in advance - i.e. current levels, UNLESS another alert like 'aggressive aging' can cover this. I know not everyone has that turned on, so I'd say leave as-is for Check Point


CA not accessible - only if down for multiple hours/a day


Comms between mgmt server and gateway - a maintenance mode would be useful here


Throughput issues - for some it is 'known' and can be ignored; however, it's still important to know it's happening on a gateway that you don't expect to have high utilisation. Could it also be time-based? I.e. to cover backups overnight but still alert during the day
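Peter's time-based idea could be a simple quiet-window check (sketch only; the 22:00-06:00 backup window and function name are made up for illustration):

```python
from datetime import time

def in_quiet_window(now_time, start=time(22, 0), end=time(6, 0)):
    """Suppress throughput alerts during a nightly backup window.
    The 22:00-06:00 window is purely illustrative."""
    if start <= end:
        return start <= now_time < end
    # Window wraps past midnight
    return now_time >= start or now_time < end

print(in_quiet_window(time(23, 30)))  # True  -> suppress
print(in_quiet_window(time(10, 0)))   # False -> alert normally
```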


that's my thoughts

thanks

Peter


It sounds like “customizable thresholds” would be a good feature to add. Have an easy way that customers can change the threshold of an alert without needing to know how to modify rules. Various customers with different business models/cases would very well need different thresholds. Also, if an Indeni customer is providing SLAs to their customers, they could want different thresholds depending on the SLA that their customer purchased.
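Customizable thresholds could boil down to a per-customer override with a shipped default. A rough sketch (the rule name is real; the override store is hypothetical):

```python
# Shipped defaults per rule (values from this thread)
DEFAULTS = {"high_mountpoint_space_use_by_device": 80}

def threshold_for(rule, customer_overrides):
    """Look up a customer-specific threshold, falling back to the
    shipped default when no override exists."""
    return customer_overrides.get(rule, DEFAULTS[rule])

print(threshold_for("high_mountpoint_space_use_by_device", {}))    # 80
print(threshold_for("high_mountpoint_space_use_by_device",
                    {"high_mountpoint_space_use_by_device": 90}))  # 90
```

An SLA-aware variant would just key the override store by (customer, SLA tier) instead of customer alone.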

Very good input! Here are my views


Log related alerts
I don't find "this could alert during maintenance" to be a valid reason. As always, when you do maintenance you can trigger a lot of alerts, since things that should work temporarily stop working. The solution here is probably a way to suspend all indeni alerting before doing the maintenance. But we could add a delay of a few minutes to at least avoid very temporary disruptions.

SNMPv2c/v1
I agree that this should be warning only. There is no immediate critical negative effect of this.


License expired alert
I can agree to downgrade when license has expired.

Concurrent connection limit nearing
Sounds good with two limits

High disk space utilization
Sounds good to have two limits with different severity.
This should also be user configurable.


NIC going down and triggering a cluster failover
This is a little tricky as the alerts convey different things.
Network port down - Triggers if we get a link down. The interface might not be monitored by ClusterXL, but a link down can still be interesting. Maybe decrease severity to warning?
Required interface(s) down - If a monitored interface is down. This is important. This may or may not trigger a cluster failover depending if the node is already Standby or not.
Cluster member no longer active - Cluster has failed over

So each of the three alerts has its usage. But if an interface goes down, they could all trigger at the same time. If, for example, a monitored interface goes down, then "Network port down" might not be useful to show. But I don't know how that would be done.
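One way to express "hide the generic alert when a more specific one fires at the same time" is a suppression map (sketch only; the current rule engine may not support cross-rule suppression, which is exactly the open question):

```python
# More specific alerts suppress more generic ones.
SUPPRESSES = {
    "cross_vendor_cluster_member_no_longer_active": {
        "clusterxl_insufficient_nics_novsx",
        "cross_vendor_network_port_down",
    },
    "clusterxl_insufficient_nics_novsx": {"cross_vendor_network_port_down"},
}

def visible_alerts(fired):
    """Drop any alert covered by a simultaneously-fired, more specific alert."""
    suppressed = set()
    for alert in fired:
        suppressed |= SUPPRESSES.get(alert, set())
    return [a for a in fired if a not in suppressed]

fired = ["cross_vendor_network_port_down", "clusterxl_insufficient_nics_novsx"]
print(visible_alerts(fired))  # only the monitored-interface alert remains
```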


DNS lookup failure
This is really using the monitored device to monitor something else. If it happens only for a short time, it does not affect the monitored device in any way, but it might be important for the rest of the environment. If we want it to alert only when it affects the device itself, we could alert only after the status has been in a fail state for 1 hour. Longer than that and it might affect the device's capability to download updates for AV etc.

Certificate authority not accessible
I again do not like the reasoning that it can happen during maintenance and thus we should suppress it. In this specific case, however, I do not see any immediate negative effect if this is down for 1 hour, even though it should indicate that something is wrong. The root cause for why this is not working could cause other unknown issues.

Interface nearing maximum Rx/Tx throughput
I would say this shouldn't happen often :) If it does, they can change it to WARN in their environment.