Hi everyone,
We ran an alert tuning exercise for one of our customers today. Listening in, I picked up a few potential rule changes I'd like to float here and get some feedback on. The idea is to discuss changes to the rules themselves, since those will impact all customers, hence the desire for feedback. If some of these don't apply to all customers and are just a matter of tuning that customer's specific system, that's fine and we can ignore those here. We just want to avoid tuning the same thing again and again across customers.
Keep in mind: CRITICAL and ERROR alerts are immediately emailed to the user, whereas INFO and WARNING are not. CRITICAL is generally defined as "Device is down, or there is a current service disruption", ERROR means "You want to handle this over the next few hours", WARNING means "You'd want to clean this up as part of a project", and INFO means "Just for your knowledge; up to you what to do with it."
Log-related alerts (chkp_fw_log_increase_rate_high, cross_vendor_log_servers_not_communicating, panw_logs_discarded) - the customer said they only care if logs haven't been sent for more than a few hours, because these alerts often fire during maintenance of other systems.
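A rough sketch of the grace-period idea (not our actual rule code; the 3-hour value and the shouldAlert/firstSeenDown names are placeholders I made up):

```scala
import java.time.{Duration, Instant}

// Sketch: only raise the "logs not being sent" alert once the condition has
// persisted past a grace period, so short gaps during maintenance of other
// systems never trigger an email.
object LogDeliveryGrace {
  // "more than a few hours" - the exact value here is an assumption
  val gracePeriod: Duration = Duration.ofHours(3)

  // firstSeenDown: when we first observed logs not being delivered; None while healthy
  def shouldAlert(firstSeenDown: Option[Instant], now: Instant): Boolean =
    firstSeenDown.exists(since => Duration.between(since, now).compareTo(gracePeriod) >= 0)
}
```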
SNMPv2c/v1 used (cross_vendor_snmp_v2) - this should be a WARNING rather than an ERROR. In reality, many organizations still use SNMPv2c or even v1.
License expired (cross_vendor_license_has_expired), Contract(s) have expired (cross_vendor_contract_expiration) and Certificate expired (cross_vendor_certificate_expiration) - if the license/contract/certificate has already expired, it's probably not of immediate interest, so reduce to WARNING. The ones that are about to expire should stay at ERROR. (This may require duplicating the alerts into "about to expire" vs. "expired" variants.)
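Roughly how the split could look (a sketch only, not the real rule definitions; the 30-day "about to expire" window is an assumption):

```scala
import java.time.{Duration, Instant}

// Sketch: one check, two severities - expiring soon stays ERROR (emailed),
// already expired drops to WARNING.
object ExpirationSeverity {
  sealed trait Severity
  case object Warning extends Severity
  case object Error   extends Severity

  // Assumed "about to expire" window; use whatever the current rules look ahead by
  val warnWindow: Duration = Duration.ofDays(30)

  def severity(expiresAt: Instant, now: Instant): Option[Severity] =
    if (now.isAfter(expiresAt)) Some(Warning)                                          // already expired
    else if (Duration.between(now, expiresAt).compareTo(warnWindow) <= 0) Some(Error)  // about to expire
    else None                                                                          // nothing to report
}
```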
High disk space utilization (high_mountpoint_space_use_by_device) - the current threshold is 80%. Many disks hover at 90% or even higher and rely on an auto-cleaning mechanism (like PANW does). Change the alert to ERROR at 95%, and duplicate it into another rule that raises a WARNING from 80%.
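The two-tier mapping would look something like this (sketch only; the thresholds are the ones proposed above):

```scala
// Sketch: WARNING from 80% used, ERROR from 95% - effectively two rules over one check
object DiskSpaceThresholds {
  sealed trait Severity
  case object Warning extends Severity
  case object Error   extends Severity

  def severity(usedPercent: Double): Option[Severity] =
    if (usedPercent >= 95.0) Some(Error)
    else if (usedPercent >= 80.0) Some(Warning)
    else None
}
```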
NIC going down and triggering a cluster failover (especially true for CHKP, among others) - right now we issue multiple alerts: Network port down (cross_vendor_network_port_down), Required interface(s) down (clusterxl_insufficient_nics_novsx) and Cluster member no longer active (cross_vendor_cluster_member_no_longer_active). How do we reduce this to a single alert within the current capabilities of the rule engine?
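One option, if the engine can look at other recent alerts for the same device, is to suppress the two cluster alerts whenever a matching port-down event exists, so only the root cause is emailed. This is purely a sketch of the idea; the Event/suppress shape and the 10-minute window are assumptions, not existing engine capabilities:

```scala
import java.time.{Duration, Instant}

// Sketch: treat the cluster alerts as consequences of "network port down" on the
// same device, and suppress them if the root-cause event is nearby in time.
object FailoverCorrelation {
  val window: Duration = Duration.ofMinutes(10) // assumed correlation window

  case class Event(ruleId: String, deviceId: String, at: Instant)

  private val downstreamRules = Set(
    "clusterxl_insufficient_nics_novsx",
    "cross_vendor_cluster_member_no_longer_active")

  def suppress(candidate: Event, recent: Seq[Event]): Boolean =
    downstreamRules.contains(candidate.ruleId) &&
      recent.exists(e =>
        e.ruleId == "cross_vendor_network_port_down" &&
        e.deviceId == candidate.deviceId &&
        Duration.between(e.at, candidate.at).abs.compareTo(window) <= 0)
}
```

If the engine can't correlate across rules like that today, the fallback is probably to keep one of the three at ERROR/CRITICAL and downgrade the other two to WARNING so only one email goes out.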
PANW: Application package not up to date (panw_app_lag_check) and Update schedule set to download only (panw_update_action_download_only) - these may need to be reduced to WARNING. In some environments this may be a serious error, but in most it is probably deliberate. WDYT?
DNS lookup failure(s) (all_devices_dns_failure) - when do people actually care about this? Should this be WARNING (not emailed) instead of ERROR?
Concurrent connection limit nearing (concurrent_connection_limit_novsx and concurrent_connection_limit_vsx) - currently set at 80%. The customer said they don't care unless it gets closer to 100%. We can set a WARNING at 80% and a CRITICAL at 95% (same two-tier pattern as the disk-space sketch above).
CHKP Only: Certificate authority not accessible (check_point_ca_not_accessible) - since maintenance work can happen fairly often, the customer would only like to see this if it has been going on for an hour or two straight (the same grace-period approach sketched for the log alerts above).
PANW Only: Work Queue Entries dataplane pool utilization high (panw_dataplane_work_queue_entires.scala) - should this be CRITICAL? @Brad
Communication between management server and specific devices not working (cross_vendor_connection_from_mgmt_to_device) - the customer considered this critical, but it may occur as part of normal maintenance. Should we wait an hour before alerting?
Interface nearing maximum Rx throughput (cross_vendor_interface_rx_utilization) and Interface nearing maximum Tx throughput (cross_vendor_interface_tx_utilization) - the customer doesn't consider this an ERROR because it happens often in some remote offices. They would rather get a WARNING and handle it over time.
-----------------------
Would love some feedback from you guys on these: