Deployment of the Cisco Nexus vPC technology and Analysis by Indeni (Part 2)

Originally published at: https://indeni.com/deployment-of-the-cisco-nexus-vpc-technology-and-analysis-by-indeni-part-2/


This is the second article in the series on the Cisco Nexus virtual Port Channel (vPC) virtualization technology and Indeni.

In the first article we discussed the main components, features and several best practices of a Nexus vPC Data Center architecture. A pair of Nexus 5500 switches had been deployed at a lab environment and used for analysis by Indeni to get a dynamic report for the vPC technology and the activated features. Also, we described how easily an Administrator can get access to the vPC Indeni Analysis tabs. Finally, screen captures were provided to describe the NX-OS vPC capabilities offered by Indeni via the Live Config tab. You can easily view Part 1 of this process here.

This article describes the best practices and the critical vPC features that Cisco strongly recommends deploying in a Nexus Data Center environment. These features can improve the availability and increase the performance of a Cisco Nexus Data Center deployment. The Indeni IKE scripts, written by CCIE / IKE experts, collect the relevant real-time metrics from the Nexus switches so that Indeni can analyze the relevant NX-OS outputs. Based on the outcome of this analysis, Indeni triggers the vPC alerts in case of an issue. Each alert provides detailed information about the problem or the deviation from best practices, an alert severity level and detailed remediation steps. It should be noted that the Indeni database is updated daily with new metrics and alerts, not only for the Nexus vPC technology but also for other technologies and features such as FEX, BGP and OSPF.

Are you a Network Administrator, System Engineer, Software Engineer, Indeni Knowledge Expert (IKE), Indeni Software Developer or Tech Geek? If yes, then read on! This article is published for you!

Cisco Nexus vPC Failure Scenarios and Indeni

The Cisco Nexus lab environment from the previous article is used again, together with a pair of Nexus 9kv series switches. In brief, the following Cisco Nexus switches, NX-OS versions and Indeni software release have been used for testing:

| Hostname | Model | NX-OS Version/Release |
| --- | --- | --- |
| N5k-UP | N5K-C5548UP-SUP | 7.0(2)N1(1) |
| N5k-down | N5K-C5548UP-SUP | 7.0(7)N1(1) |
| N9k-A | Nexus9000 9000v Chassis | 7.0(3)I7(3) |
| N9k-B | Nexus9000 9000v Chassis | 7.0(3)I7(3) |
| Indeni | Indeni | Installed Indeni packages: indeni-collector 6.1.4.65, indeni-ds 6.1.4.65, indeni-server 6.1.4.65, indeni-triton 6.1.4.65, indeni-vigile 6.1.4.65, indeni-walt 6.1.4.65 |
 

It should be noted that each pair of N9k and N5k Nexus switches has been configured as vPC peers (two vPC Nexus clusters in total). The following scenarios are tested in the lab in order to see how the Indeni platform reacts to configuration changes, misconfigurations or failures. An overview of the vPC components and features involved is also provided to highlight their importance. The failure scenarios are listed below:

  • vPC Peer Keepalive Link Failure
  • vPC Peer Link Failure
  • vPC Peer Switch Failure
  • Dual Active

vPC Peer Keepalive Link failure

The Peer Keepalive Link provides Layer 3 connectivity and is used as a secondary check in order to determine if the remote peer is operating properly. In particular, it is used by the vPC switch in order to determine whether the peer link is not operational or the vPC peer switch is down.

It should be noted that no data or synchronization traffic is sent over the vPC Peer Keepalive Link. In particular, a vPC peer-keepalive packet is a User Datagram Protocol (UDP) message on port 3200 that is 96 bytes long, with a 32-byte payload. Keepalive messages can be easily captured and displayed using the onboard Wireshark Toolkit. This very helpful toolkit, built into the Nexus switches, will be discussed in an upcoming Nexus-Indeni article…that’s a promise :slight_smile:

Cisco recommends the following method of interconnections for the vPC keepalive link in the order of preference based on the Nexus switch model.

| Nexus 7000 & 9000 Series Switches | Nexus 5000 & 3000 Series Switches |
| --- | --- |
| 1. Dedicated link(s) (1GE LC) | 1. mgmt0 interface (along with management traffic) |
| 2. mgmt0 interface (along with management traffic) | 2. Dedicated link(s) (1/10GE front panel ports) |
| 3. As last resort, can be routed in-band over the L3 infrastructure | |

During a vPC peer keepalive link failure there is no impact on traffic flow, but Cisco recommends restoring the peer keepalive link as soon as possible to avoid a dual-active scenario. In particular, the vPC peer-keepalive link is used to detect a split-brain scenario (both vPC peer devices are active-active) when the vPC peer-link is down.

For this reason, Cisco also strongly recommends not configuring the vPC peer-keepalive link on top of the vPC peer-link; peer-keepalive messages must not be carried over the vPC peer-link, to avoid fate sharing in case the peer-link goes down.
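The order of preference above can be captured in a small lookup. The following Python snippet is only a sketch of that recommendation; the family keys, option strings and the `pick_keepalive_transport` helper are illustrative names, not part of NX-OS or Indeni.

```python
# Preference order for the peer-keepalive transport, taken from the
# recommendation above. Keys and option strings are illustrative labels.
KEEPALIVE_PREFERENCE = {
    "7000/9000": [
        "dedicated link (1GE LC)",
        "mgmt0 interface",
        "in-band over L3 infrastructure",  # last resort
    ],
    "5000/3000": [
        "mgmt0 interface",
        "dedicated link (1/10GE front panel)",
    ],
}


def pick_keepalive_transport(family, available):
    """Return the most preferred transport that is available, else None.

    The vPC peer-link is never listed as an option, so it can never be
    selected: keepalives must not share fate with the peer-link.
    """
    for option in KEEPALIVE_PREFERENCE[family]:
        if option in available:
            return option
    return None


if __name__ == "__main__":
    print(pick_keepalive_transport("5000/3000", {"mgmt0 interface"}))
```

Because the peer-link is deliberately absent from both lists, a design that offers only the peer-link as transport yields no valid choice.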

Let’s move to the Indeni lab and see how Indeni responds to a failure of the vPC keepalive link. The vPC peer keepalive link is removed from the N9k vPC peers. Indeni instantly reacts to this change, analyzes the relevant NX-OS outputs in the backend and triggers the following alert.

The network administrator or NOC is notified of this critical issue and detailed remediation steps are provided. The Administrator follows the remediation steps provided by Indeni, fixes the issue and updates the configuration to align it with Cisco best practices. As soon as the configuration change is complete, the Admin adds a note in Indeni in order to track this change and notify colleagues. This NOTE is highlighted with an arrow in the above screen capture.

vPC Peer Link failure

The vPC peer-link is the most important interconnection in the vPC architecture. In brief, the vPC peer-link is used to synchronize the state between vPC peer devices by using control packets. In addition, in a stable network the vPC peer-link carries multicast, broadcast and unknown unicast traffic, as well as traffic for orphan ports. It should be noted that this link is also critical because it carries packets for protocols such as the First Hop Redundancy Protocols (e.g. HSRP or VRRP), STP BPDUs and IGMP.

Cisco recommends the following guidelines for the vPC peer links:

✓ At least 10GE interfaces should be used for vPC peer links
✓ Point-to-point links should be preferred without other network devices between the vPC peers e.g. L2 switch
✓ The usage of at least two 10Gbps links spread between two line cards should be used to increase the availability
✓ Ports should be dedicated and not in oversubscribed modules.
✓ The vPC Peer-Link and the Peer Keepalive Link should be located on a different I/O module to increase the resilience in case of failure
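Some of these guidelines lend themselves to a simple automated check. The snippet below is a minimal Python sketch, assuming a hypothetical `Link` record with a speed and a module number; it does not read anything from a real switch and covers only the guidelines that can be expressed with those two fields.

```python
from dataclasses import dataclass


@dataclass
class Link:
    speed_gbps: int  # interface speed in Gbps
    module: int      # line card / I/O module number (hypothetical field)


def check_peer_link_design(peer_links, keepalive_module):
    """Return a list of guideline violations for a planned peer-link.

    keepalive_module is the module hosting the keepalive link, or None
    when the keepalive runs over mgmt0.
    """
    findings = []
    modules = {link.module for link in peer_links}
    if len(peer_links) < 2:
        findings.append("use at least two peer-link members")
    if any(link.speed_gbps < 10 for link in peer_links):
        findings.append("use 10GE or faster interfaces")
    if len(modules) < 2:
        findings.append("spread peer-link members across two line cards")
    if keepalive_module in modules:
        findings.append("place the keepalive link on a different I/O module")
    return findings


if __name__ == "__main__":
    print(check_peer_link_design([Link(10, 1), Link(10, 2)], None))  # []
```

An empty list means that none of the modelled guidelines is violated; oversubscription and point-to-point cabling are not modelled here.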

Let’s move forward and see how Indeni responds in case of vPC peer link failure. The Nexus 9k peer links are manually set to operational down status. As soon as the vPC peer link goes down, the vPC secondary switch shuts down all of its vPC member ports since it still receives keepalive messages from the vPC primary switch. The reception of keepalive messages indicates that the vPC primary switch is still alive. Finally, the vPC primary switch keeps all of its interfaces up. Indeni instantly reacts to this change, analyzes the relevant NX-OS outputs in the backend and triggers the following alert.

The network administrator is notified of this critical issue and detailed remediation steps are provided by Indeni. The Administrator follows the remediation steps and identifies that the problem occurred due to a vPC inconsistency. The problem has been resolved and a NOTE has been added for the Level 3 NOC mentioning that the issue was caused by a vPC inconsistency and is now resolved. This NOTE in Indeni is highlighted with an arrow in the above screen capture.

vPC Peer Switch Failure & Dual Active

A vPC peer switch failure means that all of the traffic will be handled by the other (active) vPC peer switch. This failure causes a service outage for all of the devices connected with a single link to the non-operational switch (orphan ports).

Finally, it should be noted that the Dual-Active or Split Brain vPC failure scenario occurs when the Peer Keepalive Link fails, followed by the Peer-Link. In this case both Nexus switches take the vPC primary role: the vPC primary switch remains primary and the vPC secondary switch becomes operational primary, causing severe network instability and outage.
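The four failure scenarios discussed so far can be summarized as a small decision function. This is a simplified Python sketch of the behaviour described in this article, seen from an outside observer who knows the true state of every component; the function name and return strings are illustrative, and the real NX-OS logic also depends on timers and on the order in which failures occur.

```python
def classify_vpc_state(peer_link_up, keepalive_up, peer_switch_up=True):
    """Map component states to the expected vPC behaviour (simplified)."""
    if not peer_switch_up:
        return ("peer switch failure: surviving peer carries all traffic, "
                "orphan ports on the failed switch lose service")
    if peer_link_up and keepalive_up:
        return "normal operation"
    if peer_link_up:
        # Only the keepalive link is down.
        return "no traffic impact: restore keepalive to avoid dual active"
    if keepalive_up:
        # Peer-link is down but the secondary still hears the primary.
        return "secondary suspends its vPC member ports, primary keeps forwarding"
    # Keepalive and peer-link are both down while both switches are alive.
    return "dual active: both peers act as primary"


if __name__ == "__main__":
    for state in [(True, True), (True, False), (False, True), (False, False)]:
        print(state, "->", classify_vpc_state(*state))
```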

Indeni reacts instantly to such failure scenarios by triggering the vPC peer link and vPC keepalive link failure alerts along with detailed remediation steps.

vPC Auto-Recovery Feature

The vPC auto-recovery feature provides two important enhancements to vPC. The first is a fallback mechanism in case of a vPC peer-link failure followed by a vPC primary peer device failure. The second handles the specific case where both vPC peer devices reload but only one comes back to life (vPC auto-recovery reload-delay feature). In particular, the vPC auto-recovery reload-delay feature allows the only surviving vPC peer device to assume the vPC primary role and bring up all local vPC ports after the delay timer expires. The delay can be tuned from 240 seconds to 3600 seconds.

According to the Cisco vPC best practices guide, vPC auto-recovery should be enabled on both vPC peer devices. In addition, Cisco recommends always enabling vPC auto-recovery reload-delay on both vPC peer devices.
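As a small illustration, the Python sketch below validates a reload-delay value against the 240 to 3600 second range mentioned above and renders the configuration lines to apply on each peer. The helper name and the domain ID are hypothetical, and the exact command syntax may differ between platforms and NX-OS releases, so the output should be checked against the configuration guide for the switch in question.

```python
RELOAD_DELAY_MIN = 240   # seconds, lower bound stated in this article
RELOAD_DELAY_MAX = 3600  # seconds, upper bound stated in this article


def auto_recovery_config(domain_id, reload_delay):
    """Return config lines enabling auto-recovery with a reload delay."""
    if not RELOAD_DELAY_MIN <= reload_delay <= RELOAD_DELAY_MAX:
        raise ValueError(
            f"reload delay must be between {RELOAD_DELAY_MIN} "
            f"and {RELOAD_DELAY_MAX} seconds, got {reload_delay}"
        )
    return [
        f"vpc domain {domain_id}",
        f"  auto-recovery reload-delay {reload_delay}",
    ]


if __name__ == "__main__":
    # Domain ID 10 is an arbitrary example value.
    print("\n".join(auto_recovery_config(10, 300)))
```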

Let’s move to our lab environment and the N9k configured with a basic vPC setup. After Indeni analyzes the NX-OS outputs, the Network Administrator gets a Warning message stating that the vPC auto-recovery and reload-delay features are not enabled. The Network Administrator uses the Indeni Note to inform his Manager that the vPC configuration is not aligned with the Cisco best practices.

vPC graceful Type-1 consistency check

vPC member ports on both vPC peer switches should have identical parameters such as STP, MTU and LACP mode. A mismatch in any of these parameters is called a Type-1 inconsistency. When such an inconsistency occurs, all VLANs on both vPC member ports are brought down, causing a service outage.

Cisco NX-OS version 5.2 introduced the graceful consistency check feature to soften the vPC system reaction when a Type-1 inconsistency is detected. In particular, in case of a Type-1 inconsistency in the global configuration, only the vPC member ports on the secondary peer device are set to the down state, and service can still be provided via the operational primary vPC peer switch.

The following table summarizes the global configuration parameters that are taken into account for the Type-1 consistency check.

| Type-1 consistency check | Details |
| --- | --- |
| Spanning Tree Protocol (STP) mode | RPVST (Rapid Per-VLAN Spanning Tree) or MST (Multiple Spanning Tree) |
| STP enable/disable state per VLAN | Yes or No |
| STP region configuration for Multiple Spanning Tree (MST) | Region name, region revision, region instance to VLAN mapping |
| STP global settings | Bridge Assurance settings, Port type settings, Loop Guard settings, BPDU filter settings, MST Simulate PVST enabled or disabled |

Cisco highly recommends enabling vPC graceful Type-1 checks on both vPC peer devices. Indeni performs this check by analyzing the relevant NX-OS outputs and triggers the following alert:

Note that the provided remediation steps give the relevant commands to check whether this feature is enabled and to configure it, along with a reference to the official Cisco best practices guide.
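The Type-1 consistency check described in this section can be illustrated with a short Python sketch. It compares the global parameters of two peers and reports which peers would have their vPC member ports brought down, with and without the graceful check. The dictionary keys and values are hypothetical stand-ins for the parameters listed above, not actual NX-OS output.

```python
def type1_mismatches(local, peer):
    """Return the sorted names of parameters that differ between peers."""
    names = sorted(set(local) | set(peer))
    return [name for name in names if local.get(name) != peer.get(name)]


def peers_with_ports_down(mismatches, graceful):
    """Return which peers bring down vPC member ports on a mismatch.

    With the graceful check only the secondary is affected; without it,
    member ports on both peers go down.
    """
    if not mismatches:
        return []
    return ["secondary"] if graceful else ["primary", "secondary"]


if __name__ == "__main__":
    # Illustrative global parameters for two vPC peers.
    primary = {"stp_mode": "rapid-pvst", "bridge_assurance": True}
    secondary = {"stp_mode": "mst", "bridge_assurance": True}
    diff = type1_mismatches(primary, secondary)
    print(diff, peers_with_ports_down(diff, graceful=True))
```

In this example the two peers disagree on the STP mode, so with the graceful check enabled only the secondary peer stops forwarding on its vPC member ports.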

vPC peer gateway

vPC Peer-Gateway allows a vPC peer Nexus switch to act as the active gateway for packets addressed to the MAC address of the other peer Nexus switch. It optimizes packet routing by keeping the forwarding of traffic local to the vPC peer device and avoids the usage of the peer-link. There is no traffic impact when the NX-OS peer gateway feature is activated. This feature is essential to eliminate interoperability issues with some network-attached storage (NAS) or load-balancer devices that do not perform a typical default gateway ARP request at boot up.

The official Cisco design guide recommends always enabling vPC peer-gateway in the vPC domain, even if there is no end host (e.g. SLB or NAS) that requires this feature.
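The forwarding decision that peer-gateway changes can be modelled in a few lines. The Python sketch below is a conceptual illustration only, with made-up MAC addresses and a hypothetical function name; it is not how NX-OS implements the feature internally.

```python
def handle_frame(dst_mac, local_mac, peer_mac, peer_gateway_enabled):
    """Decide what a vPC peer does with a frame sent to a router MAC."""
    if dst_mac == local_mac:
        return "route locally"
    if dst_mac == peer_mac:
        # Without peer-gateway the frame must cross the peer-link so the
        # owner of the MAC address can route it.
        if peer_gateway_enabled:
            return "route locally"
        return "bridge over peer-link"
    return "bridge normally"


if __name__ == "__main__":
    # Illustrative MAC addresses for the two vPC peers.
    local, peer = "00:00:5e:00:53:01", "00:00:5e:00:53:02"
    print(handle_frame(peer, local, peer, peer_gateway_enabled=False))
    print(handle_frame(peer, local, peer, peer_gateway_enabled=True))
```

With the feature enabled, a frame that a NAS or load balancer sends to the peer's MAC address is routed by whichever switch receives it, which keeps that traffic off the peer-link.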

Indeni analyzes the NX-OS command output collected from the N9k installed in the lab and triggers an alert about the vPC peer gateway. The provided remediation steps inform the Administrator how to check if this feature is enabled, how to configure it and finally provide a reference to the official Cisco best practices guide.

vPC Features & Status Overview

Indeni provides an overview for the status of all of the main vPC features. It places them in a category within the live config menu named “vPC Information”. The following screen capture provides the relevant output for the vPC N5k peer switches deployed at the Lab.

Summary

This article describes the main vPC NX-OS features which under specific conditions could highly improve the stability and availability of a Data Center network. N9k and N5k vPC peer switches have been deployed at the lab and Indeni analyzes the Nexus switches. In particular, the Indeni platform continuously analyzes the relevant NX-OS outputs and triggers the relevant alerts with detailed remediation steps on how to check the status of the feature, how to apply it and finally a reference to the relevant Cisco configuration guide or best practices design guide.

This article is the second part discussing the cool features and powerful capabilities that Indeni offers for the NX-OS vPC technology. More articles about Indeni and Cisco Nexus technology are coming soon. Stay tuned :relaxed:

Learn more by joining the Indeni Community. Thank you to Vasileios Bouloukos for his contributions to this article.