VyOS and Redundancy – Part 2

In Part 1 of my exploration of router redundancy, I mentioned several issues that I discovered with setting up VRRP with VyOS.

  1. My impetus to walk down the path of redundant routers started because I wanted to standardize the configurations. One router had individual interfaces (one eth# interface per VLAN) and the other had a VLAN trunk. I wanted to avoid taking the router VMs down every time I updated the VLANs they routed for. That meant setting up VLAN trunks to both devices.
  2. The nature of VRRP in RFC3768 Compatibility mode has special significance on a hypervisor. These issues are not unique to redundant routers, but do require special configuration of the virtual machine in order to convince the hypervisor to let us be redundant. The solution to this was surprisingly simple, though I struggled to get this working for a fairly long time.
  3. My primary router will be handling connections for other network devices, such as servers and client workstations, at the moment it has to fail over. Sometimes this isn’t an issue. When the router is performing network address translation, however, the standby router must know what the primary was doing in order to fail over seamlessly.
  4. My particular setup has a flaw in that the router’s unique address on each interface and the virtual address have the exact same metric. This leads connections either to choose at random between the two equal interfaces, or to default to the unique interface instead of the virtual one. Even if the problems in #3 are solved, it won’t mean anything, as the failover router will not be able to pass traffic for an address it does not own.

Standardizing Configurations

Hyper-V virtual machine network adapters are configured in Access mode by default. This means that each adapter carries a single VLAN: untagged traffic if nothing is configured, or a specific VLAN ID as set in the VM’s settings. This is the only mode you can configure from the Hyper-V user interface, and it suits most VMs that will ever be made. PowerShell is required to put a VM network adapter into another mode, such as Trunk mode. The Get-VMNetworkAdapterVlan cmdlet shows an adapter’s current VLAN settings, and Set-VMNetworkAdapterVlan changes them.

The below example shows how I’ve configured my network adapter for one of my redundant routers.

PS C:\> Get-VM -Name RouterB | Get-VMNetworkAdapterVlan | Format-List

VMName : RouterB
AdapterName : Network Adapter
OperationMode : Trunk
NativeVlanId : 0
AllowedVlanIdList : 2,20,101,201,239-240,250

The NativeVlanId has only ever worked for me when set to 0, with what will be the untagged VLAN included in the list of allowed VLAN IDs. I then configured the switch itself to place untagged packets on the intended VLAN. These values can be set with the Set-VMNetworkAdapterVlan command as shown below. One gotcha is that the -AllowedVlanIdList parameter expects a string, not an array of digits. Without the quotes, PowerShell turns the list into an array, so we’ll need either single or double quotes to ensure PowerShell treats it as a string.

PS C:\> Set-VMNetworkAdapterVlan -VMName RouterB -Trunk -AllowedVlanIdList '2,20,101,201,239-240,250' -NativeVlanId 0
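The string passed to -AllowedVlanIdList mixes single IDs with ranges such as 239-240. As a rough illustration of how that notation reads, here is a minimal Python sketch that expands such a string into individual VLAN IDs. The function name expand_vlan_list is my own invention for this example and is not part of Hyper-V or PowerShell, and the 1-4094 bounds check is an assumption about which IDs are usable.

```python
def expand_vlan_list(spec: str) -> list[int]:
    """Expand a string such as '2,20,239-240' into individual VLAN IDs."""
    vlan_ids: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            # A range such as '239-240' is inclusive at both ends.
            first, last = (int(x) for x in part.split("-", 1))
            if first > last:
                raise ValueError(f"descending range: {part!r}")
            vlan_ids.extend(range(first, last + 1))
        else:
            vlan_ids.append(int(part))
    for vlan_id in vlan_ids:
        # Assumption: only IDs 1-4094 are usable on a trunk.
        if not 1 <= vlan_id <= 4094:
            raise ValueError(f"VLAN ID out of range: {vlan_id}")
    return vlan_ids


print(expand_vlan_list("2,20,101,201,239-240,250"))
# prints [2, 20, 101, 201, 239, 240, 250]
```

A helper like this is only useful for sanity-checking a list before handing the original string to PowerShell; the cmdlet itself still wants the quoted string form.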

That takes care of the configuration at the hypervisor level. Logging into RouterB, I am able to configure the interfaces through the use of virtual interfaces. VyOS happily treats this as a request to perform 802.1Q VLAN tagging.

set interfaces ethernet eth0 address '192.168.101.8/24'
set interfaces ethernet eth0 description 'Physical Network (Internal)'
set interfaces ethernet eth0 vif 2 address '10.1.10.4/24'
set interfaces ethernet eth0 vif 2 description 'External Network'

And so on for each VLAN desired. The default eth0 interface corresponds to the untagged interface, so address and wire accordingly. Doing things this way allows me to refer to these virtual interfaces consistently on both routers, in the format <baseinterface>.<vlanid> or, as with the example above, eth0.2.

Configuring for RFC3768 Compatibility

RFC3768 Compatibility isn’t strictly required when dealing with redundant routers, but it does make transitions a lot less problematic. You can view the actual RFC at the IETF page for RFC3768 (Section 7.3 specifically) or head over to the VRRP Wikipedia page and note the opening lines in the Implementation section for a broader overview.

Simply put, this compatibility mode standardizes the MAC address that is used: a virtual hardware address is constructed from the VRRP group number. Because the virtual IP address keeps the same MAC address no matter which router currently holds it, client devices lose far less time on a failover. They do not first have to realize that the hardware device they had been communicating with is gone, perform an ARP query to locate the virtual IP address again, and only then resume communication. In other words, it is very useful in crafting a seamless transition from one router device to another.
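To make the construction concrete: RFC 3768 Section 7.3 defines the virtual router MAC address as 00-00-5E-00-01-{VRID}, where the last octet is the VRRP group number (VRID) in hex. The Python sketch below simply formats that address. The function name is made up for this example, and the group number 10 is a hypothetical value, not one taken from my configuration.

```python
def vrrp_virtual_mac(vrid: int) -> str:
    """Return the RFC 3768 virtual router MAC address for a VRRP group."""
    # The VRID occupies a single octet; 0 is not a valid group number.
    if not 1 <= vrid <= 255:
        raise ValueError(f"VRID must be between 1 and 255, got {vrid}")
    # Fixed prefix 00:00:5e:00:01 followed by the VRID as two hex digits.
    return f"00:00:5e:00:01:{vrid:02x}"


# Hypothetical group number, purely for illustration.
print(vrrp_virtual_mac(10))
# prints 00:00:5e:00:01:0a
```

Knowing the expected address makes it easier to spot the virtual MAC in an ARP table or packet capture.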

It is easy enough to implement VRRP using hardware devices, as most network cards don’t mind what MAC address you tell them to use. Modern hypervisors are a bit different in that they offer protections that keep a badly behaving VM from acting outside the scope it is configured for. Spoofing a MAC address, which the hypervisor normally assigns to each of the VM’s interfaces, is not among the list of normally allowed behaviors. This means we need to perform some additional configuration on our VMs in order to get this working. Without it, the two VMs will be unable to find each other through the multicast group address designated for VRRP and will be unable to join into a group.

Configuring this in Hyper-V is relatively simple, using the Enable MAC address spoofing option underneath the Advanced Features section of the trunked adapter.

I had done this and tried to get the two routers to group together a number of times, yet they refused. Each time, both came up and proclaimed themselves master. Packet traces did not lead me to any good conclusion either, as I seemed to be missing packets in one direction or another. This led me down a rabbit hole: did Hyper-V refuse to pass multicast traffic over its virtual switch, was my trunk configuration correct, was the configuration on the switch side correct, did I need IGMP proxying turned on? Eventually, I stumbled across a glitch that had me wondering more and more whether I was off base in assuming it was a design issue. After simply rebooting one of the hypervisors, both routers finally came online and began chatting away, one as Master and one as Backup. Quick lesson for those reading: if things look weird, try the simple solutions first. Only if it continues to misbehave should you start going on that goose chase.
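When staring at packet traces like these, it helps to know what a VRRP advertisement should contain. VRRP advertisements are sent to the multicast address 224.0.0.18 using IP protocol 112. The Python sketch below decodes the VRRPv2 header layout from RFC 3768 Section 5.1, assuming the routers are sending version 2 advertisements. The function name and the sample bytes (group 10, priority 100, address 10.10.10.2) are made up for illustration, and the checksum is not verified.

```python
import ipaddress
import struct


def parse_vrrp_v2(payload: bytes) -> dict:
    """Decode the fixed header and address list of a VRRPv2 advertisement."""
    if len(payload) < 8:
        raise ValueError("payload too short for a VRRP header")
    # Version/type, VRID, priority, address count, auth type,
    # advertisement interval (one byte each), then a 16-bit checksum.
    ver_type, vrid, priority, count, auth_type, adver_int, checksum = struct.unpack(
        "!BBBBBBH", payload[:8]
    )
    version, msg_type = ver_type >> 4, ver_type & 0x0F
    if version != 2:
        raise ValueError(f"not a VRRPv2 packet (version {version})")
    end = 8 + 4 * count
    if len(payload) < end:
        raise ValueError("payload truncated inside the address list")
    addresses = [
        str(ipaddress.IPv4Address(payload[i:i + 4])) for i in range(8, end, 4)
    ]
    return {
        "version": version,
        "type": msg_type,  # 1 means ADVERTISEMENT
        "vrid": vrid,
        "priority": priority,
        "adver_int": adver_int,
        "checksum": checksum,  # reported as-is, not validated here
        "addresses": addresses,
    }


# Hypothetical advertisement; the checksum is left as zero for the example.
sample = bytes([0x21, 10, 100, 1, 0, 1, 0, 0]) + bytes([10, 10, 10, 2]) + bytes(8)
info = parse_vrrp_v2(sample)
print(info["vrid"], info["priority"], info["addresses"])
```

If both routers claim to be master, comparing the group number and priority in the advertisements each side actually receives is a quick way to tell whether the packets are arriving at all.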

Synchronizing the Connection Tables

As mentioned before, just because the two routers can now share a virtual IP address and a virtual MAC address doesn’t make seamless failover possible. Connection state is an important commodity that must be known by both routers before the backup will be able to cover for the primary in the event of failover. VyOS does implement a method to perform this, using the conntrack-sync module. Beneath the hood, this module makes use of conntrackd to do the work, just as the VRRP implementation in VyOS relies on keepalived.

Conntrack-sync can best be understood by looking at how the two routers will ultimately synchronize the connection state. The router’s own connection table is referred to as the internal connection table while connections synchronized from partner routers form the external connection table. A failover mechanism is specified in the conntrack-sync configuration, indicating how the connections will be failing over. Likely, though I have yet to test this, when the group fails over, matching connections will move from the external to the internal connection tables and the new primary router will relay packets as if nothing had happened.
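The internal/external split can be pictured with a toy model. The Python sketch below is not how conntrackd is implemented; it only mirrors the behaviour I guessed at above, which I have not tested. Every name in it (ToyRouter, track, sync_to, become_master) is invented for the illustration, and a real connection entry holds far more state than a single string.

```python
from dataclasses import dataclass, field

# A flow is identified here by (protocol, source, destination) only.
Flow = tuple[str, str, str]


@dataclass
class ToyRouter:
    name: str
    internal: dict[Flow, str] = field(default_factory=dict)  # own connections
    external: dict[Flow, str] = field(default_factory=dict)  # learned from peer

    def track(self, flow: Flow, state: str) -> None:
        """Record a connection this router is handling itself."""
        self.internal[flow] = state

    def sync_to(self, peer: "ToyRouter") -> None:
        """Publish this router's internal table to the peer's external table."""
        peer.external = dict(self.internal)

    def become_master(self) -> None:
        """On failover, adopt the peer's connections as our own."""
        for flow, state in self.external.items():
            self.internal.setdefault(flow, state)
        self.external.clear()


primary = ToyRouter("RouterA")
backup = ToyRouter("RouterB")
primary.track(("icmp", "192.168.101.50", "8.8.8.8"), "ESTABLISHED")
primary.sync_to(backup)   # backup now knows about the flow, but does not own it
backup.become_master()    # after failover the flow sits in backup's own table
print(backup.internal)
```

The source address 192.168.101.50 is a made-up internal host. The point is only that the backup holds the primary's connections in a separate table until the failover mechanism tells it to take them over.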

I got my router pair synchronizing their connection tables with a simple set of commands. You are required to specify which interface they will communicate over and which sync group the synchronization is for. Past that, settings are available to tune the table state for your environment. My environment is fairly small, so some of the settings I used might be excessive.

set service conntrack-sync event-listen-queue-size '8'
set service conntrack-sync failover-mechanism vrrp sync-group 'RouterCore'
set service conntrack-sync interface 'eth0'
set service conntrack-sync sync-queue-size '1'

Routing Flaws

Where things fall down is in the routing decision VyOS makes about which egress point to send traffic from. In the tcpdump capture below, you can see me ping from an internal host (not shown) through the router. The capture is taken on the external interface/VLAN, after masquerading. The virtual address for this particular interface is 10.10.10.2, but the capture shows that the router is actually using its own unique address to forward the packets.

vyos@routera:~$ sudo tcpdump -n -i eth0.2 'icmp and host 8.8.8.8'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0.2, link-type EN10MB (Ethernet), capture size 65535 bytes
22:38:59.078391 IP 10.10.10.3 > 8.8.8.8: ICMP echo request, id 1, seq 38, length 40
22:38:59.090693 IP 8.8.8.8 > 10.10.10.3: ICMP echo reply, id 1, seq 38, length 40
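The same check can be scripted when there are more than a couple of packets to read. The Python sketch below pulls the source address out of each ICMP echo request line in tcpdump output of the form shown above and compares it with the virtual address. The regular expression and function name are my own and only handle this particular output format.

```python
import re

# Matches lines such as:
# 22:38:59.078391 IP 10.10.10.3 > 8.8.8.8: ICMP echo request, id 1, seq 38, length 40
ECHO_REQUEST = re.compile(
    r"IP (\d+\.\d+\.\d+\.\d+) > (\d+\.\d+\.\d+\.\d+): ICMP echo request"
)


def echo_request_sources(capture: str) -> set[str]:
    """Return the set of source addresses seen on ICMP echo requests."""
    return {match.group(1) for match in ECHO_REQUEST.finditer(capture)}


capture = """\
22:38:59.078391 IP 10.10.10.3 > 8.8.8.8: ICMP echo request, id 1, seq 38, length 40
22:38:59.090693 IP 8.8.8.8 > 10.10.10.3: ICMP echo reply, id 1, seq 38, length 40
"""
virtual_address = "10.10.10.2"

sources = echo_request_sources(capture)
print(sources)
# prints {'10.10.10.3'}
print(virtual_address in sources)
# prints False
```

A result of False here is the symptom described above: outbound traffic is leaving from the router's unique address rather than the virtual one.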

This routing flaw is not insignificant and will break seamless failover. I will continue to explore some solutions in Part 3 of this series.