Over the last few weeks (possibly a month), the principal architect at work has been diligently working on converting our single staging environment to a dual staging environment.
The company I work for has two datacenters in the US (geographically separated, yay..!), and over the past few months we have been working on various projects that have required more than a single development or staging environment. This is where our principal architect comes in: his mission has been to convert our whole Puppet code base, as well as countless servers in our home office, to a two-site format.
He conquered this mountain of a task; however, over the last few weeks he ran into an issue that took various forms and cost many, many late nights of troubleshooting.
Our dual staging environment uses SRX 650s in a cluster, with EX switches and MX routers to properly emulate our production environments (with some budget-constrained network substitutions). It’s cool to see how it’s come together, being able to properly test every piece of our server and network architecture in a closed environment.
Then yesterday a co-worker approached me and explained the following. Part of our production network includes various L2 point-to-point links that provide redundancy and quick communication between the datacenters. He had gotten to the step of setting up this L2 connection and had cabled up the two networks, but was having issues: once the cables were connected on either side, his attempts to ping either endpoint were futile. Curious, I logged in to both SRX clusters and ran mtr toward either side, and sure as hell nothing was getting through. I took the task on and began tracing everything, making notes of which port plugged into which.
After gathering all of the data I began looking through logs on the SRXs, but nothing. I started monitors on either side, watching the VLAN that each was bound to. This is where things started looking weird. Let’s call the two sides side “A” and side “B”: side A showed ARP requests both in and out, but side B showed only inbound requests and no outbound.
Since we were emulating a poorly designed network (it was done by a dev with aspirations of being a sysadmin), the connections would go:
Weird, right? Oh well, it works and that’s what we have to deal with until the next refresh.
So I decided to add L3 interfaces on the switches, to see if I could spot any problems at the lowest connection point. After doing so I found that the pings between the EX clusters were STILL dropping horribly. Going one step further, I cut the connections from either side to the SRX clusters on that VLAN, and to my surprise I could then ping between the L3 interfaces on the EX clusters.
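For reference, an L3 interface (RVI) on an EX for a test VLAN looks roughly like this; the VLAN name, unit number, and addresses here are made up for illustration:

```
# Side A EX cluster (hypothetical names and addressing)
set vlans staging-p2p vlan-id 100
set vlans staging-p2p l3-interface vlan.100
set interfaces vlan unit 100 family inet address 10.255.0.1/30

# Side B EX cluster mirrors it with the other end of the /30
set interfaces vlan unit 100 family inet address 10.255.0.2/30
```

With an address on each side, you can ping switch-to-switch and take the SRXs out of the equation entirely.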
The next step was adding one SRX cluster back into the fold to see what would happen. Starting with side A, I enabled the VLAN on the aggregate interfaces back to the cluster and watched the pings intently.
Huh? Pings still worked, AND I could ping from the side B EX to the side A SRX IP. Blasphemy. So I enabled the side B VLANs on the trunk and watched as pings went to >90% packet loss. Frantically I reverted the B side, then decided to switch them: enabling side B and disabling side A’s SRX cluster to see what would happen.
The same thing! Pings were fine, flawless even, between the side B SRX cluster and the side A EX cluster. What the heck. For science’s sake I re-enabled the side A SRX and confirmed the ping failure was back.
My next step had me poking around in the side B EX cluster. I ran:
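On an EX, the learned MAC table is displayed with:

```
show ethernet-switching table
```

This lists every MAC address the switch has learned, along with the VLAN and interface each one was learned on.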
As the MAC addresses of everything the switch had learned scrolled past, an odd MAC address caught my eye:
Odd, right? Seems. Mechanical. So I dug into it, matching all instances of that MAC. The first thing I noticed was that this MAC appeared on BOTH EX clusters, and it carried the EXACT name of the aggregate interface from the side B SRX.
I could clear out the MAC table and the entry would still reappear, building back up across all of the VLANs it was apparently associated with. It just didn’t feel right.
One of the useful things you can do on Juniper equipment (and this might be available from other vendors too, I’m sure) is that by issuing start shell you can drop into a stripped-down BSD shell, giving you access to the filesystem and various basic commands, including mtr.
I decided I wanted to see what would happen with an mtr running between each node. There was something odd about how quickly things degraded when I enabled both sides, and it didn’t make sense just looking at a ping. So I disabled the side B SRX VLAN and started an mtr from the side A SRX to the side B EX, watching as everything worked perfectly while showing one hop (which is expected). I then re-enabled the side B SRX and found that the mtr was showing the IPs of BOTH the side B AND side A SRX.
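For anyone who hasn’t dropped into the shell before, the sequence looks something like this (the target IP is a placeholder):

```
user@srx-a> start shell
% mtr --report 10.255.0.2
```

The --report flag runs mtr non-interactively for a fixed number of cycles and prints a per-hop summary, which is handy on a box where you can’t keep an interactive curses display open.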
It started making more sense to me, well, only a little. I postulated that the issue had something to do with clustering: that a MAC address from one side was confusing the opposite side’s SRX and keeping it from replying properly. Seeing both IPs in the mtr made the lack of connectivity more plausible; if there is a MAC conflict, you can bet you’re going to lose connectivity.
I started googling for dual SRX clusters causing issues, and for MAC address conflicts with SRX clusters, where I found the following.
The poster had the EXACT same issue I was facing. I couldn’t believe it, so I looked over the cluster IDs of each cluster, and sure enough they both had a cluster ID of 1. Unbelievable.
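If you want to check this on your own gear, the cluster ID is printed at the top of the output of:

```
show chassis cluster status
```

along with the priority and redundancy-group status of each node.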
In that forum topic they link to a page; here is the one Juniper KB, as well as another that talks about the need for unique cluster IDs and how you can change said ID on each node.
After changing the cluster ID of the side B SRX cluster to 2 and rebooting it, I watched the system INSTANTLY pick up: pings between the SRX cluster endpoints were working flawlessly. The coworker who had spent weeks chasing ghost issues due to this couldn’t believe it; such a small thing caused such a HUGE headache (such as random VPNs flapping because of conflicts with our remote sites, unrelated to the staging environment).
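For the record, the change is an operational-mode command run on each node of the cluster, and the reboot is required for the new ID to take effect:

```
{node 0} user@srx-b> set chassis cluster cluster-id 2 node 0 reboot
{node 1} user@srx-b> set chassis cluster cluster-id 2 node 1 reboot
```

Plan for a maintenance window: both nodes come back with new virtual MACs on their reth interfaces, so anything with a stale ARP entry will need to relearn them.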
TL;DR – If you decide to put more than one SRX cluster on the same layer 2 network, or anywhere close to it, change the cluster ID! Better yet, give every cluster a unique ID regardless; Juniper gives you a big enough ID space. Juniper generates the reth MAC addresses from the cluster ID, so if you have the same cluster ID across multiple clusters, you’re going to have a bad day.