One of my customers increased their VMware Cloud on AWS environment from just a few Nodes to several Clusters over time.
The sizing was done “back in the day”, not only from a compute but also from a networking perspective, including the required Direct Connect bandwidth.
Everything worked as expected and everyone was super happy.
One day though, an additional (VDI) cluster was added and the first complaints about latency and longer login times (among other things) were recorded.
We first checked the DX link(s) and, sure enough, found congestion.
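(If you want to check this yourself: Direct Connect publishes per-connection metrics to CloudWatch. Below is a minimal Python/boto3 sketch that pulls the egress bitrate for a connection and flags intervals close to the 100 Mbps line rate; the connection ID, region, and 90% threshold are placeholder assumptions, not values from this environment.)

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-central-1")  # placeholder region

LINK_SPEED_BPS = 100_000_000  # the 100 Mbps hosted connection described above

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/DX",
    MetricName="ConnectionBpsEgress",  # check ConnectionBpsIngress as well
    Dimensions=[{"Name": "ConnectionId", "Value": "dxcon-xxxxxxxx"}],  # placeholder ID
    StartTime=datetime.now(timezone.utc) - timedelta(hours=24),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Maximum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    utilization = point["Maximum"] / LINK_SPEED_BPS * 100
    flag = "  <-- congested" if utilization > 90 else ""
    print(f"{point['Timestamp']:%Y-%m-%d %H:%M}  {utilization:5.1f}%{flag}")
```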
Having more and more Nodes in the SDDC and more Users accessing these resources led to higher bandwidth needs.
In this architecture, the Customer was using two 100 Mbps DX links in an active/standby configuration, provided by a 3rd party (as Hosted VIFs).
Unfortunately, the link speed of a Hosted VIF connection can’t be changed, meaning a new DX link needs to be requested, configured, and added.
My Customer went ahead and requested two new 200 Mbps links, configured the AWS side of things, and we were able to accept the new connections within the SDDC.
But for some reason, we were unable to use the new links: their BGP status was still “down”.
After double-checking all AWS-side settings (incl. ASN, MTU, IP ranges, ..), we figured it had to be related to the on-premises router.
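For reference, those AWS-side settings can be pulled in one go with boto3. This is a hedged sketch, assuming it runs against the AWS account that actually owns the virtual interfaces; the region is a placeholder:

```python
import boto3

dx = boto3.client("directconnect", region_name="eu-central-1")  # placeholder region

for vif in dx.describe_virtual_interfaces()["virtualInterfaces"]:
    print(f"{vif['virtualInterfaceId']} ({vif['virtualInterfaceState']})")
    print(f"  customer ASN : {vif['asn']}")
    print(f"  Amazon ASN   : {vif.get('amazonSideAsn')}")
    print(f"  MTU          : {vif.get('mtu')}")
    print(f"  peering      : {vif.get('customerAddress')} <-> {vif.get('amazonAddress')}")
    for peer in vif.get("bgpPeers", []):
        # bgpStatus is the session state as AWS sees it: 'up', 'down' or 'unknown'
        print(f"  BGP peer     : {peer['bgpPeerState']} / {peer['bgpStatus']}")
```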
Looking at the BGP configuration on the on-premises router, we found that both the new and the “old” links had their local preference set to 200.
Ha! Both links had the same preference and were therefore “fighting” over who was in charge.
Changing the preference on the “old” link from “BGP_SET_LOCPREF_200” to the default (100) solved the issue, and the SDDC was now accessible via the newly attached DX links.
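For illustration, here is what such a change could look like when pushed programmatically: a minimal Netmiko sketch, assuming a Cisco IOS router and the route-map name from above, with host and credentials as placeholders. Your platform and policy syntax will likely differ.

```python
# Hypothetical sketch: reset the local preference on the "old" link's
# inbound policy back to the default (100) by removing the explicit set
# clause. "BGP_SET_LOCPREF_200" is the route-map named in the text;
# the host and credentials are placeholders.
from netmiko import ConnectHandler

router = {
    "device_type": "cisco_ios",
    "host": "192.0.2.1",      # placeholder management IP
    "username": "admin",
    "password": "REDACTED",
}

commands = [
    "route-map BGP_SET_LOCPREF_200 permit 10",
    "no set local-preference 200",  # falls back to the default of 100
]

with ConnectHandler(**router) as conn:
    print(conn.send_config_set(commands))
    conn.save_config()
    # verify: both neighbors should now show an established session
    print(conn.send_command("show ip bgp summary"))
```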
The secondary link, used as standby, was configured with BGP_ASPATH_PREPEND, so it was perfectly fine and did not interfere.
After confirming that everything was working as expected, the “old” links were removed (a “delete” operation within the SDDC, as well as deletion from the AWS console) to reduce costs.
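If you script your cleanup, the AWS-console part of that deletion maps to a single Direct Connect API call; the VIF ID and region below are placeholders:

```python
# Cleanup sketch: remove the now-unused virtual interface via boto3.
# "dxvif-xxxxxxxx" stands in for the old hosted VIF's ID.
import boto3

dx = boto3.client("directconnect", region_name="eu-central-1")
dx.delete_virtual_interface(virtualInterfaceId="dxvif-xxxxxxxx")
```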
This again shows that even minor misconfigurations can have a big impact.
I can only emphasise double-checking all settings in advance of any change on the networking side, to avoid running into a potential loss of connectivity to the SDDC.
Thank you for reading, and I sure hope this helps others who are seeing issues with their BGP status when adding new Direct Connect links to their SDDC.
To the Cloud! 🙂