While some wholeheartedly believe in not connecting sites with ANY type of layer 2, and I am actually a bigger believer in that now than I used to be, customers still ask for and “require” this occasionally – namely for workload mobility. No answer I get and nothing I read actively promotes using an overlay such as VXLAN between data centers. The objections usually center on (1) BUM traffic control, (2) ARP localization, (3) traffic tromboning (since there is only one active default gateway), and (4) STP isolation. If you want to know all of the typical responses, look at the benefits of OTV. But again, in a world that will soon be eaten by software, why can’t a viable solution be developed for L2 DCI with overlays?
Maybe it’s about having Active/Active default gateways. That is one of the many attributes of OTV, but it too can be done with various types of FHRP filters. If a controller were local to each data center and had awareness of where each VM was, I still don’t see why it wouldn’t be feasible to directly affect forwarding of local hosts to achieve active/active outbound paths (with or without FHRP filters). As it is, ARP proxying is a function always talked about by anybody selling a controller (or virtual switch) these days as a way to minimize ARP traffic. Well, for the average mid-size data center, minimizing ARP traffic isn’t top of mind, but if it offers a real business solution for disaster recovery or A/A data centers, someone may listen. In fact, I have never had a customer ask me, “Jason, I am seeing too many ARPs on my network – can you help?” But yes, it will give a cleaner and more efficient network. Remember, the average customer is not Facebook, Google, or Amazon. However, I do ultimately think they are driving technology forward for the rest of us in a positive way.
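To make the idea a bit more concrete, here is a minimal sketch in Python of a per-site controller that knows where each VM lives and answers ARP requests for the shared default gateway with the MAC of the gateway in its own site. Everything here (the `SiteController` class, its fields, the IP and MAC values) is hypothetical and for illustration only; it is not any vendor's controller API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SiteController:
    """Hypothetical per-site controller; not any vendor's actual API."""
    site: str
    gateway_ip: str          # shared default-gateway IP on the stretched subnet
    local_gateway_mac: str   # MAC of the gateway that lives in this site
    vm_site: Dict[str, str] = field(default_factory=dict)  # VM IP -> site

    def learn(self, vm_ip: str, site: str) -> None:
        """Record (or update, after a migration) where a VM currently runs."""
        self.vm_site[vm_ip] = site

    def proxy_arp(self, sender_ip: str, target_ip: str) -> Optional[str]:
        """Return the MAC to answer with, or None to leave the ARP alone."""
        if target_ip != self.gateway_ip:
            return None  # only gateway ARPs are handled in this sketch
        if self.vm_site.get(sender_ip) != self.site:
            return None  # sender is not known to be local to this site
        return self.local_gateway_mac


# Illustrative values only.
dc1 = SiteController("dc1", "10.0.0.1", "00:00:5e:00:01:01")
dc2 = SiteController("dc2", "10.0.0.1", "00:00:5e:00:01:02")

dc1.learn("10.0.0.50", "dc1")
print(dc1.proxy_arp("10.0.0.50", "10.0.0.1"))  # answered with dc1's gateway MAC

# After the VM moves to dc2, both controllers update their view.
dc1.learn("10.0.0.50", "dc2")
dc2.learn("10.0.0.50", "dc2")
print(dc2.proxy_arp("10.0.0.50", "10.0.0.1"))  # now answered by dc2's gateway
```

The point of the sketch is only that the decision is a table lookup: once the controller knows which site a VM is in, steering its outbound traffic to a local gateway does not require flooding the ARP across the DCI.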
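Is it about Spanning Tree reduction? If so, why not use the best SDN controller out there to filter BPDUs at the DCI edge, matching on their well-known destination MAC (standard BPDUs are LLC frames, so there is no Ethertype to match on), with some type of proactive flow instantiation?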
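As a rough sketch of what that proactive filtering could look like, the Python below identifies BPDUs by destination MAC and builds drop rules for a DCI-facing port. IEEE BPDUs are sent to 01:80:C2:00:00:00 as 802.3/LLC frames and Cisco PVST+ BPDUs to 01:00:0C:CC:CC:CD, so the match is on destination MAC rather than Ethertype. The rule dictionary format and the `bpdu_drop_rules` helper are assumptions made up for this example, not the schema of any real controller.

```python
# Well-known destination MACs used by spanning tree.
IEEE_STP_MAC = bytes.fromhex("0180c2000000")    # 802.1D / RSTP / MSTP
CISCO_PVST_MAC = bytes.fromhex("01000ccccccd")  # PVST+
BPDU_MACS = (IEEE_STP_MAC, CISCO_PVST_MAC)


def is_bpdu(frame: bytes) -> bool:
    """True if the frame's destination MAC is a spanning-tree address."""
    return len(frame) >= 6 and frame[:6] in BPDU_MACS


def mac_str(mac: bytes) -> str:
    """Format raw MAC bytes as aa:bb:cc:dd:ee:ff."""
    return ":".join(f"{b:02x}" for b in mac)


def bpdu_drop_rules(dci_port: int, priority: int = 1000) -> list:
    """Build drop rules to install before any traffic arrives.

    The dict layout is a hypothetical schema; an empty action list
    stands for "drop" here.
    """
    return [
        {
            "priority": priority,
            "match": {"port": dci_port, "eth_dst": mac_str(mac)},
            "actions": [],
        }
        for mac in BPDU_MACS
    ]


for rule in bpdu_drop_rules(dci_port=48):
    print(rule)
```

Because the rules are generated up front rather than in reaction to a packet hitting the controller, the first BPDU to reach the DCI port is already covered.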
Is it about split brain (split subnet) in case a link breaks? Get redundant paths between sites and use a solution like LISP or “fail-open” to host-based routing? Not feasible for large DCs? That’s okay – start somewhere.
A question that is still always asked of Network Virtualization and Controller vendors is, “what if your controller fails?” It’s a layup of a question if you ask me. They all pretty much support redundancy, but what type is the real question. Are they active/active? What happens when the original controller comes back online? How synchronous are the databases? Is it a parent/child or pub/sub type relationship? Is it a cluster or simple active/standby?
Is the question and function of controller clustering and synchronization the real reason why we can’t use overlays between data centers (if we must use them at all)? If so, I have two comments on that:
First, hire better database people :). As I said already, maybe much can be learned from Meraki because of their platform on the back end. I trust that their platform, and those of other SaaS companies, have pretty slick and potentially proprietary DB sync techniques. Also, Infoblox comes to mind - they have one of the slickest management platforms and also deploy a two-phase commit to sync between grid members. That is some of their secret sauce. Can this technology be leveraged for something like controller sync? Maybe they can get involved with OpenDaylight.
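For anyone who has not run into it, here is a bare-bones sketch of the generic two-phase commit pattern in Python, applied to a made-up controller table. It is a textbook illustration under simplifying assumptions (in-memory members, no timeouts, no coordinator failure handling) and says nothing about how Infoblox or any controller vendor actually implements it; the `Member` class and `two_phase_commit` function are hypothetical names.

```python
class Member:
    """Hypothetical cluster member holding a copy of the controller state."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy
        self.committed = {}   # state visible to the data plane
        self.staged = None    # update held between prepare and commit

    def prepare(self, update: dict) -> bool:
        """Phase 1: stage the update and vote yes, or vote no."""
        if not self.healthy:
            return False
        self.staged = dict(update)
        return True

    def commit(self) -> None:
        """Phase 2 (success): make the staged update visible."""
        self.committed.update(self.staged)
        self.staged = None

    def abort(self) -> None:
        """Phase 2 (failure): throw the staged update away."""
        self.staged = None


def two_phase_commit(members: list, update: dict) -> bool:
    """Apply the update on every member or on none of them."""
    votes = [m.prepare(update) for m in members]  # phase 1: collect votes
    if all(votes):
        for m in members:
            m.commit()
        return True
    for m in members:
        m.abort()
    return False


dc1, dc2 = Member("dc1-controller"), Member("dc2-controller")
print(two_phase_commit([dc1, dc2], {"10.0.0.50": "dc2"}))  # True

dc2.healthy = False  # simulate an unreachable member
print(two_phase_commit([dc1, dc2], {"10.0.0.51": "dc1"}))  # False
print(dc1.committed)  # {'10.0.0.50': 'dc2'}
```

The trade-off is visible even in the toy version: state never diverges between sites, but a single unreachable member blocks every update, which is exactly why inter-site latency and link availability matter for this kind of sync.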
Second, I’d rather see vendors have requirements for a L2 DCI solution using overlays just like we see in any other given technology. To have a Call Manager cluster between sites, it has latency requirements. To have storage replication between sites, there are other requirements of latency and bandwidth. Many clients may have dedicated links between data centers for some other application requirement. How about the A2Q process Cisco employs for Contact Center? I'm simply saying make it part of a formal planning & design process.
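As a small sketch of what a formal check might look like during planning, the Python below compares a measured inter-site link against a list of published requirements. The `DciRequirement` class, the `check_link` function, and every number in it are placeholders invented for illustration; they are not actual requirements from Cisco or any other vendor.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DciRequirement:
    """One hypothetical vendor requirement for the inter-site link."""
    name: str
    max_rtt_ms: float          # highest acceptable round-trip time
    min_bandwidth_mbps: float  # lowest acceptable bandwidth


def check_link(rtt_ms: float, bandwidth_mbps: float,
               requirements: List[DciRequirement]) -> List[str]:
    """Return a list of failures; an empty list means the link qualifies."""
    failures = []
    for req in requirements:
        if rtt_ms > req.max_rtt_ms:
            failures.append(
                f"{req.name}: RTT {rtt_ms} ms exceeds {req.max_rtt_ms} ms")
        if bandwidth_mbps < req.min_bandwidth_mbps:
            failures.append(
                f"{req.name}: {bandwidth_mbps} Mbps is below "
                f"{req.min_bandwidth_mbps} Mbps")
    return failures


# Placeholder numbers, not real vendor requirements.
requirements = [
    DciRequirement("controller sync", max_rtt_ms=10, min_bandwidth_mbps=1000),
    DciRequirement("storage replication", max_rtt_ms=5, min_bandwidth_mbps=10000),
]

for problem in check_link(rtt_ms=8, bandwidth_mbps=10000,
                          requirements=requirements):
    print(problem)
```

With these placeholder values the link passes the controller sync requirement but fails storage replication on latency, which is the kind of answer a design review should produce before anyone stretches a VLAN.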
So, is it too easy for vendors to demand requirements to have a functional solution that offers L2 data center interconnect that could enable Active/Active data centers? Selling active/active data centers to senior management is an easier sell than trying to sell them on cool tech.
What do you think?
Happy Friday!
Thanks,
Jason
Twitter: @jedelman8