Since there aren’t that many articles out there on the subject, I figured I’d share what I was able to find.
Architecting a Network for Hadoop – by Stuart Miniman
Considering 10GE Hadoop clusters and the network – by Brad Hedlund
Big Data in the Enterprise: Network Design Considerations – by Cisco
Dell Force10 Hadoop Network – The Scale out Solution – by Dell (looks like more of Brad’s work)
Network Design Considerations for Hadoop ‘Big Data Clusters’ and the Hadoop File System – by Arista
Arista – Hadoop Cluster Applications – by Arista (I first found this one several months ago and couldn’t locate it again online, so it is attached right below).
For those other SDN aficionados out there, you may like this next one (combines optical switching, SDN, and big data):
Programming Your Network at Run-time for Big Data Applications – by IBM T.J. Watson Research Center and Rice University. Sounds like Plexxi would be a good fit here.
This is not meant to be an exhaustive list of all resources out there, just a starting point. I highly recommend reading Brad’s first and then reading some general Hadoop whitepapers that aren’t focused on the network, to really understand the application itself. As you can see from the documents above, the vendors still recommend their usual switches for Hadoop environments, but as a network person it is important to be knowledgeable when interfacing with the application folks. You should always try to be relevant and do your best to understand the applications that ride over the network. Those applications are why we build the networks in the first place.
For those stepping into a meeting in the next hour who need a few high-level bullet points, here you go:
- Understand the number of servers, the number per rack, and how many NICs per server are to be used
- Understand the expected overall growth of the cluster
- It’s common to deploy TOR switches – either 1 or 2 per rack based on server density and importance of the cluster to the business (just like in other parts of the DC)
- Hadoop has no dependencies on Layer 2 connectivity – use Layer 3 whenever possible
- Small deployments (just a few racks) can take advantage of L3 in the aggregation layer
- Larger deployments will have L3 down to the TOR switch
- 1GE is most common in today’s Hadoop environments, but 10GE is gaining traction
- Hadoop environments are largely all bare metal – no virtualization
- Build out to be non-blocking whenever possible; large amounts of data will be shuffled across the network if there is a “rack” failure, e.g. a single TOR switch failing
- Keep in mind that the traffic is bursty – use switches with optimized buffers
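To put a number on the "non-blocking" bullet before that meeting, here is a rough sketch that computes the oversubscription ratio of a single TOR switch from the server count, NICs per server, and uplinks. The function name and the example rack sizes are made up for illustration and are not taken from any of the documents above. It also only compares server-facing bandwidth against uplink bandwidth, so it says nothing about the switch fabric or buffers.

```python
def oversubscription_ratio(servers_per_rack, nics_per_server, nic_gbps,
                           uplinks, uplink_gbps):
    """Return server-facing bandwidth divided by uplink bandwidth for one TOR.

    A result of 1.0 or lower means the uplinks can carry everything the
    servers can send (non-blocking at the TOR, under this simple model).
    """
    downlink = servers_per_rack * nics_per_server * nic_gbps
    uplink = uplinks * uplink_gbps
    if uplink <= 0:
        raise ValueError("need at least one uplink with positive bandwidth")
    return downlink / uplink


# Hypothetical rack A: 40 servers, one 1GE NIC each, four 10GE uplinks
print(f"{oversubscription_ratio(40, 1, 1, 4, 10):.1f}:1")   # 1.0:1

# Hypothetical rack B: 20 servers, two 10GE NICs each, four 40GE uplinks
print(f"{oversubscription_ratio(20, 2, 10, 4, 40):.1f}:1")  # 2.5:1
```

Rack A comes out non-blocking at the TOR, while rack B is oversubscribed 2.5:1, which is the kind of gap worth raising when discussing growth and 10GE adoption.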
Follow me on Twitter: @jedelman8