What is the current best networking architecture used by the major hyperscaler datacenters? I'm specifically asking about the hyperscalers that are optimized for massive distributed AI training using more than 50,000 GPUs or so.
Is spine-and-leaf the best architecture for a datacenter with heavy east-west traffic?
How do they connect 50K or even 100K+ GPUs (recently in xAI's datacenter by Elon Musk and his team) on a single network?
Spine-and-leaf architecture requires every spine switch to connect to every leaf (Top-of-Rack) switch and vice versa.
But how do you connect more leaf switches than a spine switch has ports?
How do you connect 100k+ NVIDIA GPUs to all the spine switches?
I'm not able to understand this.
Every typical resource on the internet shows a spine-and-leaf diagram where they simply connect every leaf to every spine. What if there are more leaf switches than ports on a spine switch?
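To make my confusion concrete, here is the back-of-the-envelope math as I understand it, assuming identical switches of radix k everywhere and a non-blocking (1:1) design. These numbers are illustrative, not from any vendor, so please correct me if the reasoning is wrong:

```python
# Rough capacity math for a 2-tier leaf-spine vs. a 3-tier fat-tree (Clos),
# assuming every switch has the same radix k (port count) and the fabric is
# non-blocking (half of each leaf's ports face hosts, half face uplinks).

def two_tier_hosts(k: int) -> int:
    """Max host ports in a 2-tier leaf-spine: each leaf uses k/2 ports down
    (hosts) and k/2 ports up (one to each spine); each spine needs one port
    per leaf, so there can be at most k leaves."""
    leaves = k               # limited by the spine switch's port count
    hosts_per_leaf = k // 2  # half the leaf ports face hosts
    return leaves * hosts_per_leaf

def three_tier_hosts(k: int) -> int:
    """Max host ports in a classic 3-tier fat-tree: k pods, each with k/2
    leaves and k/2 aggregation switches, giving k^3/4 host ports total."""
    return (k ** 3) // 4

for k in (64, 128):  # e.g. 64- or 128-port switch classes
    print(f"radix {k}: 2-tier ~{two_tier_hosts(k):,} host ports, "
          f"3-tier fat-tree ~{three_tier_hosts(k):,} host ports")
```

If that math is right, a flat 2-tier design can never get anywhere near 100k GPUs on realistic switch radixes, which is exactly what I can't reconcile with the simple diagrams I keep finding.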
I did some research and came across this paper:
Use of BGP for Routing in Large-Scale Data Centers
Is this the same architecture the hyperscaler cloud providers use?
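From what I can tell, that paper (published as RFC 7938) builds a multi-stage Clos fabric and runs eBGP with private ASNs as the only routing protocol. Below is my reading of its ASN scheme as a small sketch; the specific ASN numbers are placeholders I made up, so treat this as my interpretation rather than a reference config:

```python
# My reading of the ASN scheme from RFC 7938: eBGP only, private ASNs,
# the Tier-1 spine layer shares one ASN, all Tier-2 switches inside a
# cluster share one ASN, and ToR ASNs are reused across clusters (relying
# on the Allowas-in behavior the RFC discusses). Base numbers are my own
# placeholders from the 16-bit private range.

TIER1_ASN = 64600        # one shared ASN for the whole Tier-1 spine layer
TIER2_BASE = 64700       # one ASN per cluster for its Tier-2 switches
TOR_BASE = 65000         # ToR ASNs, reused in every cluster

def tier2_asn(cluster_id: int) -> int:
    return TIER2_BASE + cluster_id

def tor_asn(tor_id_in_cluster: int) -> int:
    # The same ToR position gets the same ASN in every cluster.
    return TOR_BASE + tor_id_in_cluster

# e.g. cluster 3, ToR 12:
print(TIER1_ASN, tier2_asn(3), tor_asn(12))   # 64600 64703 65012
```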
I'm trying to design a datacenter architecture myself that could be deployed beyond 100k+ GPUs in a single huge facility (for learning purposes 🙂). I couldn't find any resources on how to do this.
So I'm looking for answers to the following questions:
- How do you connect 10k+ racks (100k+ GPUs inside them) in a spine-and-leaf architecture handling heavy east-west traffic, if that is the right architecture? If it isn't, please point that out.
- How is cable management done? I've seen the NVIDIA DGX SuperPod picture.
- They have compute nodes and management nodes.
- How do they connect these in a cluster? Say I want to connect 10 SuperPods, how do they connect those SuperPods in a spine-and-leaf architecture? (Any idea how many cables run from one SuperPod to another? My rough attempt at that math is in the sketch below.)
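For that last question, here is my own back-of-the-envelope attempt, with the assumptions spelled out in the comments (node count per pod and NIC speed are my guesses, not official SuperPod numbers), so you can see where my reasoning might be off:

```python
# Back-of-the-envelope for "how many wires between SuperPods", under
# assumptions I'm making up: 8 GPUs per compute node, one 400G
# compute-fabric NIC per GPU (rail-optimized), 32 nodes per SuperPod-like
# building block, and a non-blocking fat-tree above the pods, so each pod
# needs as much uplink bandwidth toward the core as it has NIC bandwidth.

GPUS_PER_NODE = 8
NODES_PER_POD = 32          # assumption; real SuperPod unit sizes may differ
NIC_GBPS = 400
PODS = 10

nics_per_pod = GPUS_PER_NODE * NODES_PER_POD   # 256 compute-fabric NICs
uplinks_per_pod = nics_per_pod                 # 1:1 (non-blocking) uplinks
total_pod_uplinks = uplinks_per_pod * PODS

print(f"{uplinks_per_pod} x {NIC_GBPS}G uplinks leave each pod "
      f"({uplinks_per_pod * NIC_GBPS / 1000:.0f} Tb/s per pod); "
      f"{total_pod_uplinks} links total toward the core for {PODS} pods")
```

If that's roughly right, each pod-to-core bundle is hundreds of cables just for the compute fabric (before storage and management networks), but I'd really appreciate confirmation or corrections from people who have built or seen one of these.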