Podcast Summary - Scaling Data Center Networks | Between 0x2 Nerds
Podcast - Scaling Data Center Networks - Dmitry Afanasiev
Summary:
- Introduction to Scaling in Data Centers: The hosts introduce the topic of scaling challenges in data centers, framing it as a series of topics to be discussed over multiple episodes.
- Guest Speaker - Dmitry Afanasiev: Dmitry, a network architect at Yandex, explains the scaling challenges Yandex faces, which are similar to those of hyperscalers like Google or Amazon.
- Challenges in Scaling: The discussion covers issues such as limiting failure domains, MPLS in the data center, and managing large data center networks with technologies like RIFT.
- Networking Technologies: Dmitry explains technologies such as MPLS with label distribution via ARP and delves into Yandex’s use of RIFT (Routing In Fat Trees) to improve scaling.
- Scaling Strategies: Different network architectures are discussed, such as leaf-spine topologies and the use of high-radix switches, which improve scalability without adding significant complexity.
- In-Network Compute: The potential of in-network compute, especially for machine learning, where operations such as collective communication can be offloaded to network devices to improve performance.
- Large Data Centers & Metrics: Dmitry mentions Yandex’s goal of scaling to 100k endpoints in a single data center and discusses metrics such as tail latency, failure tolerance, and efficient resource utilization.
- Energy and Power Constraints: Data centers face power limitations as they grow, with Dmitry noting that Yandex’s large data centers are capped at around 50 megawatts.
- Future of Data Center Networking: Discussion of advanced network topologies beyond traditional Clos networks, with interest in approaches such as Dragonfly and global adaptive routing to improve performance.
Insights Based on Numbers:
- 100k endpoints: Yandex is aiming to scale its data centers to 100,000 endpoints, a huge challenge in terms of both network design and power consumption.
- 50 megawatts: Power constraints limit large data centers to approximately 50 MW per facility, leading to the need for multiple distributed campuses.
- Latency improvements: In-network compute could significantly improve latency by reducing data transfers within the network infrastructure.
Yandex’s approach to scaling data centers, in comparison to other hyperscalers like Google or Amazon, reveals some key similarities and differences:
- Similar Challenges: Like Google and Amazon, Yandex faces typical hyperscaler challenges such as managing massive networks, limiting failure domains, and optimizing resource utilization. These include addressing network congestion, scaling limitations, and the complexity of distributed applications running across thousands of servers.
- Unique Innovations: Yandex, despite being smaller in scale than Google or Amazon, has been highly innovative. One example is their early adoption of MPLS (Multi-Protocol Label Switching) in the data center, which is not commonly used by all hyperscalers. They also experimented with distributing labels using ARP (Address Resolution Protocol), showcasing their willingness to push network design forward for scaling.
- Scaling Constraints: While Google and Amazon have larger global footprints, Yandex targets similarly sized facilities, with a goal of 100k endpoints per data center, limited by energy consumption to around 50 MW. Yandex is also expanding beyond Russia into markets in the U.S. and Europe, potentially closing the gap in global presence.
Thus, Yandex aligns with other hyperscalers in terms of scaling strategies but differentiates itself with some innovative networking solutions specific to its operational needs.
The discussion of high-radix switches in modern data centers emphasizes both their benefits and challenges:
Benefits:
- Improved Scalability: High-radix switches, which have more ports per switch, allow for larger network topologies without needing to add additional layers of switches. This enables data centers to scale more efficiently, supporting more endpoints with fewer devices.
- Reduced Complexity: By using fewer network levels, high-radix switches simplify the network architecture. This reduces the complexity associated with managing additional spine and leaf layers, making the overall design more efficient and easier to maintain.
- Better Load Distribution: With more ports, high-radix switches allow for better distribution of traffic across the network. This helps prevent congestion and ensures that data moves smoothly through the system, improving network performance.
Challenges:
- Cost and Power: High-radix switches are more expensive and consume more power due to the larger number of ports and the need for advanced hardware. In large-scale data centers, this can become a significant cost factor, especially when multiplied across thousands of switches.
- Complexity in Cabling: Managing the physical cabling of high-radix switches is challenging, especially in dense environments. The increased number of ports requires careful planning to avoid messy or inefficient cabling setups.
- Latency Concerns: While high-radix switches reduce the need for multiple layers in the network, they may introduce higher latency if not managed properly, particularly when dealing with longer paths or larger distances within the data center.
Overall, high-radix switches provide significant advantages in terms of scalability and efficiency, but they also introduce challenges related to cost, power, and physical complexity.
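To make the scalability point concrete, here is a rough back-of-the-envelope sketch (my own illustration, not a calculation from the episode) of how many hosts a non-blocking folded-Clos fabric can support for a given switch radix and number of tiers:

```python
def max_endpoints(radix: int, tiers: int) -> int:
    """Rough host count for a non-blocking folded Clos built from identical
    switches with `radix` ports: half the ports face down, half face up,
    so each additional tier multiplies the fan-out by radix / 2."""
    return radix * (radix // 2) ** (tiers - 1)

for radix in (32, 64, 128):
    print(f"radix {radix:>3}: 2-tier {max_endpoints(radix, 2):>7,} hosts, "
          f"3-tier {max_endpoints(radix, 3):>9,} hosts")
```

With 128-port switches a three-tier fabric already exceeds the 100k-endpoint target mentioned earlier, while lower-radix switches would need an extra tier; this is the core of the argument that higher radix avoids additional network layers.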
The discussion about in-network compute and its impact on machine learning workloads in data centers focuses on the following points:
- Reduced Latency: In-network compute allows some parts of collective operations (e.g., reductions and aggregations) to be handled directly within the network hardware. This reduces the amount of data that needs to be moved between servers, cutting down on latency and improving the overall speed of machine learning tasks.
- Optimized Bandwidth Use: By performing computations like collective communication in-network, the need for back-and-forth data transfers between nodes is minimized. This leads to more efficient use of network bandwidth, especially in data-intensive operations like training AI models, where large amounts of data are typically exchanged.
- Improved Performance: Offloading tasks to network devices allows compute resources (like GPUs) to focus on core machine learning operations, while the network handles supporting tasks. This separation of concerns leads to better performance in large-scale machine learning tasks, as both the compute and network layers are optimized for their respective roles.
In-network compute is particularly beneficial for distributed machine learning environments, where reducing traffic and improving coordination between GPUs is crucial for scaling workloads efficiently.
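To make the data-reduction idea concrete, here is a toy model (purely illustrative, not code discussed in the episode) of a switch that sums the gradient vectors arriving on its downlinks and forwards only the reduced result upstream, so the uplink carries a fraction of the raw traffic:

```python
import random

def switch_side_reduce(per_worker_grads):
    """Toy in-network aggregation: element-wise sum of the vectors received
    on the downlinks, forwarded upstream as a single reduced vector."""
    reduced = [sum(values) for values in zip(*per_worker_grads)]
    bytes_in = sum(len(g) for g in per_worker_grads) * 4   # assume fp32 values
    bytes_out = len(reduced) * 4
    return reduced, bytes_in, bytes_out

workers, grad_len = 8, 1024
grads = [[random.random() for _ in range(grad_len)] for _ in range(workers)]
_, b_in, b_out = switch_side_reduce(grads)
print(f"downlink bytes in: {b_in:,}  uplink bytes out: {b_out:,} "
      f"({b_in // b_out}x less traffic leaving the switch)")
```

In a real deployment the aggregation runs in switch ASICs or SmartNICs rather than in software, but the traffic arithmetic is the same: the more workers share a switch, the larger the reduction on the upstream links.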
Podcast - Scaling Data Center Networks - Part 2 - Dmitry Afanasiev
Summary:
- Continuation on Data Center Scaling: The session picks up where the first discussion on scaling left off, covering the topologies and designs used at hyperscalers like Yandex, including Clos or folded-Clos topologies with three stages.
- Use of Multiple Planes: Yandex uses eight planes for traffic distribution, where each plane is independent of the others, providing better fault tolerance and traffic management.
- Cabling Infrastructure: Dmitry elaborates on the physical constraints of data centers, describing the need for optimized cable management to deal with the long distances between equipment. Different segments use a mix of single-mode optics, multi-mode optics, and twin-ax copper cables depending on distance.
- Challenges of Future Optics: As data speeds increase to 800Gb/s, multi-mode optics will become less effective, forcing a move to single-mode optics for longer links. Additionally, within racks, middle-of-rack switches will optimize cable lengths and efficiency.
- Managing Complexity: The growing complexity of cable and switch management requires careful planning and hierarchical design to keep bundles of cables consistent and reduce physical complexity.
- Cross-Connects and Space Utilization: Dmitry explains how optical cross-connects are used to manage cable infrastructure within data centers, giving a visual sense of how physically large these setups can be.
- Power and Cooling Requirements: Cooling and power density requirements for network devices are growing, with switches requiring separate environments from the compute infrastructure to manage airflow and heat more effectively.
- Routing and Advertisement: Yandex uses a simple, prefix-based routing scheme, avoiding aggregation where possible to reduce routing complexity. IPv6 is used to allocate large, stable prefixes within the data center to minimize the need for dynamic management.
- Weighted ECMP: Weighted Equal-Cost Multi-Path (ECMP) routing is discussed as an approach to improving traffic distribution. However, due to limited vendor support at the time, Yandex decided against using it.
Insights Based on Numbers:
- 8 independent planes: Yandex’s data center design utilizes eight separate planes for traffic distribution, improving redundancy and fault isolation.
- Up to 100 meters between switch segments: Physical distance in data centers is a key challenge, with single-mode optics becoming more common as distances exceed what multi-mode optics can handle.
- 16,000 prefixes: Yandex manages up to 16,000 prefixes in its routing tables, relying on IPv6 to allocate large address blocks and reduce routing overhead.
Yandex’s management of cable infrastructure in large data centers highlights several key strategies for optimizing space and reducing complexity:
- Segmented Cabling: Yandex divides its cabling into three segments:
  - The first segment connects hosts to the top-of-rack switches.
  - The second segment runs between the top-of-rack switches and the spine switches.
  - The third segment links spine switches to higher layers in the topology, such as super-spines.
  These segments help organize the physical layout and prevent cabling issues that can arise due to long distances between components.
- Cable Types: Different cable technologies are used depending on the distance:
  - Twin-axial copper cables are used for short distances within racks.
  - Multi-mode optical fibers are employed for medium distances, but as speeds increase (e.g., to 800Gb/s), multi-mode optics are becoming less effective.
  - Single-mode optics are used for long distances, ensuring high data transfer speeds across hundreds of meters in the data center.
- Hierarchical Bundling: To manage the complexity of thousands of cables, Yandex groups cables into bundles that are easier to manage and route. These bundles connect similar endpoints and travel consistent paths, which simplifies both the physical installation and future upgrades or repairs.
- Optical Cross-Connects: To further reduce complexity, Yandex uses optical cross-connects, which act as an intermediary to manage cable bundles. This allows for better organization of the physical infrastructure and simplifies the connections between different network layers.
In sum, Yandex tackles the challenges of large data center infrastructure with careful planning of cabling segments, varied cable types, and hierarchical designs that reduce physical complexity and improve manageability.
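A minimal sketch of the distance-driven media selection described above (the thresholds below are illustrative placeholders, not Yandex’s actual cut-offs):

```python
def pick_medium(distance_m: float, speed_gbps: int) -> str:
    """Illustrative media choice by link length; real cut-offs depend on the
    optics generation and the exact speed."""
    if distance_m <= 3:
        return "twin-ax copper (DAC)"
    if distance_m <= 100 and speed_gbps < 800:
        return "multi-mode fiber"
    return "single-mode fiber"

for dist, speed in [(2, 400), (50, 400), (50, 800), (300, 800)]:
    print(f"{dist:>4} m @ {speed}G -> {pick_medium(dist, speed)}")
```

The 800G rows illustrate the trend Dmitry describes: as speeds rise, links that multi-mode fiber used to cover move over to single-mode optics.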
The discussion about the challenges hyperscalers face in cooling and powering their network infrastructure as they scale reveals several key points:
- High-Density Power Consumption: As data centers scale, especially their network infrastructure, the power requirements become substantial. Hyperscalers like Yandex deal with high power densities in switches and other network equipment, which can consume a large portion of the available power within a data center. This is particularly challenging because switches are often located separately from compute resources and need their own dedicated power feeds.
- Cooling Requirements: Network equipment, especially high-radix switches, generates significant heat, and the high density of transceivers and cables blocks airflow through the devices. Unlike compute servers, switches don’t have as much open space for airflow, making effective cooling harder. Yandex uses separate rooms with enhanced cooling systems to manage this heat, isolating network equipment from other data center components to ensure adequate airflow and temperature control.
- Distributed Placement for Redundancy: To mitigate risks and ensure network reliability, Yandex places critical network equipment like super-spine switches in different locations across the data center. This separation ensures that if one power zone or cooling system fails, the other sections remain operational, improving overall network resilience.
- Free Cooling: Yandex also uses free cooling (outside air) for its servers, allowing temperatures in the data center’s cold corridor to rise as high as 35°C (95°F). However, the network devices require better cooling environments than what is provided for standard servers, further complicating infrastructure scaling.
The combination of increasing power needs, the challenge of effective cooling, and the need for physical separation to prevent cascading failures highlights the significant obstacles hyperscalers face in scaling their data center network infrastructure.
The explanation of why Yandex chose not to implement Weighted Equal-Cost Multi-Path (ECMP) routing highlights both the challenges and potential benefits of this traffic distribution method:
Reasons for Not Implementing Weighted ECMP:
- Vendor Support Limitations: At the time Yandex considered using weighted ECMP, vendor support for the feature was limited and inconsistent. Yandex’s network environment involved multiple vendors, and implementing weighted ECMP across different vendors' hardware presented technical challenges. The immature support made the feature unreliable for large-scale deployment.
- Increased Complexity: Weighted ECMP introduces additional complexity in network routing. It requires more advanced configurations and adds a layer of traffic management that can be difficult to maintain in large, multi-vendor environments. Yandex preferred to keep their routing system simpler, avoiding unnecessary complexity.
- Coarse Granularity: Implementing weighted ECMP with the chipsets available at the time resulted in coarse granularity for traffic distribution. This meant that the distribution of traffic between paths was not fine-tuned enough to provide meaningful improvements. For example, balancing traffic across eight uplinks did not offer sufficient flexibility for certain network loads, limiting the effectiveness of weighted ECMP.
Potential Benefits of Weighted ECMP:
- Optimized Traffic Balancing: When properly supported and implemented, weighted ECMP allows more intelligent traffic distribution based on link capacity. This helps to avoid overloading specific links while underutilizing others, improving overall network efficiency.
- Reduced Congestion: By distributing traffic based on weights, network congestion can be mitigated, especially in scenarios where certain paths have more bandwidth or lower latency than others.
- Scalability: For large data centers, weighted ECMP offers the potential for more scalable traffic management by dynamically adjusting traffic flows as network conditions change.
In conclusion, while weighted ECMP offers clear benefits for traffic distribution, Yandex avoided implementing it due to immature vendor support, added complexity, and the coarse granularity of available implementations at the time.
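The granularity issue is easy to see with a small sketch (my own illustration, not a description of any specific chipset): if hardware can only approximate weights by replicating next-hop entries in a fixed-size group, small groups give only coarse splits.

```python
from collections import Counter

def build_wcmp_group(weights, group_size):
    """Approximate desired per-next-hop weights with an ECMP-style group of
    `group_size` replicated entries; smaller groups give coarser splits."""
    total = sum(weights)
    entries = []
    for next_hop, weight in enumerate(weights):
        entries += [next_hop] * max(1, round(weight / total * group_size))
    return entries[:group_size]

desired = [70, 30]                    # want a 70/30 split across two uplinks
for size in (8, 16, 64):
    counts = Counter(build_wcmp_group(desired, size))
    group_len = sum(counts.values())
    split = [round(100 * counts[nh] / group_len) for nh in range(len(desired))]
    print(f"group size {size:>2}: achieved split {split[0]}/{split[1]} %")
```

With a group of 8 entries the best achievable split is 75/25 rather than 70/30, which is the kind of coarseness that made weighted ECMP unattractive at the time.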
Podcast - Scaling Data Center Networks - Part 3 - Dmitry Afanasiev
Summary:
- Introduction to Networking Discussion: Jeff Doyle and co-host Jeff Tantsura introduce the topic of scaling data centers, focusing on networking technologies like ECMP and MPLS. They discuss how these technologies are evolving to handle large-scale data centers.
- MPLS in Data Centers: The team discusses the IETF draft for Labeled ARP (LARP), which allows hosts to receive both an IP address and an MPLS label upon boot. This improves efficiency by enabling immediate encapsulation of traffic in MPLS.
- Scaling and ECMP: The episode examines how ECMP (Equal-Cost Multi-Path Routing) works well in data centers unless individual flows exceed roughly 40% of the upstream bandwidth. AI and machine learning workloads, with their long-lasting, high-rate flows (e.g., 100Gb/s or 400Gb/s), pose unique challenges.
- EVPN and Control Planes: Different control planes are used in enterprise and hyperscale environments. Hyperscalers have not widely deployed EVPN (Ethernet VPN), whereas enterprises commonly rely on EVPN over VXLAN for network overlays.
- Traffic Engineering in AI Workloads: As data center workloads evolve, particularly for AI, traffic engineering becomes more complex. Hyperscalers face challenges as flows grow longer and require more sophisticated routing.
- Routing and Convergence: Dmitry explains how BGP convergence in multi-plane architectures can lead to temporary routing inefficiencies, stressing the importance of large buffers and careful traffic balancing to avoid congestion.
- Disaggregation and Aggregation Challenges: The team discusses challenges with disaggregation, particularly handling multi-homing and avoiding black holes in large, multi-path networks. They explore the difficulty of aggregating routes when failures occur.
Insights Based on Numbers:
- 50% chance of wrong path: If MPLS routing relies on default routes during failure, there’s a 50% chance of picking the wrong path due to topology asymmetry.
- 1-2 millisecond RTT: For data centers interconnected over roughly 100 kilometers, round-trip time ranges from 1 to 2 milliseconds (a rough fiber-delay estimate follows after this list).
- 100,000+ route scale: Modern BGP routing in data centers supports more than 100,000 routes, allowing extensive scaling of network fabrics.
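The lower end of that RTT range follows almost entirely from propagation delay in fiber. A quick back-of-the-envelope check (assuming a typical refractive index of about 1.47 for single-mode fiber):

```python
C_VACUUM_KM_PER_S = 299_792   # speed of light in vacuum, km/s
FIBER_INDEX = 1.47            # typical refractive index of single-mode fiber

def fiber_rtt_ms(distance_km: float) -> float:
    """Round-trip propagation delay over fiber, ignoring equipment and queuing delay."""
    one_way_s = distance_km / (C_VACUUM_KM_PER_S / FIBER_INDEX)
    return 2 * one_way_s * 1000

print(f"{fiber_rtt_ms(100):.2f} ms RTT over 100 km of fiber")  # ~0.98 ms
```

Propagation alone is about 1 ms at 100 km; transceiver, forwarding, and queuing delays push real-world figures toward the 2 ms end of the quoted range.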
The discussion about MPLS (Multi-Protocol Label Switching) improving traffic encapsulation in large-scale data centers emphasizes several key benefits:
- Immediate Label Assignment: Through the Labeled ARP (LARP) draft discussed at the IETF, MPLS allows a host to receive both an IP address and an MPLS label during boot-up. This enables the host to immediately start encapsulating its traffic in MPLS, eliminating delays and increasing efficiency. It ensures that traffic is routed through MPLS from the start, without needing additional setup steps.
- Efficient Traffic Engineering: MPLS supports precise traffic engineering, which is essential in large-scale data centers where traffic flows can be complex and varied. By using MPLS labels, data centers can direct traffic along pre-defined paths, optimizing bandwidth and minimizing congestion.
- Scalability and Network Segmentation: MPLS allows data centers to scale their networks more effectively. By encapsulating traffic with labels, MPLS facilitates the creation of logical network segments that can be managed independently. This improves overall network organization and makes it easier to scale without causing routing or traffic issues.
In summary, MPLS significantly improves how large data centers manage and route traffic, providing better scalability, efficiency, and flexibility for handling complex workloads.
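For readers less familiar with the encapsulation itself, the sketch below packs a single MPLS label stack entry in the standard RFC 3032 layout (20-bit label, 3-bit traffic class, bottom-of-stack bit, 8-bit TTL); the label value is just an example, not one tied to the LARP draft:

```python
import struct

def mpls_label_stack_entry(label: int, tc: int = 0, bos: bool = True, ttl: int = 64) -> bytes:
    """Encode one 32-bit MPLS label stack entry: label(20) | TC(3) | S(1) | TTL(8)."""
    assert 0 <= label < 2**20 and 0 <= tc < 8 and 0 <= ttl < 256
    word = (label << 12) | (tc << 9) | (int(bos) << 8) | ttl
    return struct.pack("!I", word)

# A host that learned label 1234 at boot would prepend this 4-byte entry to its
# IP packets before handing them to the fabric.
print(mpls_label_stack_entry(1234).hex())   # 004d2140
```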
The discussion about the challenges of ECMP (Equal-Cost Multi-Path Routing) in handling AI and machine learning traffic highlights a few critical issues:
- Long-Lasting Flows: In AI and machine learning workloads, data flows are not typical short bursts but rather long-lasting, high-bandwidth flows. For instance, during inference or training tasks, flows can utilize 100Gb/s or even 400Gb/s of bandwidth. ECMP, which works well in traditional traffic scenarios, struggles with these longer, heavier flows, as they can overwhelm specific paths, leading to congestion.
- Flow Size Disparities: AI workloads often generate disproportionate flows. ECMP is designed to balance traffic evenly across multiple paths. However, when a few flows consume a significant portion of the available bandwidth, ECMP may not distribute traffic effectively, resulting in underutilization of some paths while others become overloaded.
- Limited Path Diversity: In ECMP, multiple paths must be available and have the same cost to the destination. However, in AI data centers, these long flows can monopolize specific paths, reducing path diversity. As AI workloads grow, it becomes harder to find equal-cost paths, which limits ECMP’s effectiveness in large-scale environments.
These challenges make ECMP less effective for AI and machine learning workloads, pushing data centers to explore alternative routing methods to handle the unique demands of these high-throughput, long-duration traffic patterns.
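A small simulation (my own illustration, not from the episode) of why hash-based ECMP copes poorly with a few long-lived elephant flows: many small flows spread evenly across uplinks, but a handful of 100Gb/s training flows land on whichever links they hash to and overload them.

```python
import zlib
from collections import defaultdict

def ecmp_link(flow_id: str, num_links: int) -> int:
    # Hash-based path selection: every packet of a flow takes the same uplink.
    return zlib.crc32(flow_id.encode()) % num_links

def link_loads_gbps(flows, num_links):
    loads = defaultdict(float)
    for flow_id, gbps in flows:
        loads[ecmp_link(flow_id, num_links)] += gbps
    return [round(loads[i]) for i in range(num_links)]

mice = [(f"mouse-{i}", 0.1) for i in range(4000)]            # many short flows
elephants = [(f"train-job-{i}", 100.0) for i in range(4)]    # few long ML flows
print("mice only     :", link_loads_gbps(mice, 8))
print("with elephants:", link_loads_gbps(mice + elephants, 8))
```

The first line is close to an even 50 Gb/s per link; in the second, the links that happen to receive the elephant flows jump far above the rest, which is exactly the imbalance described above.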
The discussion of how disaggregation affects network resilience and routing in data centers highlights several complexities:
- Black Holes and Route Failure: In large multi-path fabrics, failures can create routing black holes when aggregate routes continue to attract traffic for destinations whose specific paths have failed, or when disaggregated (more-specific) routes fail to propagate correctly, leaving destinations unreachable. This is a major concern in multi-path architectures, especially in large-scale data centers.
- Symmetry in Aggregated Routes: Large fabrics rely on route aggregation to simplify routing tables and reduce overhead. However, if the network loses its symmetry (e.g., through hardware failure or path inconsistencies), these aggregated routes may no longer be valid, forcing routers to rely on disaggregated paths that are less efficient or harder to manage.
- Increased Routing Complexity: While disaggregation provides flexibility and allows for more granular control over network traffic, it also increases routing complexity. Disaggregating paths means routers must handle more specific routes, which can put extra strain on the network’s routing tables and make it harder to quickly recover from failures.
Overall, disaggregation can enhance flexibility but can introduce challenges in maintaining network resilience and routing efficiency, particularly when failures or asymmetric paths are involved.
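The black-hole problem can be seen with a toy longest-prefix-match table (hypothetical prefixes and next-hop names, purely for illustration): an aggregate advertised by two spines keeps sending traffic toward a spine that has lost its link to one rack, until a more-specific disaggregated route steers that rack’s prefix around the failure.

```python
import ipaddress

def lpm(table, dst):
    """Longest-prefix-match lookup: the most specific matching prefix wins."""
    dst = ipaddress.ip_address(dst)
    matches = [(pfx, nh) for pfx, nh in table.items() if dst in pfx]
    return max(matches, key=lambda m: m[0].prefixlen)[1] if matches else None

# Leaf view: both spines advertise the pod aggregate, but spine-1 has actually
# lost its link to rack 10.1.7.0/24.
table = {ipaddress.ip_network("10.1.0.0/16"): ["spine-1", "spine-2"]}
print(lpm(table, "10.1.7.10"))   # ['spine-1', 'spine-2'] -> half the flows black-hole

# Disaggregation: a more-specific route steers the affected rack via spine-2 only.
table[ipaddress.ip_network("10.1.7.0/24")] = ["spine-2"]
print(lpm(table, "10.1.7.10"))   # ['spine-2'] -> the failed path is avoided
```

This also illustrates the 50%-wrong-path figure quoted above: with two equal-cost next hops and one of them broken, hashing alone sends about half of the affected traffic into the black hole until disaggregation or withdrawal takes effect.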
Podcast - Scaling Data Center Networks - Part 4 - Dmitry Afanasiev
Summary:
- Scaling Data Center Networks: The discussion opens with an overview of scaling issues in data center networks, focusing on advanced, non-traditional topologies and network models that could improve efficiency.
- Machine Learning Workloads: Machine learning clusters are introduced as a critical area requiring new network designs, as their large, synchronized data flows demand intelligent routing and congestion control.
- Advanced Network Topologies: Non-shortest-path routing, Dragonfly Plus, and Slim Fly topologies are explored for their potential to reduce network diameter and improve efficiency for large-scale networks.
- Data Synchronization Challenges: The importance of synchronized data transfers between GPUs in machine learning models, highlighting issues such as flow size, packet reordering, and buffer constraints.
- In-Network Compute: A key concept where data aggregation and processing occur within the network itself, reducing load on end devices and improving overall performance in machine learning tasks.
Insights Based on Numbers:
- 100Gb/s per flow: This figure shows the magnitude of individual flows in machine learning workloads, emphasizing the need for high-capacity, low-latency networks.
- 50Tb/s switches: The emergence of high-performance switches capable of handling roughly 50 terabits per second marks a significant milestone in data center evolution.
- 800Gb/s interfaces: Future network advancements will include 800-gigabit interfaces, drastically increasing throughput and reducing bottlenecks in data-intensive environments.
What the video says about how advanced topologies like Dragonfly Plus improve network efficiency:
Dragonfly Plus is highlighted as a scalable network topology that improves efficiency by reducing the overall network diameter. In traditional Clos networks, data may have to traverse several switch hops between endpoints, increasing latency. Dragonfly Plus reduces this diameter, meaning data travels fewer hops, speeding up communication.
By allowing communication within fewer network layers, Dragonfly Plus enhances performance, particularly in high-performance computing (HPC) and machine learning clusters where large volumes of synchronized data transfers occur. This improvement reduces the time spent in data synchronization between nodes, essential for large machine learning tasks requiring heavy parallelism.
Dragonfly Plus also allows the use of fewer resources (nodes and links) while maintaining large-scale network sizes, making it both cost-effective and efficient for complex network environments.
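One common way to see the resource argument is a rough sizing sketch (my own illustration, with the simplifying assumption of exactly one global link between every pair of groups, which is not necessarily how any production deployment is wired):

```python
def dragonfly_plus_hosts(leaf_down: int, leaves: int, spines: int, spine_up: int) -> int:
    """Rough Dragonfly+ sizing: each group is a small leaf/spine Clos; spine
    uplinks are global links, and with one global link between every pair of
    groups the topology supports (spines * spine_up + 1) groups."""
    groups = spines * spine_up + 1
    hosts_per_group = leaves * leaf_down
    return groups * hosts_per_group

# 32-port switches split 16 down / 16 up inside each group:
print(dragonfly_plus_hosts(leaf_down=16, leaves=16, spines=16, spine_up=16))  # 65,792
```

With the same 32-port switches, a three-tier folded Clos tops out around 8k hosts, so the low-diameter, group-based design reaches far larger scale from the same building blocks, at the cost of needing non-minimal or adaptive routing to keep the global links balanced.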
The main challenges in achieving synchronized data transfers between GPUs in machine learning clusters:
One of the key challenges in synchronized data transfers between GPUs is packet reordering. In machine learning tasks, particularly during model training, the data flows between GPUs are large and highly synchronized. Even slight packet reordering can trigger significant slowdowns, as GPUs rely on receiving data in a precise sequence to proceed with computations.
Another challenge is high sensitivity to delays. Since different GPUs must wait for others to complete their tasks before proceeding, any delay from one node can stall the entire process, leading to inefficiencies. This issue is compounded by the fact that machine learning workloads involve many interconnected processes, where a delay in any part of the network halts the whole system.
Finally, network saturation and buffering issues arise due to the high volume of data being transferred. GPUs in modern clusters can handle data at rates of 100 gigabits per second or more, easily saturating the available network capacity and creating bottlenecks that slow down overall computation. Efficient congestion control and adaptive routing mechanisms are critical to managing this load effectively.
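The stalling effect is easy to see in a toy model (purely illustrative numbers): in a bulk-synchronous step, every GPU waits for the slowest transfer before the next compute phase can start, so a single delayed or reordered flow slows the whole job.

```python
def synchronized_step_ms(per_gpu_transfer_ms):
    # Bulk-synchronous behaviour: the step finishes only when the slowest
    # participant has finished its transfer.
    return max(per_gpu_transfer_ms)

gpus = 64
nominal = [10.0] * gpus          # every transfer ideally takes 10 ms
delayed = nominal[:]
delayed[17] = 40.0               # one transfer hit reordering/congestion

print("ideal step   :", synchronized_step_ms(nominal), "ms")
print("one straggler:", synchronized_step_ms(delayed), "ms (4x slower for all 64 GPUs)")
```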
How in-network compute can reduce the load on end devices in large-scale networks:
In-network compute refers to the processing of data directly within the network devices, such as switches, instead of relying solely on end devices like servers or GPUs. This reduces the load on those end devices by offloading tasks such as data aggregation and combination to the network itself.
For machine learning tasks, where large volumes of data need to be exchanged and synchronized between multiple nodes, in-network compute helps by performing intermediate computations during data transfers. For example, instead of sending raw data back and forth between nodes for combination, the network switch can aggregate data in real-time. This minimizes the number of hops and reduces the time spent waiting for data synchronization, ultimately speeding up the process.
Moreover, this approach decreases the overall demand on node-level memory and processing power, since network devices handle some of the workload. This can lead to faster and more efficient operations, particularly in high-performance computing (HPC) environments where computational intensity is high.