Summary of Curbing Connection Churn in Zuul

  • netflixtechblog.com

    The Problem: Excessive Connections in Netflix's Zuul

    Netflix's Zuul gateway, which handles requests and responses, was originally built on the assumption that connections were essentially free. As streaming traffic grew and mTLS (mutual TLS) was adopted, the number of connections rose sharply and began to strain the system.

    • Each event loop maintained its own connection pool for every origin server, so the connection count grew as event loops × origin servers × Zuul instances.
    • For example, a 16-core box (16 event loops) connecting to 800 origin servers held 16 × 800 = 12,800 connections, or 1,280,000 connections across a 100-instance Zuul cluster.
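
    The arithmetic in the example above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name and the `conns_per_pool` parameter are illustrative, and it assumes one connection per pool.

```python
def total_connections(event_loops: int, origins: int,
                      instances: int = 1, conns_per_pool: int = 1) -> int:
    """Connections held when every event loop pools every origin separately."""
    return event_loops * origins * conns_per_pool * instances


per_box = total_connections(event_loops=16, origins=800)
per_cluster = total_connections(event_loops=16, origins=800, instances=100)
print(per_box, per_cluster)  # 12800 1280000
```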

    HTTP/2 Multiplexing: Reusing Connections for Efficiency

    The first step towards reducing connection overhead was implementing HTTP/2 (H2) multiplexing. Multiplexing lets multiple requests share a single connection concurrently, which significantly reduces the number of connections required.

    • Zuul could already proxy H2 but did not support multiplexing until these changes.
    • The H2 connection bootstrap was modified to create a stream and then release the connection back into the pool, so that other requests can reuse it.
    • Ideally, each event loop would converge to one connection per origin server, but with origin clusters as large as they were, multiplexing alone could not be fully exploited.
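
    The "create a stream, then release the connection" idea can be illustrated with a toy pool. This is a minimal sketch, not Zuul or Netty code: `Connection`, `MultiplexedPool`, and `max_streams` are hypothetical names, and the stream limit merely stands in for the HTTP/2 concurrent-stream setting.

```python
from dataclasses import dataclass, field


@dataclass
class Connection:
    origin: str
    max_streams: int = 100
    active_streams: int = 0

    def has_capacity(self) -> bool:
        return self.active_streams < self.max_streams


@dataclass
class MultiplexedPool:
    """Toy per-origin pool: a connection stays reusable while it has capacity."""
    origin: str
    max_streams: int = 100
    connections: list = field(default_factory=list)

    def open_stream(self) -> Connection:
        # Reuse any pooled connection that can still carry another stream.
        for conn in self.connections:
            if conn.has_capacity():
                conn.active_streams += 1
                return conn
        # Otherwise open a new connection carrying its first stream.
        conn = Connection(self.origin, self.max_streams, active_streams=1)
        self.connections.append(conn)
        return conn

    def close_stream(self, conn: Connection) -> None:
        conn.active_streams -= 1


pool = MultiplexedPool("origin-a", max_streams=2)
streams = [pool.open_stream() for _ in range(3)]
print(len(pool.connections))  # 2: the third stream needed a second connection
```

    Without multiplexing, the same three concurrent requests would have needed three connections.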

    Subsetting: Dividing Origin Servers for Balanced Connections

    To address the limitations of multiplexing, the article introduced a new approach: connection subsetting. This involved dividing origin servers into subsets, with each event loop responsible for connecting to only a subset of the origins.

    • This approach allowed a significant reduction in total connections while keeping traffic evenly distributed.
    • The article described the Ringsteady algorithm, based on low-discrepancy sequences, which ensured balanced distribution of origin servers within subsets.
    • This method proved effective in handling both steady-state traffic and connection spikes under load.
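
    To give a feel for what a low-discrepancy placement looks like, the sketch below puts nodes on a unit ring using the golden-ratio additive sequence and cuts the ring into contiguous subsets. It illustrates the general idea only and does not reproduce the article's algorithm; the function names, the choice of sequence, and the contiguous-slice rule are all assumptions.

```python
import math

# Fractional parts of k * 0.618... form a well-known low-discrepancy sequence.
GOLDEN_RATIO_CONJUGATE = (math.sqrt(5) - 1) / 2


def ring_position(index: int) -> float:
    """Position in [0, 1) of the index-th registered node (assumed scheme)."""
    return (index * GOLDEN_RATIO_CONJUGATE) % 1.0


def build_subsets(nodes: list[str], subset_size: int) -> list[list[str]]:
    """Walk the ring in position order and cut it into contiguous subsets."""
    ring = sorted(range(len(nodes)), key=ring_position)
    return [
        [nodes[i] for i in ring[start:start + subset_size]]
        for start in range(0, len(ring), subset_size)
    ]


nodes = [f"origin-{i}" for i in range(10)]
subsets = build_subsets(nodes, subset_size=4)
for subset in subsets:
    print(subset)
```

    Because a node's position depends only on its registration index, adding a node never moves the nodes already on the ring in this sketch.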

    Integrating Subsetting into Zuul's Architecture

    Implementing subsetting meant hooking into Eureka service discovery, so that subsets are created and updated dynamically as origin servers register and change.

    • Zuul used a modified choice-of-2 load balancing algorithm, operating on a subset of nodes assigned to each event loop.
    • The replication factor of subsets across event loops was adjusted based on the size of the origin server cluster, ensuring sufficient coverage for availability.
    • The target subset size was 25-50 nodes, a balance between connection reduction and load-balancing quality.
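
    Choice-of-2 restricted to a subset can be sketched as follows. This is the plain textbook form, not Zuul's modified version; `choose_of_two` and the `in_flight` load map are hypothetical names used only for illustration.

```python
import random


def choose_of_two(subset, in_flight, rng=random):
    """Sample two distinct nodes from the subset and return the less loaded one."""
    if len(subset) == 1:
        return subset[0]
    a, b = rng.sample(subset, 2)
    return a if in_flight.get(a, 0) <= in_flight.get(b, 0) else b


# One event loop's subset and its current in-flight request counts.
subset = ["origin-3", "origin-7", "origin-9"]
in_flight = {"origin-3": 12, "origin-7": 2, "origin-9": 5}
picked = choose_of_two(subset, in_flight)
print(picked)
```

    Since candidates are drawn only from the event loop's own subset, the most loaded node in that subset is never picked in this form: it always loses the comparison.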

    Subsetting's Impact: Dramatic Connection Reduction and Improved Performance

    Subsetting brought substantial improvements in connection counts, churn, and overall system performance.

    • Total connections at peak fell by a factor of 10 across all regions.
    • Churn, measured as the number of TCP connections opened per second, also decreased dramatically.
    • CPU utilization, heap usage, and latency on Zuul were also reduced, indicating improved efficiency.

    Conclusion: Subsetting as a Valuable Optimization Strategy for Netflix

    The article concluded that connection subsetting proved to be a valuable optimization strategy for Netflix's Zuul gateway, achieving a dramatic reduction in connections and improving overall system performance without compromising resiliency or load balancing. This approach, combined with HTTP/2 multiplexing, represents a successful solution for managing connections in large-scale distributed systems.

    Key Takeaways

    The article highlighted several key takeaways for optimizing connections in distributed systems:

    • Leverage HTTP/2 Multiplexing: Reuse existing connections for multiple requests to reduce connection overhead, especially in environments with mTLS.
    • Implement Subsetting: Divide origin servers into subsets to distribute connections evenly across event loops, minimizing the total number of connections and churn.
    • Utilize Low-Discrepancy Sequences: Employ algorithms based on low-discrepancy sequences, like Ringsteady, for stable and balanced distribution of origin servers across subsets.
    • Adjust Replication Factor: Dynamically adjust the replication factor of subsets based on the size of the origin server cluster, ensuring sufficient coverage and availability.
    • Consider a Centralized Load Balancer: Explore the use of a centralized load balancer to optimize load distribution across subsets, though this might introduce complexity.
