Summary of Enhancing Netflix Reliability with Service-Level Prioritized Load Shedding

  • Source: netflixtechblog.com

    Introduction to Netflix's Prioritized Load Shedding

    Netflix is committed to providing a seamless streaming experience to millions of users simultaneously. To achieve this goal, they have introduced the concept of prioritized load shedding, initially implemented at the API gateway level and later extended to individual service levels, focusing on the video streaming control plane and data plane.

    • The evolution of load shedding at Netflix began with prioritizing different types of network traffic at the Zuul API gateway, ensuring critical playback requests receive priority over less critical telemetry traffic.
    • Building on this foundation, Netflix recognized the need to apply similar prioritization logic deeper within their architecture, at the service layer, where different types of requests within the same service could be prioritized differently.
    • The advantages of applying prioritization techniques at the service level include:
      • Service teams can own their prioritization logic and apply finer-grained prioritization.
      • It can be used for backend-to-backend communication, not just services behind the edge API gateway.
      • Services can utilize cloud capacity more efficiently by combining different request types into one cluster and shedding low-priority requests when necessary, instead of maintaining separate clusters for failure isolation.

    Introducing Service-Level Prioritized Load Shedding at Netflix

    PlayAPI, a critical backend service on the video streaming control plane, handles device-initiated manifest and license requests necessary to start playback. Netflix categorizes these requests into two types based on their criticality:

    • User-Initiated Requests (critical): These requests are made when a user hits play and directly impact the user's ability to start watching a show or a movie.
    • Pre-fetch Requests (non-critical): These requests are made optimistically while a user browses content without hitting play, to reduce latency if the user decides to watch a particular title. A failure affecting only pre-fetch requests does not cause a playback failure; it only slightly increases the latency between pressing play and video appearing on screen.

    Implementing Prioritized Load Shedding in PlayAPI at Netflix

    To handle large traffic spikes, high backend latency, or an under-scaled backend service, Netflix implemented a concurrency limiter within PlayAPI that prioritizes user-initiated requests over prefetch requests without physically sharding the two request handlers. This mechanism uses the partitioning functionality of the open-source Netflix/concurrency-limits Java library.

    • User-Initiated Partition: Guaranteed 100% throughput.
    • Pre-fetch Partition: Utilizes only excess capacity.

    In steady state, there is no throttling, and the prioritization has no effect on the handling of pre-fetch requests. The prioritization mechanism only kicks in when a server is at the concurrency limit and needs to reject requests.
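    The partition semantics described above can be sketched as follows. This is a minimal illustration in Python, not the actual Netflix/concurrency-limits Java implementation; the class and method names are hypothetical. The key property is that a partition is throttled only when the limiter is at its concurrency limit *and* that partition has already consumed its guaranteed share, so in steady state nothing is rejected.

```python
class PartitionedLimiter:
    """Sketch of a partitioned concurrency limiter (hypothetical API).

    Each partition is guaranteed a fraction of the total concurrency limit.
    With user-initiated at 1.0 and pre-fetch at 0.0, pre-fetch requests may
    only use excess capacity and are shed first when the limiter fills up.
    """

    def __init__(self, limit, guarantees):
        self.limit = limit                      # total concurrent requests allowed
        self.guarantees = guarantees            # partition -> guaranteed fraction
        self.in_flight = {p: 0 for p in guarantees}

    def try_acquire(self, partition):
        total = sum(self.in_flight.values())
        guaranteed = self.guarantees[partition] * self.limit
        # Throttle only when the limiter is full AND this partition has
        # already used up its guaranteed share of the limit.
        if total >= self.limit and self.in_flight[partition] >= guaranteed:
            return False
        self.in_flight[partition] += 1
        return True

    def release(self, partition):
        self.in_flight[partition] -= 1
```

    With a limit of 2 and guarantees of 1.0 for user-initiated and 0.0 for pre-fetch, a full limiter rejects new pre-fetch requests while user-initiated requests are still admitted.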

    Testing and Validation at Netflix

    To validate the load-shedding implementation, Netflix used Failure Injection Testing to inject 2 seconds of latency into pre-fetch calls, simulating a scenario where a downstream service's pre-fetch cluster is experiencing high latency.

    • Without prioritized load shedding, both user-initiated and pre-fetch availability dropped when latency was injected.
    • With prioritized load shedding, user-initiated requests maintained 100% availability; only pre-fetch requests were throttled.

    Real-World Application and Results at Netflix

    During an infrastructure outage at Netflix that impacted streaming for many users, Netflix experienced a 12x spike in pre-fetch requests per second from Android devices after the outage was fixed, presumably due to a backlog of queued requests.

    • Without prioritized load-shedding, this could have resulted in a second outage as the systems were not scaled to handle the traffic spike.
    • With prioritized load-shedding in PlayAPI, while the availability for prefetch requests dropped as low as 20%, the availability for user-initiated requests was > 99.4%.
    • At one point, Netflix was throttling more than 50% of all requests, but the availability of user-initiated requests continued to be > 99.4%.

    Generic Service Work Prioritization at Netflix

    Based on the success of this approach, Netflix created an internal library to enable services to perform prioritized load shedding based on pluggable utilization measures, with multiple priority levels:

    • CRITICAL: Affects core functionality. These requests are shed only when the system is in complete failure.
    • DEGRADED: Affects user experience. These are progressively shed as load increases.
    • BEST_EFFORT: Does not affect the user. These are served on a best-effort basis and may be shed progressively even in normal operation.
    • BULK: Background work; expect these to be shed routinely.
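    A minimal sketch of how such priority buckets might map to a shedding decision (Python for illustration; Netflix's internal library is not public, so the numeric encoding and the 0-4 load scale are assumptions):

```python
from enum import IntEnum

class Priority(IntEnum):
    # Lower value = more important = shed later. Encoding is hypothetical.
    CRITICAL = 0     # core functionality; shed only in complete failure
    DEGRADED = 1     # affects user experience; shed progressively under load
    BEST_EFFORT = 2  # not directly user-visible; may shed in normal operation
    BULK = 3         # background work; routinely shed

def should_shed(priority: Priority, load_level: int) -> bool:
    """Decide whether to shed a request of the given priority.

    `load_level` is an assumed 0-4 scale: 0 = idle, 4 = complete failure.
    BULK sheds first (load >= 1) and CRITICAL last (load >= 4).
    """
    return load_level >= 4 - priority
```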

    Generic CPU-based Load Shedding at Netflix

    Most services at Netflix autoscale on CPU utilization, so it is a natural measure of system load to tie into the prioritized load shedding framework. Once a request is mapped to a priority bucket, services can determine when to shed traffic from a particular bucket based on CPU utilization.

    • Prioritized shedding begins only after the target CPU utilization is reached; as system load increases further, progressively more-critical traffic is shed in an attempt to preserve the user experience.
    • When a traffic spike causes the cluster's CPU utilization to significantly surpass the target threshold, it will gradually shed low-priority traffic to conserve resources for high-priority traffic.
    • This approach allows more time for auto-scaling to add additional instances to the cluster, and once more instances are added, CPU utilization will decrease, and low-priority traffic will resume being served normally.
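    One plausible shape for this CPU-based progressive shedding is sketched below. This is not Netflix's actual formula; the 60% target utilization, the per-bucket thresholds, and the linear ramp are all illustrative assumptions.

```python
def shed_probability(cpu_util: float, priority: int, target: float = 0.6) -> float:
    """Fraction of a priority bucket to shed at a given CPU utilization.

    priority: 0 = CRITICAL .. 3 = BULK (hypothetical encoding).
    Below the target utilization nothing is shed. Above it, lower-priority
    buckets start shedding earlier and ramp linearly toward shedding the
    entire bucket as utilization approaches 100%.
    """
    # Each bucket starts shedding at its own threshold above the target:
    # e.g. BULK at 60% CPU, BEST_EFFORT at 70%, DEGRADED at 80%, CRITICAL at 90%.
    start = target + 0.1 * (3 - priority)
    if cpu_util <= start:
        return 0.0
    return min(1.0, (cpu_util - start) / (1.0 - start))
```

    This reproduces the behavior described above: nothing is shed below the target, bulk traffic sheds first as utilization climbs, and critical traffic is touched only as the instance approaches saturation, buying time for auto-scaling to add capacity.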

    Generic IO-based Load Shedding at Netflix

    Some services at Netflix are IO-bound by backing services or datastores that can apply back pressure via increased latency when overloaded. For these services, Netflix reuses the prioritized load shedding techniques, introducing new utilization measures to feed into the shedding logic:

    • The service can specify per-endpoint target and maximum latencies, allowing the service to shed when abnormally slow, regardless of the backend.
    • Netflix storage services running on the Data Gateway report observed utilization against their target and maximum latency SLOs, allowing client services to shed load when they overload their allocated storage capacity.

    These utilization measures provide early warning signs that a service is generating too much load to a backend, allowing it to shed low-priority work before overwhelming the backend.
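    Such a latency-based signal can be sketched as mapping observed latency onto a utilization scale that the same shedding logic can consume. The parameter names and the linear mapping are assumptions, not the actual Netflix implementation.

```python
def latency_utilization(observed_ms: float, target_ms: float, max_ms: float) -> float:
    """Map an observed endpoint latency onto a utilization scale.

    Returns 0.0 at or below the per-endpoint target latency and 1.0 at the
    configured maximum latency. Values above 1.0 indicate the endpoint is
    abnormally slow, an early warning to shed low-priority work before the
    backend is overwhelmed.
    """
    if observed_ms <= target_ms:
        return 0.0
    return (observed_ms - target_ms) / (max_ms - target_ms)
```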

    Anti-patterns to Avoid with Load Shedding at Netflix

    Netflix identified two anti-patterns to avoid with load shedding:

    • Anti-pattern 1 — No shedding: In the absence of load-shedding, increased latency can degrade all requests instead of rejecting some requests (that can be retried), and can make instances unhealthy, leading to a death spiral.
    • Anti-pattern 2 — Congestive failure: If the load-shedding is due to an increase in traffic, the successful RPS should not drop after load-shedding. Shedding too aggressively can lead to congestive failure, preventing any work from being done.

    Future Directions and Conclusion at Netflix

    The implementation of service-level prioritized load shedding has proven to be a significant step forward in maintaining high availability and excellent user experience for Netflix customers, even during unexpected system stress.

    Netflix continues to innovate and enhance its resilience strategies to ensure smooth streaming experiences for its users, no matter the challenges faced.
