Summary of Round 2: A Survey of Causal Inference Applications at Netflix

  • netflixtechblog.com

    Netflix's Focus on Causal Inference and Experimentation

    Netflix aims to help both current and future members find content that excites them and keeps them coming back. Data science and engineering support this mission, and causal inference plays a central role. Netflix relies heavily on both experimentation and quasi-experimentation to help its teams make decisions that improve member satisfaction.

    Metrics Projection for Growth A/B Tests at Netflix

    Experimentation is a core principle at Netflix. Whenever a new product feature is introduced, A/B test results are used (where applicable) to estimate the annualized incremental impact on the business.

    • This estimation process traditionally involved manual forecasting of signups, retention probabilities, and cumulative revenue for a year, using monthly cohorts. This process, while effective, was repetitive and time-consuming.
    • Netflix developed a faster, automated approach that relies on estimating two key pieces of missing data:
      • Unobserved billing periods: Treatment effects for periods beyond the initial two billing periods are projected forward using a surrogate-based approach, leveraging Netflix's retention model and short-term observations.
      • Unobserved signup cohorts: Assuming transportability (that the effects measured in one cohort carry over to others), the treatment effects from the first observed cohort are applied to the remaining signup cohorts.
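
    The two projection steps above can be illustrated with a small arithmetic sketch. It is not Netflix's implementation: the function name, the 12-month window, and all numbers are hypothetical, and the projected per-period effects are taken as given rather than produced by a retention model. Under the transportability assumption, every monthly cohort reuses the first cohort's effect curve, truncated to the billing periods that fall inside the year.

```python
def annualized_impact(effect_curve, n_cohorts=12):
    """Sum one cohort's per-period effect curve over monthly signup cohorts.

    effect_curve[k] is the assumed incremental revenue a single cohort
    generates in its (k+1)-th billing period. Cohort c signs up c months
    into the year, so only its first len(effect_curve) - c periods fall
    inside the window.
    """
    horizon = len(effect_curve)
    total = 0.0
    for cohort in range(n_cohorts):
        periods_in_window = max(horizon - cohort, 0)
        total += sum(effect_curve[:periods_in_window])
    return total


# Hypothetical inputs: two observed billing periods, ten projected ones.
observed = [2.0, 1.5]
projected = [1.2] * 10
print(annualized_impact(observed + projected))
```

    The triangular sum reflects that later signup cohorts contribute fewer billing periods to the annual total.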

    A Systematic Framework for Evaluating Game Events

    Netflix Games DSE often faces causal inference questions about how interventions, such as product changes or player acquisition campaigns, affect game performance.

    • Traditional A/B tests are not always feasible for these interventions. Netflix therefore developed a framework based on variations of synthetic control that handles game-level or country-level interventions with limited data.
    • The framework encompasses various synthetic control (SC) models, including:
      • Augmented SC
      • Robust SC
      • Penalized SC
      • Synthetic difference-in-differences
    • A scale-free metric evaluates the performance of each model, selecting the one that minimizes pre-treatment bias. Robustness tests, backdating, and inference measures based on the number of control units are also employed.
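
    The model-selection step can be sketched as follows. The summary does not say which scale-free metric Netflix uses, so this example assumes a mean absolute error normalized by the mean absolute actual value. The function names and data are hypothetical, and the candidate pre-treatment fits are assumed to come from already-fitted SC models.

```python
def scaled_pre_treatment_error(actual, predicted):
    """Mean absolute pre-treatment error divided by the mean absolute actual.

    Dividing by the scale of the series makes the score comparable across
    games or countries with very different metric magnitudes.
    """
    if not actual or len(actual) != len(predicted):
        raise ValueError("series must be non-empty and of equal length")
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
    scale = sum(abs(a) for a in actual) / len(actual)
    if scale == 0:
        raise ValueError("actual series has zero scale")
    return mae / scale


def select_model(actual, candidates):
    """Return the candidate with the smallest scaled error, plus all scores."""
    scores = {
        name: scaled_pre_treatment_error(actual, pred)
        for name, pred in candidates.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical pre-treatment series and fits from two candidate models.
actual = [100.0, 110.0, 120.0]
candidates = {
    "augmented_sc": [101.0, 109.0, 121.0],
    "robust_sc": [90.0, 100.0, 110.0],
}
print(select_model(actual, candidates))
```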

    Double Machine Learning for Weighing Metrics Tradeoffs

    As Netflix ventures into new business verticals, metric tradeoffs in A/B tests become more frequent. To aid decision-makers in navigating such scenarios, Netflix developed a method using Double Machine Learning (DML) to compare the relative importance of different metrics (treatments) in terms of their causal effect on the north-star metric (Retention).

    • The method involves discretizing each treatment into bins and fitting a multiclass propensity score model, enabling the estimation of multiple Average Treatment Effects (ATEs) using Augmented Inverse-Propensity-Weighting (AIPW) to reflect different treatment contrasts.
    • These treatment effects are then weighted by the baseline distribution, providing an apples-to-apples ranking of treatments based on their ATE on the same overall population.
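
    A minimal sketch of the AIPW contrast described above. It assumes the nuisance estimates (multiclass propensities and per-bin outcome predictions) have already been produced, ideally by cross-fitted models as in DML. All names and numbers are hypothetical, not Netflix's implementation.

```python
def aipw_mean(y, t, propensity, outcome_pred, level):
    """AIPW estimate of the mean potential outcome under treatment bin `level`.

    propensity[i][level]   : estimated P(T_i = level | X_i)
    outcome_pred[i][level] : estimated E[Y | T = level, X_i]
    """
    total = 0.0
    for i in range(len(y)):
        m = outcome_pred[i][level]
        e = propensity[i][level]
        indicator = 1.0 if t[i] == level else 0.0
        # Outcome-model prediction plus an inverse-propensity-weighted residual.
        total += m + indicator * (y[i] - m) / e
    return total / len(y)


def aipw_ate(y, t, propensity, outcome_pred, level, baseline=0):
    """ATE of bin `level` versus bin `baseline`, averaged over all units."""
    return (aipw_mean(y, t, propensity, outcome_pred, level)
            - aipw_mean(y, t, propensity, outcome_pred, baseline))


# Hypothetical data: four units, two treatment bins.
y = [1.0, 0.0, 1.0, 1.0]
t = [0, 0, 1, 1]
propensity = [[0.5, 0.5]] * 4
outcome_pred = [[0.5, 1.0]] * 4
print(aipw_ate(y, t, propensity, outcome_pred, level=1))
```

    Because each potential-outcome mean is averaged over every unit rather than only the units observed in that bin, all contrasts refer to the same overall population, which is what makes the resulting ranking comparable across treatments.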

    Survey A/B Tests with Heterogeneous Non-Response Bias

    Netflix leverages survey A/B tests to improve the quality and reach of its survey research, directly testing and validating ideas such as incentive structures, subject lines, message design, and timing of invitations.

    • The challenge lies in non-response bias, where the intervention might distort upstream metrics, affecting the downstream guardrail metrics designed to measure data quality.
    • To address this, Netflix employs a combination of techniques, focusing on conditional average treatment effects (CATE) for specific sub-populations where confounding covariates are balanced.
    • Propensity scores are used to correct for internal validity issues, while iterative proportional fitting addresses external validity concerns, ensuring the surveys accurately represent member opinions.
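
    Iterative proportional fitting (also called raking) can be sketched in a few lines. This is an illustrative version under stated assumptions, not Netflix's code: the respondent attributes, target shares, and function name are hypothetical, and every category present in the data is assumed to have a target share.

```python
def rake(rows, targets, n_iter=200, tol=1e-9):
    """Reweight respondents so weighted marginals match population targets.

    rows    : list of dicts, one per respondent, e.g. {"region": "A", ...}
    targets : {variable: {category: population share}}, shares sum to 1
    """
    weights = [1.0] * len(rows)
    for _ in range(n_iter):
        max_shift = 0.0
        for var, target in targets.items():
            total = sum(weights)
            # Current weighted total per category of this variable.
            current = {}
            for w, row in zip(weights, rows):
                current[row[var]] = current.get(row[var], 0.0) + w
            # Scale each respondent so this variable's marginal hits its target.
            for i, row in enumerate(rows):
                factor = target[row[var]] / (current[row[var]] / total)
                weights[i] *= factor
                max_shift = max(max_shift, abs(factor - 1.0))
        if max_shift < tol:  # all marginals already match
            break
    return weights


# Hypothetical respondents and population targets.
rows = [
    {"region": "A", "tenure": "new"},
    {"region": "A", "tenure": "old"},
    {"region": "B", "tenure": "new"},
    {"region": "B", "tenure": "old"},
    {"region": "A", "tenure": "new"},
]
targets = {
    "region": {"A": 0.5, "B": 0.5},
    "tenure": {"new": 0.4, "old": 0.6},
}
print(rake(rows, targets))
```

    Each pass fixes one variable's marginal at a time; cycling through the variables until the adjustment factors approach 1 aligns all marginals with the targets at once.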

    Design: The Intersection of Humans and Technology

    Design is central to Netflix's experimentation platform: it defines how the product functions and how data is presented to the internal users who run A/B tests. Thoughtful design choices help users interpret data correctly and act on it, which ultimately affects decision-making.

    • Different visualization methods, such as tables, pie charts, stacked bar charts, and bar charts, can be used to effectively convey information about numbers, parts of a whole, progress toward a goal, or comparisons. The choice of presentation depends on the desired takeaway.
    • Thoughtful design extends to interactive experiences, such as pre-filled values or default values with the ability to edit them, ensuring user-friendliness and ease of understanding.
    • Design influences product strategy at Netflix, ensuring that tools effectively facilitate learning from experiments and contribute to a positive member experience.

    External Speaker: Kosuke Imai

    Netflix's Causal Inference and Experimentation Summit also featured renowned scholar Kosuke Imai, who introduced the "cram method", a powerful and efficient approach for learning and evaluating treatment policies utilizing generic machine learning algorithms.

    Importance of Causal Inference at Netflix

    Causal inference is deeply ingrained in Netflix's data science culture. The summit served as a platform to celebrate the work of its dedicated colleagues who leverage both experimentation and quasi-experimentation to drive member impact.

    • The conference highlighted the value that causal methodologies bring to the business, emphasizing their role in making informed decisions and achieving positive outcomes.
