Summary of Recommending for Long-Term Member Satisfaction at Netflix

  • netflixtechblog.com

    Introduction

    Netflix's mission is to entertain the world, and its personalization algorithms play a crucial role in recommending the right content to each member. The goal is to create an experience that brings lasting enjoyment to members, not just short-term engagement.

    Recommendations as Contextual Bandit

    • Netflix views recommendations as a contextual bandit problem, where:
      • The member's visit is the context
      • The system's choice of which recommendations to show is the action
      • The member's feedback signals provide the reward
    • A reward function can be defined from feedback signals such as clicks, completions, and thumbs up/down.
    • A contextual bandit policy can be trained on historical data to maximize the expected reward (a minimal training sketch follows this list).
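    To make the framing concrete, the following is a minimal sketch of how a bandit policy could be trained on logged (context, action, reward) tuples: one reward model per action, with the policy recommending the highest-scoring action for a new context. The names (LoggedImpression, train_policy) and the per-action ridge-regression approach are illustrative assumptions, not Netflix's actual implementation.

    ```python
    from dataclasses import dataclass
    from typing import Callable, List

    import numpy as np
    from sklearn.linear_model import Ridge


    @dataclass
    class LoggedImpression:
        context: np.ndarray   # features of the member's visit (the context)
        action: int           # id of the recommendation that was shown (the action)
        reward: float         # proxy reward computed from the member's feedback


    def train_policy(logs: List[LoggedImpression],
                     n_actions: int) -> Callable[[np.ndarray], int]:
        """Fit one reward model per action; the policy recommends the argmax."""
        models = {}
        for a in range(n_actions):
            X = np.array([log.context for log in logs if log.action == a])
            y = np.array([log.reward for log in logs if log.action == a])
            if len(X) > 0:
                models[a] = Ridge(alpha=1.0).fit(X, y)

        def policy(context: np.ndarray) -> int:
            scores = [models[a].predict(context.reshape(1, -1))[0] if a in models else 0.0
                      for a in range(n_actions)]
            return int(np.argmax(scores))

        return policy
    ```

    A production policy would also need an exploration strategy (e.g. epsilon-greedy or Thompson sampling) so that the logged data covers actions beyond the current argmax.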

    Improving Recommendations: Models and Objectives

    While better input features, architectures, or more data can improve recommendation models, this post focuses on refining the objective (reward function) to better reflect long-term user satisfaction.

    Retention as Reward?

    • Optimizing for retention directly has several drawbacks:
      • Noisy - retention is influenced by many external factors beyond recommendations
      • Low sensitivity - it only moves for members who are already close to cancelling
      • Hard to attribute - a cancellation may follow a long series of recommendations, making credit assignment difficult
      • Slow to measure - at most one signal per account per month

    Proxy Rewards

    • Instead, define a proxy reward r(user, item) as a function of the member's interaction with the recommended item (a toy encoding is sketched after this list).
    • Examples:
      • Click-through rate (CTR) or play-through rate - simple but may promote clickbait
      • Fast season completion with thumbs up - strong sign of enjoyment
      • Thumbs down after completion - low satisfaction despite engagement
      • Discovering new genres - valuable for expanding interests
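    The examples above amount to a hand-crafted reward function over observed feedback signals. The toy encoding below illustrates that idea; the field names and weights are hypothetical assumptions, not Netflix's actual reward definition.

    ```python
    from dataclasses import dataclass


    @dataclass
    class Interaction:
        clicked: bool
        completed_season_quickly: bool
        thumbs_up: bool
        thumbs_down: bool
        new_genre_for_member: bool


    def proxy_reward(x: Interaction) -> float:
        """Map observed feedback signals to a scalar proxy reward."""
        r = 0.0
        if x.clicked:
            r += 0.1   # weak on its own; optimizing clicks alone rewards clickbait
        if x.completed_season_quickly and x.thumbs_up:
            r += 1.0   # fast completion plus thumbs up: strong sign of enjoyment
        if x.thumbs_down:
            r -= 1.0   # low satisfaction despite the engagement
        if x.new_genre_for_member:
            r += 0.3   # discovering a new genre expands the member's interests
        return r
    ```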

    Reward Engineering

    • Reward engineering is the iterative process of refining the proxy reward to align with long-term satisfaction.
    • Each iteration involves forming a hypothesis, defining a new proxy reward, training a new bandit policy, and evaluating it in an A/B test.

    Challenge: Delayed Feedback

    • The feedback signals used in proxy rewards often arrive with a delay (e.g. a member may take weeks to finish a show).
    • Waiting for all delayed feedback before training makes the bandit policy stale and ineffective for newly added items.
    • Solution: Predict the missing delayed feedback from the feedback observed so far and other relevant information (sketched after this list).
    • This requires two types of ML models:
      • Delayed feedback prediction models - predict p(final feedback | observed feedback); their outputs are used to compute proxy rewards when training the bandit policy
      • Bandit policy models - used online to generate recommendations
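    One way to read this split: when the delayed signal has not yet arrived, the training reward can be computed from the prediction model's estimate instead of waiting for the true value. The sketch below is a hypothetical illustration; the signal names and weights are assumptions, not Netflix's actual formulation.

    ```python
    from typing import Optional


    def expected_proxy_reward(clicked: bool,
                              completed_show: Optional[bool],
                              p_complete: float) -> float:
        """Training-time proxy reward that imputes missing delayed feedback.

        completed_show is None while the delayed signal is still unobserved;
        p_complete is the delayed-feedback model's estimate of
        p(show completed | feedback observed so far).
        """
        click_term = 0.1 if clicked else 0.0
        if completed_show is not None:   # delayed feedback already observed
            completion_term = 1.0 if completed_show else 0.0
        else:                            # impute with the prediction model's output
            completion_term = p_complete
        return click_term + completion_term
    ```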

    Challenge: Online-Offline Metric Disparity

    • Better offline metrics don't always translate to improved online metrics (long-term satisfaction).
    • This gap arises when the proxy reward does not fully align with long-term satisfaction.
    • Solution: Refine the proxy reward definition so that offline model improvements also translate into online gains.

    Summary and Open Questions

    • Netflix focuses on defining proxy rewards that align with long-term member satisfaction and on handling delayed feedback.
    • Open questions:
      • Can proxy rewards be learned automatically by correlating behavior with retention?
      • How long to wait for delayed feedback before using predicted values?
      • Can reinforcement learning further align recommendations with long-term satisfaction?
