Summary of Recommending for Long-Term Member Satisfaction at Netflix

  • netflixtechblog.com

    Introduction

    Netflix's mission is to entertain the world, and its personalization algorithms play a crucial role in recommending the right content to each member. The goal is to create an experience that brings lasting enjoyment to members, not just short-term engagement.

    Recommendations as Contextual Bandit

    • Netflix views recommendations as a contextual bandit problem, where:
      • The member's visit is the context
      • The system's choice of which recommendations to show is the action
      • The member's feedback signals provide the reward
    • A reward function can be defined from feedback signals such as clicks, completions, and thumbs up/down.
    • A contextual bandit policy can be trained on historical data to maximize the expected reward (a minimal training sketch follows this list).
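    To make the framing concrete, the following is a minimal sketch of how a bandit policy could be trained on logged (context, action, reward) tuples: one reward model per action, with the policy recommending the highest-scoring action for a new context. The names (LoggedImpression, train_policy) and the per-action ridge-regression approach are illustrative assumptions, not Netflix's actual implementation.

    ```python
    from dataclasses import dataclass
    from typing import Callable, List

    import numpy as np
    from sklearn.linear_model import Ridge


    @dataclass
    class LoggedImpression:
        context: np.ndarray   # features of the member's visit (the context)
        action: int           # id of the recommendation that was shown (the action)
        reward: float         # proxy reward computed from the member's feedback


    def train_policy(logs: List[LoggedImpression],
                     n_actions: int) -> Callable[[np.ndarray], int]:
        """Fit one reward model per action; the policy recommends the argmax."""
        models = {}
        for a in range(n_actions):
            X = np.array([log.context for log in logs if log.action == a])
            y = np.array([log.reward for log in logs if log.action == a])
            if len(X) > 0:
                models[a] = Ridge(alpha=1.0).fit(X, y)

        def policy(context: np.ndarray) -> int:
            scores = [models[a].predict(context.reshape(1, -1))[0] if a in models else 0.0
                      for a in range(n_actions)]
            return int(np.argmax(scores))

        return policy
    ```

    A production policy would also need an exploration strategy (e.g. epsilon-greedy or Thompson sampling) so that the logged data covers actions beyond the current argmax.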

    Improving Recommendations: Models and Objectives

    While better input features, architectures, or more data can improve recommendation models, this post focuses on refining the objective (reward function) to better reflect long-term user satisfaction.

    Retention as Reward?

    • Optimizing for retention directly has several drawbacks:
      • Noisy - retention is influenced by many external factors beyond recommendations
      • Low sensitivity - it only moves for members who are already close to cancelling
      • Hard to attribute - a cancellation may follow a long series of recommendations, making credit assignment difficult
      • Slow to measure - at most one signal per account per month

    Proxy Rewards

    • Instead, define a proxy reward r(user, item) as a function of the member's interaction with the recommended item (a toy encoding is sketched after this list).
    • Examples:
      • Click-through rate (CTR) or play-through rate - simple but may promote clickbait
      • Fast season completion with thumbs up - strong sign of enjoyment
      • Thumbs down after completion - low satisfaction despite engagement
      • Discovering new genres - valuable for expanding interests
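    The examples above amount to a hand-crafted reward function over observed feedback signals. The toy encoding below illustrates that idea; the field names and weights are hypothetical assumptions, not Netflix's actual reward definition.

    ```python
    from dataclasses import dataclass


    @dataclass
    class Interaction:
        clicked: bool
        completed_season_quickly: bool
        thumbs_up: bool
        thumbs_down: bool
        new_genre_for_member: bool


    def proxy_reward(x: Interaction) -> float:
        """Map observed feedback signals to a scalar proxy reward."""
        r = 0.0
        if x.clicked:
            r += 0.1   # weak on its own; optimizing clicks alone rewards clickbait
        if x.completed_season_quickly and x.thumbs_up:
            r += 1.0   # fast completion plus thumbs up: strong sign of enjoyment
        if x.thumbs_down:
            r -= 1.0   # low satisfaction despite the engagement
        if x.new_genre_for_member:
            r += 0.3   # discovering a new genre expands the member's interests
        return r
    ```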

    Reward Engineering

    • Reward engineering is the iterative process of refining the proxy reward to align with long-term satisfaction.
    • Each iteration involves forming a hypothesis, defining a new proxy reward, training a new bandit policy, and evaluating it in an A/B test.

    Challenge: Delayed Feedback

    • The feedback signals used in proxy rewards often arrive with a delay (e.g. a member may take weeks to finish a show).
    • Waiting for all delayed feedback before training makes the bandit policy stale and ineffective for newly added items.
    • Solution: Predict the missing delayed feedback from the feedback observed so far and other relevant information (sketched after this list).
    • This requires two types of ML models:
      • Delayed feedback prediction models - predict p(final feedback | observed feedback); their outputs are used to compute proxy rewards when training the bandit policy
      • Bandit policy models - used online to generate recommendations
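    One way to read this split: when the delayed signal has not yet arrived, the training reward can be computed from the prediction model's estimate instead of waiting for the true value. The sketch below is a hypothetical illustration; the signal names and weights are assumptions, not Netflix's actual formulation.

    ```python
    from typing import Optional


    def expected_proxy_reward(clicked: bool,
                              completed_show: Optional[bool],
                              p_complete: float) -> float:
        """Training-time proxy reward that imputes missing delayed feedback.

        completed_show is None while the delayed signal is still unobserved;
        p_complete is the delayed-feedback model's estimate of
        p(show completed | feedback observed so far).
        """
        click_term = 0.1 if clicked else 0.0
        if completed_show is not None:   # delayed feedback already observed
            completion_term = 1.0 if completed_show else 0.0
        else:                            # impute with the prediction model's output
            completion_term = p_complete
        return click_term + completion_term
    ```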

    Challenge: Online-Offline Metric Disparity

    • Better offline metrics don't always translate to improved online metrics (long-term satisfaction).
    • This gap arises when the proxy reward does not fully align with long-term satisfaction.
    • Solution: Refine the proxy reward definition so that offline model improvements also translate into online gains.

    Summary and Open Questions

    • Netflix focuses on defining proxy rewards that align with long-term member satisfaction and on handling delayed feedback.
    • Open questions:
      • Can proxy rewards be learned automatically by correlating behavior with retention?
      • How long to wait for delayed feedback before using predicted values?
      • Can reinforcement learning further align recommendations with long-term satisfaction?
