Summary of Video Annotator: building video classifiers using vision-language models and active learning

  • netflixtechblog.com

    The Challenge of High-Quality Video Annotations at Netflix

    At Netflix, we leverage machine learning to power various video-related features, such as search and discovery, personalization, and promotional assets. However, building robust machine learning models requires high-quality and consistent annotations. Traditional methods for training machine learning classifiers are resource-intensive and time-consuming, often involving domain experts, data scientists, and third-party annotators.

    • Annotating large datasets is a challenge, especially for subjective tasks.
    • Third-party annotators may lack a deep understanding of the model's intended deployment or usage, leading to inconsistent labeling.
    • This can result in model drift and a lengthy iteration cycle, affecting model performance and user trust.

    Introducing Netflix's Video Annotator (VA): A Human-in-the-Loop Solution

    To address these challenges, Netflix developed Video Annotator (VA), a novel framework that leverages active learning techniques and zero-shot capabilities of large vision-language models. VA empowers domain experts to directly participate in the annotation process, improving efficiency and reducing costs.

    • VA guides users to focus their efforts on progressively harder examples, maximizing the model's sample efficiency.
    • It seamlessly integrates model building into the annotation process, allowing users to validate the model before deployment.
    • VA supports a continuous annotation process, enabling users to rapidly deploy models, monitor their quality in production, and quickly fix edge cases.
    • This self-service architecture empowers domain experts to make improvements without the active involvement of data scientists.

    VA's Three-Step Process for Building Video Classifiers

    VA builds video classifiers through a three-step process that lets users annotate, manage, and iterate on video classification datasets:

    • Search: Users leverage text-to-video search powered by vision-language models to find an initial set of relevant video clips, bootstrapping the annotation process (a minimal sketch follows this list).
    • Active Learning: VA selects the examples whose annotation would be most informative, surfacing them through a set of annotation feeds (described in the next section) that enable efficient exploration of the data.
    • Review: The final step involves reviewing all annotated clips, allowing users to spot annotation mistakes and identify potential new areas for annotation.
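
    A minimal sketch of the Search step, assuming clip embeddings have already been computed offline with a vision-language model; the function name and parameters are illustrative, not VA's actual API:

    ```python
    import numpy as np

    def search_clips(query_embedding: np.ndarray,
                     clip_embeddings: np.ndarray,
                     top_k: int = 50) -> np.ndarray:
        """Rank video clips by cosine similarity to a text query embedding.

        clip_embeddings: (num_clips, dim) matrix of precomputed
        vision-language embeddings, one row per clip (hypothetical layout).
        """
        # Normalize both sides so a dot product equals cosine similarity.
        q = query_embedding / np.linalg.norm(query_embedding)
        clips = clip_embeddings / np.linalg.norm(clip_embeddings, axis=1,
                                                  keepdims=True)
        scores = clips @ q
        # Indices of the top_k most similar clips, best match first.
        return np.argsort(-scores)[:top_k]
    ```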

    Active Learning: Guided Annotation for Efficiency

    At the heart of VA lies active learning, a technique in which a machine learning model iteratively selects the most informative examples for annotation. By focusing annotation effort on challenging or uncertain instances, VA reduces the amount of manual labeling needed to reach a given level of quality, cutting annotation costs.

    • VA uses video embeddings generated from a vision-language model to represent video clips, enabling efficient scoring and selection.
    • The system presents users with various feeds: top-scoring positive and negative examples, borderline cases, and random selections.
    • This approach allows users to quickly identify patterns, biases, and edge cases in the training data, leading to higher-quality annotations.
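
    A sketch of how these feeds could be assembled, assuming the current classifier outputs a probability per unlabeled clip; build_feeds and its parameters are hypothetical, not VA's actual interface:

    ```python
    import numpy as np

    def build_feeds(scores: np.ndarray, feed_size: int = 20,
                    seed: int = 0) -> dict:
        """Split unlabeled clips into the four annotation feeds.

        scores: (num_clips,) predicted probabilities from the current
        classifier, one per unlabeled clip.
        """
        order = np.argsort(scores)
        rng = np.random.default_rng(seed)
        return {
            "top_positive": order[::-1][:feed_size],  # highest-scoring clips
            "top_negative": order[:feed_size],        # lowest-scoring clips
            # Clips the classifier is least certain about (scores near 0.5).
            "borderline": np.argsort(np.abs(scores - 0.5))[:feed_size],
            # A random sample guards against blind spots in the other feeds.
            "random": rng.choice(len(scores), size=feed_size, replace=False),
        }
    ```

    Annotating the borderline feed is what drives sample efficiency: each label there resolves a case the model could not decide on its own.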

    Video Understanding through Extensible Video Classifiers

    VA enables the creation of an extensible set of binary video classifiers, each focusing on a specific video understanding label. This approach allows for granular analysis of video content, capturing diverse aspects of the video, such as visuals, concepts, and events.

    • Each classifier is trained on a specific video understanding task, such as identifying establishing shots or detecting specific actions.
    • The combination of multiple classifiers provides a deeper understanding of the video content, capturing various levels of granularity.
    • VA's modular design allows for easy addition or improvement of individual models without impacting others, fostering flexibility and scalability.
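
    As a sketch of this modular design, one lightweight binary classifier can be trained per label on top of shared embeddings; the label names, data, and helper below are hypothetical stand-ins (the post does not specify the classifier type, so logistic regression is assumed):

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_label_classifier(embeddings: np.ndarray,
                               labels: np.ndarray) -> LogisticRegression:
        """Train one binary classifier for a single video understanding label."""
        return LogisticRegression(max_iter=1000).fit(embeddings, labels)

    # Toy stand-in for per-label annotated data: each label maps to
    # (clip embeddings, binary annotations). The label names are hypothetical.
    rng = np.random.default_rng(0)
    annotations = {
        label: (rng.normal(size=(100, 512)), rng.integers(0, 2, size=100))
        for label in ["establishing_shot", "action_scene"]
    }

    # One independent classifier per label: adding a new label or retraining
    # an existing one leaves every other classifier untouched.
    classifiers = {
        label: train_label_classifier(X, y)
        for label, (X, y) in annotations.items()
    }
    ```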

    Experimentation and Results

    Netflix conducted experiments to evaluate VA's performance, comparing it to baseline methods. VA consistently outperformed the baselines, demonstrating its ability to achieve higher-quality video classifiers with fewer annotations. The results showed that VA effectively guides users to label the most informative examples, leading to significant improvements in model performance.

    • Three video experts annotated a diverse set of 56 labels across a video corpus of 500,000 shots.
    • VA achieved a median 8.3-point improvement in Average Precision over the most competitive baseline.
    • VA's active learning strategy significantly reduced the number of annotations required for achieving high-quality video classifiers.
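
    For reference, Average Precision summarizes a classifier's precision-recall curve in a single number. A sketch of how a per-label classifier could be scored, using scikit-learn on illustrative values:

    ```python
    from sklearn.metrics import average_precision_score

    # Held-out annotations and classifier scores (illustrative values only).
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_score = [0.92, 0.40, 0.77, 0.65, 0.30, 0.55, 0.81, 0.10]

    # Average Precision is the weighted mean of precision at each
    # recall threshold along the precision-recall curve.
    print(f"AP = {average_precision_score(y_true, y_score):.3f}")
    ```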

    Building Trust through User Involvement

    VA empowers domain experts to be directly involved in the model building process, fostering a sense of ownership and trust. This active involvement leads to more accurate annotations and a deeper understanding of the model's capabilities and limitations.

    • Domain experts gain insights into the challenges and intricacies of video classification, contributing to the development of more reliable models.
    • Users can directly validate the model before deployment, ensuring its accuracy and alignment with their needs.
    • VA's user-friendly interface makes the annotation process accessible to a wider range of users, fostering collaboration and knowledge sharing.

    Conclusion: Netflix's Video Annotator

    Netflix's Video Annotator (VA) addresses the challenges of traditional annotation methods, offering a human-in-the-loop solution that leverages active learning and zero-shot capabilities of large vision-language models. VA significantly enhances sample efficiency, reduces costs, and empowers domain experts to build robust and reliable video classifiers for various video understanding tasks.

    • VA's intuitive interface and active learning strategy make it easy to use, enabling users to rapidly deploy models and iteratively improve them.
    • The framework fosters collaboration between domain experts and data scientists, building trust and ownership in the model building process.
    • Netflix has publicly released a dataset and code for VA, allowing other researchers and developers to explore its potential and contribute to the field of video understanding.
