Summary of The Making of VES: the Cosmos Microservice for Netflix Video Encoding

  • netflixtechblog.com

    Netflix’s Video Encoding Service: A Microservice Journey

    This post examines video encoding at Netflix, focusing on the design and implementation of the Video Encoding Service (VES), which is built as a microservice. It is the second installment in a multi-part series on Netflix’s effort to rebuild its entire video processing pipeline with microservices. The first post gave an overview of the approach; this one goes deeper into the design of VES and the lessons learned building it, showing how microservices work in an environment as large and complex as Netflix’s streaming platform.

    The Foundation: Cosmos Platform and Microservices

    Cosmos, Netflix’s next-generation media computing platform, is central to this modernization effort. It combines a microservice architecture with asynchronous workflows and serverless functions. The goal is to improve flexibility, efficiency, and developer productivity while keeping pace with the growing demands of Netflix’s global streaming service.

    • Cosmos provides a service generator that scaffolds a minimal but complete service, including code repositories for all three layers (API, workflow, and serverless compute), so new services can be prototyped and deployed quickly.
    • The layers communicate asynchronously through messages carried by Timestone, Netflix’s priority-based queuing system.
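    To make the asynchronous, priority-based hand-off between layers concrete, here is a minimal Python sketch. The in-memory `PriorityMessageQueue` class and its `publish`/`consume` methods are hypothetical stand-ins, not Timestone’s real API, and the convention that a lower number means higher priority is an assumption.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(order=True)
class _Entry:
    priority: int                      # lower value = delivered first (assumption)
    seq: int                           # insertion counter, breaks ties FIFO
    payload: Any = field(compare=False)


class PriorityMessageQueue:
    """Toy in-memory priority queue standing in for Timestone (hypothetical API)."""

    def __init__(self) -> None:
        self._heap: list[_Entry] = []
        self._seq = itertools.count()

    def publish(self, payload: Any, priority: int = 10) -> None:
        heapq.heappush(self._heap, _Entry(priority, next(self._seq), payload))

    def consume(self) -> Optional[Any]:
        # Returns None when no messages are waiting.
        if not self._heap:
            return None
        return heapq.heappop(self._heap).payload


# The API layer enqueues work; the workflow layer drains it asynchronously.
api_to_workflow = PriorityMessageQueue()
api_to_workflow.publish({"job": "catalog-backfill"}, priority=50)
api_to_workflow.publish({"job": "new-release"}, priority=1)
api_to_workflow.publish({"job": "re-encode"}, priority=50)

order = []
while (msg := api_to_workflow.consume()) is not None:
    order.append(msg["job"])
print(order)  # ['new-release', 'catalog-backfill', 're-encode']
```

    Breaking ties by insertion order means urgent work jumps ahead while equal-priority jobs are still served in arrival order.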

    VES Architecture: Three Pillars

    VES is composed of three layers, each with a distinct role in the video processing pipeline: Optimus (API), Plato (workflow), and Stratum (serverless compute).

    Optimus: The API Gateway

    Optimus is the API layer: the single, well-defined interface through which external clients interact with VES. Because clients depend only on this interface, they are shielded from internal changes, and VES internals can be developed and iterated on independently and quickly.

    • VES exposes an encodeVideo endpoint that accepts an EncodeRequest containing source video information and encoding recipe details. The response is sent asynchronously through Timestone messages.
    • The EncodeRequest includes parameters for codec format, resolution, chunking directives, and other encoding specifications.
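    The request/response shape described above might be modeled as follows. This is a sketch under stated assumptions: the field names (`source_uri`, `codec`, `width`, `height`, `chunk_seconds`), the supported-codec set, and the `encode_video` function are hypothetical, chosen only to mirror the parameters listed in the source. The key point is that the call validates and enqueues the job, then returns an ID at once; the encoding result would arrive later as an asynchronous message.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

# Hypothetical data model: field names are illustrative, not the real VES schema.


@dataclass(frozen=True)
class EncodeRecipe:
    codec: str                             # codec format, e.g. "av1"
    width: int                             # target resolution
    height: int
    chunk_seconds: Optional[float] = None  # chunking directive; None = encode whole


@dataclass(frozen=True)
class EncodeRequest:
    source_uri: str                        # where the source video lives
    recipe: EncodeRecipe


SUPPORTED_CODECS = {"h264", "hevc", "vp9", "av1"}  # assumed set, for illustration


def encode_video(request: EncodeRequest, work_queue: list) -> str:
    """Validate and enqueue an encode job, returning its ID immediately.

    The encode result is not returned here; in the design described above it
    would arrive later as an asynchronous message.
    """
    recipe = request.recipe
    if recipe.codec not in SUPPORTED_CODECS:
        raise ValueError(f"unsupported codec: {recipe.codec}")
    if recipe.width <= 0 or recipe.height <= 0:
        raise ValueError("resolution must be positive")
    if recipe.chunk_seconds is not None and recipe.chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    job_id = str(uuid.uuid4())
    work_queue.append((job_id, request))
    return job_id


queue: list = []
request = EncodeRequest(
    source_uri="example://titles/123/master.mov",  # placeholder URI
    recipe=EncodeRecipe(codec="av1", width=1920, height=1080, chunk_seconds=30.0),
)
job_id = encode_video(request, queue)
print(job_id, len(queue))
```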

    Plato: The Workflow Orchestrator

    Plato is the workflow layer. It models each encoding job as a Directed Acyclic Graph (DAG), which allows video chunks to be encoded in parallel to meet the latency and resilience requirements of Netflix streaming.

    • The DAG is composed of five nodes: Splitter, Encoder, Assembler, Validator, and Notifier.
    • The Splitter divides the video into chunks, while the Encoder nodes (Map nodes) process these chunks in parallel.
    • The Assembler node (Reduce node) stitches the encoded chunks together, followed by validation and notification.
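    The split, map, reduce flow of the DAG can be sketched in a few functions. This toy version is an assumption-laden illustration, not Plato’s implementation: a list of frame labels stands in for video, `encoder` is a placeholder transform, and a thread pool plays the role of the parallel Encoder (Map) nodes. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def splitter(frames: list, chunk_size: int) -> list:
    """Split the source into fixed-size chunks (frame labels stand in for video)."""
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]


def encoder(chunk: list) -> list:
    """Map step: 'encode' one chunk independently (placeholder transform)."""
    return [f"enc({f})" for f in chunk]


def assembler(encoded_chunks: list) -> list:
    """Reduce step: stitch encoded chunks back together in original order."""
    return [f for chunk in encoded_chunks for f in chunk]


def validator(source: list, output: list) -> bool:
    """Toy check: every source frame produced exactly one output frame."""
    return len(source) == len(output)


def notifier(ok: bool) -> str:
    """Report the final job status (here, just return a status string)."""
    return "ENCODE_SUCCEEDED" if ok else "ENCODE_FAILED"


def run_encode_dag(frames: list, chunk_size: int = 3) -> tuple:
    chunks = splitter(frames, chunk_size)
    with ThreadPoolExecutor(max_workers=4) as pool:
        encoded = list(pool.map(encoder, chunks))  # map() preserves chunk order
    output = assembler(encoded)
    status = notifier(validator(frames, output))
    return output, status


out, status = run_encode_dag(["f0", "f1", "f2", "f3", "f4"], chunk_size=2)
print(status)  # ENCODE_SUCCEEDED
print(out)     # ['enc(f0)', 'enc(f1)', 'enc(f2)', 'enc(f3)', 'enc(f4)']
```

    Because each chunk is encoded independently, a failed chunk could in principle be retried on its own rather than restarting the whole title, which is one reason the chunked DAG helps with resilience as well as latency.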

    Stratum: The Serverless Computing Layer

    Stratum is the serverless compute layer where the media processing actually takes place. Each encoding function runs in a Docker container. These containers are managed by Titus, Netflix’s container management platform, and instances are scaled automatically based on job queue depth to keep resource utilization high.
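    Scaling on queue depth often comes down to a simple calculation: the number of instances needed to drain the backlog, clamped to configured bounds. The sketch below shows one such policy; the `desired_instances` function, the `jobs_per_instance` parameter, the bounds, and the scale-to-zero default are hypothetical, and the source does not describe Titus’s actual scaling algorithm.

```python
import math


def desired_instances(queue_depth: int, jobs_per_instance: int,
                      min_instances: int = 0, max_instances: int = 100) -> int:
    """Size a worker pool from its backlog (hypothetical policy, not Titus's)."""
    if jobs_per_instance <= 0:
        raise ValueError("jobs_per_instance must be positive")
    needed = math.ceil(queue_depth / jobs_per_instance)
    # Clamp to the configured floor and ceiling.
    return max(min_instances, min(max_instances, needed))


print(desired_instances(0, 4))                       # 0: idle queue scales to zero
print(desired_instances(10, 4))                      # 3
print(desired_instances(1000, 4, max_instances=50))  # 50: capped
```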

    • Multiple Stratum functions are created for different codec formats, allowing independent updates and upgrades without impacting other codecs.
    • The Cosmos platform abstracts away the complexities of media access and persistence, simplifying development and deployment.
    • VES uses container shaping: each encoding recipe gets a resource allocation sized to its requirements, so jobs can be “bin packed” onto instances to maximize efficiency.
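    Container shaping pays off when shaped jobs are packed tightly onto hosts. The sketch below assigns each recipe a hypothetical CPU/memory shape and then places jobs with first-fit decreasing, a classic bin-packing heuristic. The source does not say which packing algorithm VES or Titus actually uses, and the recipe names and shape values are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    cpus: int
    memory_gb: int


# Hypothetical shapes keyed by recipe; these are not published VES values.
SHAPES = {
    "sd_h264": Shape(cpus=2, memory_gb=4),
    "hd_hevc": Shape(cpus=4, memory_gb=8),
    "uhd_av1": Shape(cpus=8, memory_gb=16),
}


def pack(jobs: list, host: Shape) -> list:
    """First-fit decreasing bin packing of job shapes onto identical hosts.

    Returns one list of job names per host that was needed.
    """
    hosts = []  # each entry: [remaining_cpus, remaining_memory_gb, job_names]
    ordered = sorted(jobs, key=lambda j: (SHAPES[j].cpus, SHAPES[j].memory_gb),
                     reverse=True)  # largest jobs first
    for job in ordered:
        s = SHAPES[job]
        if s.cpus > host.cpus or s.memory_gb > host.memory_gb:
            raise ValueError(f"{job} does not fit on any host")
        for h in hosts:  # first host with enough room wins
            if h[0] >= s.cpus and h[1] >= s.memory_gb:
                h[0] -= s.cpus
                h[1] -= s.memory_gb
                h[2].append(job)
                break
        else:  # no existing host had room: open a new one
            hosts.append([host.cpus - s.cpus, host.memory_gb - s.memory_gb, [job]])
    return [h[2] for h in hosts]


jobs = ["sd_h264", "uhd_av1", "hd_hevc", "sd_h264", "hd_hevc"]
result = pack(jobs, Shape(cpus=16, memory_gb=32))
print(result)  # [['uhd_av1', 'hd_hevc', 'hd_hevc'], ['sd_h264', 'sd_h264']]
```

    Here five jobs needing 20 CPUs in total fit on two 16-CPU hosts: the three largest jobs fill the first host exactly, and the two small ones share the second.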

    Continuous Release: Embracing Iteration

    VES was designed for continuous release, so features can ship and iterate rapidly while the service stays stable and performant. This relies on the independence that microservices provide and on a testing framework that thoroughly tests every code change before it is deployed.

    • The release pipeline is fully automated, taking around 30 minutes from code merge to deployment.
    • The system relies on metrics and logs for monitoring, alerting, and automatic rollback, ensuring high availability and resilience.
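    Metric-driven automatic rollback can be reduced to comparing a new release against the current baseline. The sketch below is a hypothetical policy, not the VES pipeline’s: `ReleaseMetrics`, `should_roll_back`, and both thresholds are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseMetrics:
    requests: int
    errors: int
    p99_latency_s: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def should_roll_back(baseline: ReleaseMetrics, candidate: ReleaseMetrics,
                     max_error_increase: float = 0.01,
                     max_latency_ratio: float = 1.2) -> bool:
    """Return True if the candidate release looks worse than the baseline.

    Thresholds are illustrative defaults, not values from the VES pipeline.
    """
    if candidate.error_rate - baseline.error_rate > max_error_increase:
        return True
    if candidate.p99_latency_s > baseline.p99_latency_s * max_latency_ratio:
        return True
    return False


baseline = ReleaseMetrics(requests=10_000, errors=20, p99_latency_s=40.0)
healthy = ReleaseMetrics(requests=1_000, errors=3, p99_latency_s=42.0)
broken = ReleaseMetrics(requests=1_000, errors=50, p99_latency_s=41.0)
print(should_roll_back(baseline, healthy))  # False
print(should_roll_back(baseline, broken))   # True
```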

    Lessons Learned from Building VES

    Building VES yielded several lessons about designing and implementing microservices in a large-scale production environment like Netflix’s:

    Define a Proper Service Scope

    The initial approach of creating a separate encoding service for each codec format proved unsustainable because of the development overhead. The team consolidated all encodings into a single service, which reduced code duplication while still letting each codec evolve independently.

    Be Pragmatic about Data Modeling

    The initial focus on strictly separating data models across services proved overly complex, requiring numerous conversions between near-identical models. Putting common data models in a shared library used across services, and defining service-specific models alongside it, struck a more practical balance.
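    The compromise can be illustrated with plain dataclasses: a shared type lives in a common library, and each service embeds it in its own models instead of converting between copies. All type names here (`VideoSource`, `EncodeJob`, `InspectionJob`) are hypothetical.

```python
from dataclasses import dataclass


# --- shared library: a model every media service agrees on (hypothetical) ---
@dataclass(frozen=True)
class VideoSource:
    uri: str
    duration_s: float


# --- VES-specific model: embeds the shared type instead of copying it ---
@dataclass(frozen=True)
class EncodeJob:
    source: VideoSource  # shared model used directly, no conversion layer
    codec: str
    target_height: int


# --- another (hypothetical) service's model embeds the same shared type ---
@dataclass(frozen=True)
class InspectionJob:
    source: VideoSource
    checks: tuple[str, ...]


src = VideoSource(uri="example://title/master", duration_s=5400.0)
encode_job = EncodeJob(source=src, codec="av1", target_height=2160)
inspect_job = InspectionJob(source=src, checks=("black_frames",))
print(encode_job.source is inspect_job.source)  # True: no conversion needed
```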

    Embrace Service API Changes

    While maintaining API stability is crucial, it’s equally important to be able to evolve APIs as needs change. Collaboration and communication between teams are essential for smooth API transitions.
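    A common way to evolve an API without breaking existing callers is to make additive changes only: new fields are optional and default to the old behavior, and an adapter upgrades old requests during the transition. The sketch below uses hypothetical `EncodeRequestV1`/`EncodeRequestV2` types and an `upgrade` helper; it is one conventional technique, not a description of how VES versions its API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EncodeRequestV1:
    source_uri: str
    codec: str


@dataclass(frozen=True)
class EncodeRequestV2:
    source_uri: str
    codec: str
    hdr: bool = False                      # new field; default keeps old behavior
    chunk_seconds: Optional[float] = None  # new field; None = service default


def upgrade(req: EncodeRequestV1) -> EncodeRequestV2:
    """Adapter so callers still on V1 keep working during the transition."""
    return EncodeRequestV2(source_uri=req.source_uri, codec=req.codec)


old = EncodeRequestV1("example://title/master", "hevc")
new = upgrade(old)
print(new)
```

    Because every new field has a default, the V2 service can accept upgraded V1 requests unchanged while teams migrate their clients on their own schedule.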

    Conclusion: Continuous Innovation with Microservices

    Netflix’s Video Encoding Service shows how microservices can handle complex video processing at scale. By combining continuous releases with the flexibility of the Cosmos platform, Netflix has built a system that can adapt to evolving streaming demands and keep delivering high-quality video to its global audience.
