Summary of Structure your Machine Learning project source code like a pro

  • santiagof.medium.com
  • Article
  • Summarized Content

    Introduction to Project Structure for Python with Jupyter Notebooks

    As Machine Learning (ML) solutions become more mission-critical, the need for robust and high-quality code increases. This article discusses how to structure a Python project with Jupyter notebooks, following software engineering best practices for MLOps (Machine Learning Operations).

    • The programming language (Python) and the ML platform being used influence the project structure.
    • This guide leverages MLFlow, an open-source standard that works well with many platforms.

    Managing Data in a Python Project with Jupyter Notebooks

    Placing data files in a Git repository is not recommended for several reasons, including Git's line-by-line comparison, handling large data files inefficiently, security constraints, and data platform integration.

    • Store a sample of the data or its schema in the repository for debugging and testing purposes.
    • Use tools like Great Expectations to manage data expectations and characteristics.
    • Store data expectations and sample data in a dedicated folder, e.g., `data/great_expectations` and `data/sample`.

    Structuring Source Code for Python with Jupyter Notebooks

    The best practice is to place the source code in a folder called `src`. Jupyter notebooks can be stored in a separate `notebooks` folder within `src`.

    • Use absolute imports for better organization and maintainability.
    • Add an `__init__.py` file to treat folders as Python modules or packages.
    • Follow the "one model = one package" approach for better separation of concerns.

    Calling Code from Python Packages with Jupyter Notebooks

    When running code from a Python package (as opposed to a standalone file), import the required modules from the package. Use a separate Python file or Jupyter notebook as an orchestrator to call the package's code.

    • Use the `jobtools` library for automatic argument parsing from the command line.
    • Pass configuration parameters as function arguments or load them from a file (e.g., YAML).
    • Use the `SimpleNamespace` object to access configuration parameters with dot notation.

    Structuring Unit Tests for Python with Jupyter Notebooks

    Unit tests are essential for verifying the correctness of self-contained code units. Follow the convention of placing tests in a `tests` folder, either at the root level or within the `src` folder.

    • Write unit tests to cover different scenarios and edge cases.
    • Separate the test code from the production code for better organization.

    Additional Folders and Miscellaneous

    Depending on the project's requirements, additional folders may be included in the repository:

    • `.cloud`: Contains infrastructure as code (IaC) resources for cloud deployment.
    • `.github` or `.azure-pipelines`: Contains CI/CD pipelines for MLOps and DevOps automation.
    • `.aml`: Includes resources for experimentation and training on Azure Machine Learning (opinionated).
    • `docs`: Contains project documentation.

    Conclusion

    Structuring a Python project with Jupyter notebooks and MLOps best practices is crucial for maintainability, scalability, and collaboration. This guide covers best practices for managing data, source code organization, argument parsing, configuration files, unit testing, and additional folders for cloud resources and documentation.

    Discover content by category

    Ask anything...

    Sign Up Free to ask questions about anything you want to learn.