
    MinerU: An Open Source PDF Extraction Tool

    MinerU is an open source tool that converts PDFs into machine-readable formats such as markdown and JSON, so that their content can be extracted and reused.

    MinerU originated during the pre-training phase of InternLM, a large language model project. It focuses on accurately converting the symbols found in scientific literature and is intended to support research and development in the era of large models.

    Key Features of MinerU

    MinerU's main features are:

    • Removes extraneous elements like headers, footers, footnotes, and page numbers, ensuring semantic continuity while preserving the document's core content.
    • Preserves the original structure of the document, including titles, paragraphs, and lists, maintaining the integrity of the information.
    • Extracts images, image captions, tables, and table captions, capturing all relevant visual and tabular data.
    • Automatically recognizes formulas in the document and converts them to LaTeX, simplifying the handling of complex mathematical expressions.
    • Automatically recognizes tables in the document and converts them to LaTeX, facilitating analysis and manipulation of tabular data.
    • Automatically detects and enables OCR for corrupted PDFs, ensuring data recovery from imperfect documents.
    • Supports both CPU and GPU environments, optimizing processing speed for different hardware configurations.
    • Supports Windows, Linux, and Mac platforms, offering wide accessibility for users across various operating systems.

    Quick Start: Experience the Power of MinerU

    Getting started with MinerU is straightforward, with several methods available to suit different needs. Refer to the FAQ for troubleshooting any installation issues and the Known Issues section if you encounter unexpected parsing results.

    Online Demo: A Hands-on Introduction

    Experience the capabilities of MinerU with the interactive online demo. It provides a quick and easy way to explore the tool's functionalities and gain an understanding of its potential.

    Quick CPU Demo: Step-by-Step Guide

    This guide demonstrates how to use MinerU on your local machine utilizing a CPU environment. It's a simple and efficient way to start extracting data from PDFs without requiring a GPU.

    1. Install magic-pdf

    The first step is to install the magic-pdf package, which is the core component of MinerU.

    conda create -n MinerU python=3.10
    conda activate MinerU
    pip install magic-pdf[full]==0.7.0b1 --extra-index-url https://wheels.myhloli.com
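
    To confirm that the install landed in the active environment, a minimal sketch like the following may help. It uses only the Python standard library; the helper name installed_version is illustrative and is not part of magic-pdf.

```python
from importlib import metadata


def installed_version(package: str = "magic-pdf"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("magic-pdf is not installed in this environment")
    else:
        print(f"magic-pdf {version} is installed")
```

    Run it inside the activated MinerU conda environment; a result of None usually means pip installed into a different interpreter.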

    2. Download Model Weight Files

    Model weight files are essential for MinerU's operation. Download these files from the repository's documentation.

    3. Copy and Configure the Template File

    A template configuration file (magic-pdf.template.json) is provided in the repository's root directory. Copy this file to your user directory for correct execution.

    • Execute the following command to copy the configuration file: cp magic-pdf.template.json ~/magic-pdf.json
    • Locate the magic-pdf.json file in your user directory and configure the "models-dir" path to point to the directory where the model weight files were downloaded.
    • Ensure the "models-dir" value is the absolute path to the model weight files directory.
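
    The copy-and-configure step can also be scripted. The sketch below assumes the template is a JSON object that accepts a "models-dir" key, as described above; the function name write_config and the paths in the usage comment are hypothetical.

```python
import json
import shutil
from pathlib import Path


def write_config(template: Path, models_dir: Path, target: Path) -> dict:
    """Copy the template to `target` and set "models-dir" to an absolute path."""
    shutil.copyfile(template, target)
    config = json.loads(target.read_text(encoding="utf-8"))
    # Expand "~" and resolve relative segments so the stored path is absolute.
    config["models-dir"] = str(models_dir.expanduser().resolve())
    target.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config


# Example call (placeholder paths, adjust to your checkout and download location):
# write_config(Path("magic-pdf.template.json"),
#              Path("~/models"),
#              Path.home() / "magic-pdf.json")
```

    All other keys in the template are written back unchanged, so only the model path differs from the copied file.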

    Command Line: Utilizing MinerU's Power

    Utilize the command line to extract data from PDFs using MinerU. This method offers flexibility and control over the extraction process.

    magic-pdf --help
    Usage: magic-pdf [OPTIONS]
    
    Options:
      -v, --version                display the version and exit
      -p, --path PATH              local pdf filepath or directory  [required]
      -o, --output-dir TEXT        output local directory
      -m, --method [ocr|txt|auto]  the method for parsing pdf.
                                   ocr: using ocr technique to extract information from pdf,
                                   txt: suitable for the text-based pdf only and outperform ocr,
                                   auto: automatically choose the best method for parsing pdf
                                      from ocr and txt.
                                   without method specified, auto will be used by default.
      --help                       Show this message and exit.
    
    # show version
    magic-pdf -v
    
    # command line example
    magic-pdf -p {some_pdf} -o {some_output_dir} -m auto

    • {some_pdf} is either a single PDF file or a directory containing multiple PDFs.
    • Results are saved in the {some_output_dir} directory.
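
    When the command is driven from a script, a thin wrapper can validate the method before launching the tool. This is a sketch that relies only on the -p, -o, and -m options shown in the help text above; build_command and run_magic_pdf are hypothetical helper names, not part of magic-pdf.

```python
import subprocess

# Parsing methods listed by `magic-pdf --help`.
METHODS = ("ocr", "txt", "auto")


def build_command(pdf_path, output_dir, method="auto"):
    """Build the magic-pdf argument list, rejecting unknown methods."""
    if method not in METHODS:
        raise ValueError(f"method must be one of {METHODS}, got {method!r}")
    return ["magic-pdf", "-p", str(pdf_path), "-o", str(output_dir), "-m", method]


def run_magic_pdf(pdf_path, output_dir, method="auto"):
    """Run magic-pdf; raises subprocess.CalledProcessError on a nonzero exit."""
    return subprocess.run(build_command(pdf_path, output_dir, method), check=True)
```

    For example, run_magic_pdf("paper.pdf", "out", method="ocr") forces OCR parsing for a hypothetical file paper.pdf. Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.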

    Known Issues and Limitations

    MinerU has the following known limitations, which are being worked on:

    • Reading order is segmented based on rules, which can cause disordered sequences in some cases.
    • Vertical text is not supported.
    • Lists, code blocks, and tables of contents are not yet supported in the layout model.
    • Comic books, art books, elementary school textbooks, and exercise books are not well-parsed yet.
    • Enabling OCR may produce better results in PDFs with a high density of formulas.

    Table Recognition

    Table recognition is currently under development. Recognition speed is slow, and accuracy requires improvement. The table below provides performance test results for reference.

    Table Size      Parsing Time
    6*5 55kb        37s
    16*12 284kb     3m18s
    44*7 559kb      4m12s
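
    To compare the three measurements, the parsing times can be converted to seconds and divided by the number of cells. The sketch below assumes that a size such as 6*5 gives the table's grid dimensions, which the source does not state explicitly; the helper names are illustrative.

```python
import re

# (table size, file size, parsing time) copied from the table above.
RESULTS = [
    ("6*5", "55kb", "37s"),
    ("16*12", "284kb", "3m18s"),
    ("44*7", "559kb", "4m12s"),
]


def to_seconds(text):
    """Convert a duration such as '37s' or '3m18s' to whole seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(?:(\d+)s)?", text)
    if not text or match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds or 0)


def cell_count(size):
    """Multiply the two dimensions of a size written as 'A*B' (assumed grid size)."""
    first, second = size.split("*")
    return int(first) * int(second)


for size, file_size, elapsed in RESULTS:
    secs = to_seconds(elapsed)
    cells = cell_count(size)
    print(f"{size:>6} {file_size:>6} {cells:>4} cells {secs:>4} s {secs / cells:.2f} s/cell")
```

    Under that reading, the three samples work out to roughly 1.23, 1.03, and 0.82 seconds per cell, so the cost per cell falls somewhat as tables grow while the total time still rises.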

    Contributing to MinerU

    MinerU is an open source project, and contributions from the community are highly valued. If you encounter any issues or have suggestions for improvements, please submit an issue on the GitHub repository.

    License Information

    MinerU is licensed under the Apache 2.0 License. This permissive license encourages the use, modification, and distribution of the project.

    Acknowledgments

    The development of MinerU was made possible through the collaboration and efforts of several individuals and organizations. We acknowledge and express our gratitude to all contributors, including the InternLM team.

    Citation

    To properly cite MinerU in your work, please use the following bibliographic information.

    @article{he2024opendatalab,
      title={Opendatalab: Empowering general artificial intelligence with open datasets},
      author={He, Conghui and Li, Wei and Jin, Zhenjiang and Xu, Chao and Wang, Bin and Lin, Dahua},
      journal={arXiv preprint arXiv:2407.13773},
      year={2024}
    }
    
    @misc{2024mineru,
        title={MinerU: A One-stop, Open-source, High-quality Data Extraction Tool},
        author={MinerU Contributors},
        howpublished = {\url{https://github.com/opendatalab/MinerU}},
        year={2024}
    }

    Related Projects

    Explore other open source projects developed by the team behind MinerU.

    • Magic-Doc: A high-speed extraction tool for ppt/pptx/doc/docx/pdf files.
    • Magic-HTML: A mixed web page extraction tool.
