MinerU: Open-Source PDF Data Extraction Tool

MinerU is an open source tool that converts PDFs into machine-readable formats such as Markdown and JSON, so that their content can be extracted and reused in downstream processing.

MinerU originated during the pre-training phase of InternLM, a large language model project. It focuses on the problem of accurately converting the symbols found in scientific literature, and it is intended to support research and development in the era of large models.

Key Features of MinerU

MinerU provides the following features for PDF extraction:

Removes extraneous elements like headers, footers, footnotes, and page numbers, ensuring semantic continuity while preserving the document's core content.
Preserves the original structure of the document, including titles, paragraphs, and lists, maintaining the integrity of the information.
Extracts images, image captions, tables, and table captions, capturing all relevant visual and tabular data.
Automatically recognizes formulas in the document and converts them to LaTeX, simplifying the handling of complex mathematical expressions.
Automatically recognizes tables in the document and converts them to LaTeX, facilitating analysis and manipulation of tabular data.
Automatically detects corrupted PDFs and enables OCR for them, so that content can still be recovered from imperfect documents.
Supports both CPU and GPU environments, optimizing processing speed for different hardware configurations.
Supports Windows, Linux, and Mac platforms, offering wide accessibility for users across various operating systems.

Quick Start: Experience the Power of MinerU

Getting started with MinerU is straightforward, with several methods available to suit different needs. Refer to the FAQ for troubleshooting any installation issues and the Known Issues section if you encounter unexpected parsing results.

Online Demo: A Hands-on Introduction

Experience the capabilities of MinerU with the interactive online demo. It provides a quick and easy way to explore the tool's functionalities and gain an understanding of its potential.

Click here to access the online demo: Online Demo

Quick CPU Demo: Step-by-Step Guide

This guide shows how to run MinerU on your local machine in a CPU environment, so you can start extracting data from PDFs without a GPU.

1. Install magic-pdf

The first step is to install the magic-pdf package, which is the core component of MinerU.

conda create -n MinerU python=3.10
conda activate MinerU
pip install "magic-pdf[full]==0.7.0b1" --extra-index-url https://wheels.myhloli.com
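
After installation, one way to confirm that the package is visible in the active environment is to query its installed version from Python. The following is a minimal sketch using only the standard library. The helper name installed_version is hypothetical, and the sketch assumes the distribution is registered under the name magic-pdf, as in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "magic-pdf" is assumed to be the distribution name used by pip above.
    found = installed_version("magic-pdf")
    if found is None:
        print("magic-pdf is not installed in this environment")
    else:
        print(f"magic-pdf {found} is installed")
```

The magic-pdf -v command described later in this guide reports the same information from the shell.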

2. Download Model Weight Files

MinerU requires model weight files to run. Download them by following the instructions in the repository's documentation.

Click here for detailed instructions on downloading model files: How to Download Model Files

3. Copy and Configure the Template File

A template configuration file (magic-pdf.template.json) is provided in the repository's root directory. Copy this file to your user directory for correct execution.

Execute the following command to copy the configuration file: cp magic-pdf.template.json ~/magic-pdf.json
Open the magic-pdf.json file in your user directory and set "models-dir" to the absolute path of the directory where the model weight files were downloaded.
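
The copy-and-configure step can also be scripted. The sketch below is illustrative only: the function name write_config is hypothetical, and it assumes the template is valid JSON whose "models-dir" key holds the weights directory, as described above. All other keys in the template are left unchanged.

```python
import json
from pathlib import Path


def write_config(template_path, config_path, models_dir):
    """Copy the template config and set "models-dir" to an absolute path."""
    config = json.loads(Path(template_path).read_text(encoding="utf-8"))
    # Expand "~" and resolve to an absolute path, as the step above requires.
    config["models-dir"] = str(Path(models_dir).expanduser().resolve())
    Path(config_path).write_text(json.dumps(config, indent=4), encoding="utf-8")
    return config


# Example call (the weights directory here is a hypothetical placeholder):
# write_config("magic-pdf.template.json", Path.home() / "magic-pdf.json", "~/models")
```

Run the example call from the repository's root directory, where magic-pdf.template.json is located, and replace the placeholder with the directory that actually holds the downloaded weights.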

Command Line: Utilizing MinerU's Power

MinerU can be run from the command line, which gives you control over the input path, the output directory, and the parsing method.

magic-pdf --help
Usage: magic-pdf [OPTIONS]

Options:
  -v, --version                display the version and exit
  -p, --path PATH              local pdf filepath or directory  [required]
  -o, --output-dir TEXT        output local directory
  -m, --method [ocr|txt|auto]  the method for parsing pdf.
                               ocr: using ocr technique to extract information from pdf,
                               txt: suitable for the text-based pdf only and outperform ocr,
                               auto: automatically choose the best method for parsing pdf
                                  from ocr and txt.
                               without method specified, auto will be used by default.
  --help                       Show this message and exit.

# show version
magic-pdf -v

# command line example
magic-pdf -p {some_pdf} -o {some_output_dir} -m auto

{some_pdf} represents either a single PDF file or a directory containing multiple PDFs.
The results will be saved in the {some_output_dir} directory.
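
When the tool is called from a script, the same command can be assembled and run with Python's subprocess module. This is a minimal sketch, not part of MinerU itself: the helper names build_command and run_magic_pdf are hypothetical, the paths are placeholders, and only the -p, -o, and -m options shown in the help text above are used.

```python
import subprocess

# Parsing methods listed in the help text above.
VALID_METHODS = ("ocr", "txt", "auto")


def build_command(pdf_path, output_dir, method="auto"):
    """Build the magic-pdf argument list from the documented options."""
    if method not in VALID_METHODS:
        raise ValueError(f"method must be one of {VALID_METHODS}, got {method!r}")
    return ["magic-pdf", "-p", str(pdf_path), "-o", str(output_dir), "-m", method]


def run_magic_pdf(pdf_path, output_dir, method="auto"):
    """Run magic-pdf and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(pdf_path, output_dir, method), check=True)


if __name__ == "__main__":
    # Hypothetical paths, used only to show the resulting command.
    print(build_command("papers/", "output/", method="ocr"))
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.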

Known Issues and Limitations

MinerU currently has the following limitations, which ongoing development aims to address:

Reading order is segmented based on rules, which can cause disordered sequences in some cases.
Vertical text is not supported.
Lists, code blocks, and tables of contents are not yet supported in the layout model.
Comic books, art books, elementary school textbooks, and exercise books are not well-parsed yet.
Enabling OCR may produce better results in PDFs with a high density of formulas.

Table Recognition

Table recognition is currently under development. Recognition speed is slow, and accuracy requires improvement. The table below provides performance test results for reference.

Table Size (dimensions, file size)	Parsing Time
6*5, 55 KB	37s
16*12, 284 KB	3m18s
44*7, 559 KB	4m12s
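
To compare these measurements, it can help to normalize parsing time by table size. The sketch below is only illustrative arithmetic over the three rows above. It assumes that a label such as 6*5 gives the two table dimensions, so that their product is the cell count. The helper names parse_duration and seconds_per_cell are hypothetical.

```python
import re


def parse_duration(text):
    """Convert a duration such as "3m18s" or "37s" to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(?:(\d+)s)?", text)
    if not text or not match:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return minutes * 60 + seconds


def seconds_per_cell(size, duration):
    """Divide parsing time by the cell count (assumed: product of dimensions)."""
    first, second = (int(n) for n in size.split("*"))
    return parse_duration(duration) / (first * second)


# (table dimensions, parsing time) copied from the table above.
RESULTS = [("6*5", "37s"), ("16*12", "3m18s"), ("44*7", "4m12s")]

if __name__ == "__main__":
    for size, duration in RESULTS:
        print(f"{size}: {seconds_per_cell(size, duration):.2f} s per cell")
```

Under that assumption, all three measurements work out to roughly one second per cell.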

Contributing to MinerU

MinerU is an open source project, and contributions from the community are highly valued. If you encounter any issues or have suggestions for improvements, please submit an issue on the GitHub repository.

Report issues and contribute: Issue Tracker

License Information

MinerU is licensed under the Apache 2.0 License. This permissive license encourages the use, modification, and distribution of the project.

Acknowledgments

The development of MinerU was made possible through the collaboration and efforts of several individuals and organizations. We acknowledge and express our gratitude to all contributors, including the InternLM team.

Citation

To properly cite MinerU in your work, please use the following bibliographic information.

@article{he2024opendatalab,
  title={Opendatalab: Empowering general artificial intelligence with open datasets},
  author={He, Conghui and Li, Wei and Jin, Zhenjiang and Xu, Chao and Wang, Bin and Lin, Dahua},
  journal={arXiv preprint arXiv:2407.13773},
  year={2024}
}

@misc{2024mineru,
    title={MinerU: A One-stop, Open-source, High-quality Data Extraction Tool},
    author={MinerU Contributors},
    howpublished = {\url{https://github.com/opendatalab/MinerU}},
    year={2024}
}

Related Projects

Explore other open source projects developed by the team behind MinerU.

Magic-Doc: A high-speed extraction tool for ppt/pptx/doc/docx/pdf files.
Magic-HTML: A mixed web page extraction tool.
