MinerU is an open-source tool that converts PDFs into machine-readable formats such as Markdown and JSON, so that their content can be extracted and reused for other purposes.
MinerU originated during the pre-training phase of InternLM, a large language model project. It focuses on accurately converting the symbols found in scientific literature, which makes it useful for research and development on large models.
MinerU boasts a collection of powerful features designed to enhance the PDF extraction process and provide users with comprehensive data insights.
Getting started with MinerU is straightforward, with several methods available to suit different needs. Refer to the FAQ for troubleshooting any installation issues and the Known Issues section if you encounter unexpected parsing results.
Experience the capabilities of MinerU with the interactive online demo. It provides a quick and easy way to explore the tool's functionalities and gain an understanding of its potential.
This guide demonstrates how to use MinerU on your local machine with a CPU only, so you can start extracting data from PDFs without a GPU.
The first step is to install the magic-pdf package, which is the core component of MinerU.
    conda create -n MinerU python=3.10
    conda activate MinerU
    pip install magic-pdf[full]==0.7.0b1 --extra-index-url https://wheels.myhloli.com
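After installation it can be useful to confirm that the package is visible from the active environment. The snippet below is a minimal sketch that relies only on the Python standard library. The helper name installed_version is hypothetical and not part of MinerU.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "magic-pdf") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed.

    Hypothetical helper for illustration; it only queries package metadata.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("magic-pdf is not installed in this environment")
    else:
        print(f"magic-pdf {version} is installed")
```

If the script reports that the package is missing, the most likely cause is that the MinerU conda environment is not activated.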
Model weight files are essential for MinerU's operation. Download these files from the repository's documentation.
A template configuration file (magic-pdf.template.json) is provided in the repository's root directory. Copy it to your home directory as magic-pdf.json; MinerU needs this file to run correctly.
cp magic-pdf.template.json ~/magic-pdf.json
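The same copy step can be scripted, for example as part of a setup routine. The following is a minimal sketch, not part of MinerU: the function name install_config and its overwrite parameter are assumptions made for illustration. Unlike the plain cp command, it leaves an existing configuration untouched unless overwriting is requested.

```python
import shutil
from pathlib import Path
from typing import Optional


def install_config(
    template: Path, home: Optional[Path] = None, overwrite: bool = False
) -> Path:
    """Copy the template config to <home>/magic-pdf.json and return that path.

    Hypothetical helper: `home` defaults to the current user's home directory.
    """
    home = home if home is not None else Path.home()
    target = home / "magic-pdf.json"
    if not template.is_file():
        raise FileNotFoundError(f"template not found: {template}")
    if target.exists() and not overwrite:
        # Keep an existing, possibly customized, configuration.
        return target
    shutil.copyfile(template, target)
    return target
```

Run from the repository root, install_config(Path("magic-pdf.template.json")) would have the same effect as the cp command above on a fresh machine.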
Use the magic-pdf command line to extract data from PDFs. Its options control the input path, the output directory, and the parsing method.
    magic-pdf --help
    Usage: magic-pdf [OPTIONS]

    Options:
      -v, --version                display the version and exit
      -p, --path PATH              local pdf filepath or directory  [required]
      -o, --output-dir TEXT        output local directory
      -m, --method [ocr|txt|auto]  the method for parsing pdf.
                                   ocr: using ocr technique to extract information from pdf,
                                   txt: suitable for the text-based pdf only and outperform ocr,
                                   auto: automatically choose the best method for parsing pdf
                                      from ocr and txt.
                                   without method specified, auto will be used by default.
      --help                       Show this message and exit.

    # show version
    magic-pdf -v

    # command line example
    magic-pdf -p {some_pdf} -o {some_output_dir} -m auto
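When the CLI needs to be driven from a script, the command can be assembled and launched with Python's subprocess module. The sketch below uses only the -p, -o, and -m options listed in the help text. The function names build_command and convert_all are hypothetical, and treating a non-zero exit code as a failed conversion is an assumption about the CLI's behavior rather than something the documentation states.

```python
import subprocess
from pathlib import Path
from typing import List

# Parsing methods listed in the magic-pdf help text.
METHODS = ("ocr", "txt", "auto")


def build_command(pdf: Path, output_dir: Path, method: str = "auto") -> List[str]:
    """Build the argument list for one magic-pdf invocation."""
    if method not in METHODS:
        raise ValueError(f"method must be one of {METHODS}, got {method!r}")
    return ["magic-pdf", "-p", str(pdf), "-o", str(output_dir), "-m", method]


def convert_all(input_dir: Path, output_dir: Path, method: str = "auto") -> List[Path]:
    """Run magic-pdf once per PDF in input_dir and return the PDFs that failed.

    Assumption: a non-zero exit code signals a failed conversion.
    Raises FileNotFoundError if the magic-pdf executable is not on PATH.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(input_dir.glob("*.pdf")):
        result = subprocess.run(build_command(pdf, output_dir, method))
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

According to the help text, -p also accepts a directory, so a single invocation can process a whole folder. Running one process per file, as convert_all does, is a design choice that makes it easier to see which individual PDFs failed.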
While MinerU is a robust tool, certain limitations exist, and continuous development is ongoing to address these:
Table recognition is currently under development. Recognition speed is slow, and accuracy requires improvement. The table below provides performance test results for reference.
Table size | File size | Parsing time |
---|---|---|
6*5 | 55 KB | 37s |
16*12 | 284 KB | 3m18s |
44*7 | 559 KB | 4m12s |
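To compare these results across table sizes, the parsing times can be normalized by the number of cells. The sketch below assumes that a size such as 6*5 gives the table's two dimensions, so that their product is the cell count. The helper names are hypothetical, and three measurements are too few to support firm conclusions.

```python
import re

# (table dimensions, parsing time) copied from the table above.
RESULTS = [("6*5", "37s"), ("16*12", "3m18s"), ("44*7", "4m12s")]


def to_seconds(duration: str) -> int:
    """Convert a duration such as '3m18s' or '37s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+)s", duration)
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds)


def cell_count(dimensions: str) -> int:
    """Multiply the two dimensions of a size such as '16*12'."""
    first, second = dimensions.split("*")
    return int(first) * int(second)


for dims, duration in RESULTS:
    seconds = to_seconds(duration)
    cells = cell_count(dims)
    print(f"{dims}: {cells} cells, {seconds} s, {seconds / cells:.2f} s per cell")

# Prints:
# 6*5: 30 cells, 37 s, 1.23 s per cell
# 16*12: 192 cells, 198 s, 1.03 s per cell
# 44*7: 308 cells, 252 s, 0.82 s per cell
```

Under that assumption, the reported times work out to roughly one second per cell, with the larger tables slightly cheaper per cell than the smallest one.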
MinerU is an open source project, and contributions from the community are highly valued. If you encounter any issues or have suggestions for improvements, please submit an issue on the GitHub repository.
MinerU is licensed under the Apache 2.0 License. This permissive license encourages the use, modification, and distribution of the project.
The development of MinerU was made possible through the collaboration and efforts of several individuals and organizations. We acknowledge and express our gratitude to all contributors, including the InternLM team.
To properly cite MinerU in your work, please use the following bibliographic information.
    @article{he2024opendatalab,
      title={Opendatalab: Empowering general artificial intelligence with open datasets},
      author={He, Conghui and Li, Wei and Jin, Zhenjiang and Xu, Chao and Wang, Bin and Lin, Dahua},
      journal={arXiv preprint arXiv:2407.13773},
      year={2024}
    }

    @misc{2024mineru,
      title={MinerU: A One-stop, Open-source, High-quality Data Extraction Tool},
      author={MinerU Contributors},
      howpublished = {\url{https://github.com/opendatalab/MinerU}},
      year={2024}
    }
Track the growth and popularity of MinerU over time with the Star History chart.
Explore other open source projects developed by the team behind MinerU.