Marker: PDF to Markdown Converter

github.com

Marker: A High-Speed PDF Converter

Marker converts PDF files to markdown, JSON, and HTML quickly and accurately, using a pipeline of deep learning models.

Supports a wide variety of PDF documents.
Handles documents in all languages.
Removes headers, footers, and other artifacts from PDFs.

How Marker Processes PDF Files

Conversion begins with text extraction, using OCR (Optical Character Recognition) from the Surya library where needed. Surya also analyzes the page layout to determine the reading order. The extracted blocks are then cleaned and formatted with the Texify and Tabled libraries, and finally combined and post-processed into the output.

Text extraction with OCR (Surya).
Page layout detection (Surya).
Block cleaning and formatting (Texify, Tabled).
Final text combination and post-processing.
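
The stages above can be pictured as a chain of functions that each take and return a list of blocks. The sketch below is illustrative only: the `Block` class, its fields, and the stage functions are hypothetical stand-ins, not Marker's schema or code, and the deep learning steps (OCR, layout detection) are replaced by trivial placeholders.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass
class Block:
    """Hypothetical unit of page content (not Marker's real schema)."""
    page: int
    order: int  # position within the page; a layout model would infer this
    kind: str   # e.g. "text", "header", "footer"
    text: str


Stage = Callable[[List[Block]], List[Block]]


def detect_reading_order(blocks: List[Block]) -> List[Block]:
    """Stand-in for layout detection: sort by page, then by position."""
    return sorted(blocks, key=lambda b: (b.page, b.order))


def clean(blocks: List[Block]) -> List[Block]:
    """Drop headers and footers and collapse runs of whitespace."""
    return [
        replace(b, text=" ".join(b.text.split()))
        for b in blocks
        if b.kind not in {"header", "footer"}
    ]


def run_pipeline(blocks: List[Block], stages: List[Stage]) -> List[Block]:
    """Apply each stage in turn to the output of the previous one."""
    for stage in stages:
        blocks = stage(blocks)
    return blocks


def convert(blocks: List[Block]) -> str:
    """Order, clean, then combine the blocks into one text document."""
    processed = run_pipeline(blocks, [detect_reading_order, clean])
    return "\n\n".join(b.text for b in processed)
```

Keeping every stage to the same signature means a stage can be swapped or inserted without touching the others.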

Performance Benchmarks of the PDF Converter

Marker runs on GPU, CPU, or MPS. The published benchmarks report speed and accuracy, including the average conversion time per page and the overall conversion time for different PDF documents. The results were obtained on an A10 GPU using approximately 7GB of VRAM. For detailed benchmarks, refer to the GitHub repository.

High-speed PDF conversion.
Supports GPU, CPU, and MPS.
Detailed benchmarks available.
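
As a rough illustration of what choosing between GPU, CPU, and MPS involves in a PyTorch-based tool, the sketch below picks the first available backend. It is not Marker's own device-selection code; the function name `pick_device` is hypothetical, and the function falls back to `"cpu"` when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what this machine offers."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate; report CPU.
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"

    # The MPS backend (Apple silicon) is absent from older PyTorch builds.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


if __name__ == "__main__":
    print(f"Selected device: {pick_device()}")
```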

Marker's PDF Conversion Output Formats

Marker provides flexibility by offering three different output formats: markdown, JSON, and HTML. Each format is carefully structured to preserve the original layout and content of the input PDF. The markdown output includes image links, formatted tables, LaTeX equations, and formatted code blocks. HTML output mirrors this structure using appropriate HTML tags.

Markdown: Includes image links, tables, LaTeX, and code blocks.
JSON: Hierarchical structure representing document elements.
HTML: Images via img tags, equations using math tags, code in pre tags.

The JSON Output Structure for PDF Conversion

The JSON output is a tree, which makes it suitable for further processing or integration into other systems. Each page is a block whose child blocks represent individual elements such as text spans, images, tables, or equations. The structure is defined in `marker/schema/__init__.py` and includes metadata such as page polygons and section hierarchies, so the layout and content of the original PDF can be traced from the output.

Hierarchical JSON structure.
Each page is a block with child blocks for elements.
Metadata includes page geometry and section hierarchy.
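
A tree like this is usually consumed with a recursive walk. The sketch below assumes each block is a dictionary with a `block_type` string and a `children` list (or `None` for leaves); those field names, the `html` and `polygon` fields, and the sample document are assumptions made for illustration and should be checked against the actual schema.

```python
from collections import Counter
from typing import Iterator, List

# Hand-written sample; field names are assumed, not taken from real output.
SAMPLE_DOCUMENT = {
    "block_type": "Document",
    "children": [
        {
            "block_type": "Page",
            "polygon": [[0, 0], [612, 0], [612, 792], [0, 792]],
            "children": [
                {"block_type": "SectionHeader", "html": "<h1>Title</h1>", "children": None},
                {"block_type": "Text", "html": "<p>Body text.</p>", "children": None},
                {"block_type": "Table", "html": "<table></table>", "children": None},
            ],
        }
    ],
}


def iter_blocks(block: dict) -> Iterator[dict]:
    """Yield the block and all of its descendants, depth first."""
    yield block
    for child in block.get("children") or []:
        yield from iter_blocks(child)


def count_block_types(root: dict) -> Counter:
    """Count how many blocks of each type appear in the tree."""
    return Counter(b["block_type"] for b in iter_blocks(root))


def find_blocks(root: dict, block_type: str) -> List[dict]:
    """Return every block of the given type, in document order."""
    return [b for b in iter_blocks(root) if b["block_type"] == block_type]
```

For example, `find_blocks(SAMPLE_DOCUMENT, "Table")` returns the table blocks only, which is a convenient starting point for exporting tables to another system.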

Extending Marker: Customizing PDF Conversion

Marker's architecture is designed for extensibility. Users can customize conversion by overriding processors, creating new renderers, or adding support for new input formats. The core components (providers, builders, processors, renderers, schema, and converters) are well-defined, which makes it straightforward to integrate custom logic and enhance existing features. PyTorch is a core dependency for the deep learning models.

Extensible architecture using providers, builders, processors, and renderers.
Easy customization of processing behavior and output formats.
Supports adding new input formats.
Relies on PyTorch for deep learning functionalities.
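
The general shape of such a design is a converter that runs a list of processors and hands the result to a pluggable renderer. The sketch below shows that pattern only; every class and function name in it is hypothetical and none of it is Marker's API.

```python
from html import escape
from typing import Callable, Dict, List, Protocol

Block = Dict[str, str]  # hypothetical shape: {"type": ..., "text": ...}
Processor = Callable[[List[Block]], List[Block]]
FENCE = "`" * 3  # markdown code fence, built here to keep this example readable


class Renderer(Protocol):
    def render(self, blocks: List[Block]) -> str: ...


def drop_empty_blocks(blocks: List[Block]) -> List[Block]:
    """Example processor: remove blocks whose text is blank."""
    return [b for b in blocks if b["text"].strip()]


class MarkdownRenderer:
    def render(self, blocks: List[Block]) -> str:
        parts = []
        for b in blocks:
            if b["type"] == "heading":
                parts.append(f"## {b['text']}")
            elif b["type"] == "code":
                parts.append(f"{FENCE}\n{b['text']}\n{FENCE}")
            else:
                parts.append(b["text"])
        return "\n\n".join(parts)


class HtmlRenderer:
    def render(self, blocks: List[Block]) -> str:
        tags = {"heading": "h2", "code": "pre"}
        return "\n".join(
            f"<{tags.get(b['type'], 'p')}>{escape(b['text'])}</{tags.get(b['type'], 'p')}>"
            for b in blocks
        )


class Converter:
    """Runs each processor in order, then delegates output to the renderer."""

    def __init__(self, processors: List[Processor], renderer: Renderer) -> None:
        self.processors = processors
        self.renderer = renderer

    def convert(self, blocks: List[Block]) -> str:
        for processor in self.processors:
            blocks = processor(blocks)
        return self.renderer.render(blocks)
```

Swapping `MarkdownRenderer()` for `HtmlRenderer()` changes the output format without touching the processors, which is the kind of separation the component list describes.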

API and Installation of the PDF Converter

Marker offers command-line interfaces for single-file and multi-file conversion, as well as a simple API server that accepts conversion requests over HTTP. The library is installed with pip and requires Python 3.10+ and PyTorch. OCR uses the Surya library, while formatting and other elements depend on Texify and Tabled.

Command-line tools for single and batch PDF conversion.
Simple API server for programmatic access.
Requires Python 3.10+, PyTorch, and several other libraries.
Uses the Surya library for OCR.
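
A script can drive the single-file command-line tool through `subprocess`. The sketch below is built on assumptions that do not appear in this summary: the command name `marker_single` and the `--output_format` and `--output_dir` flags are recalled from the project README and should be confirmed with the tool's `--help` output before use. The helper names are hypothetical.

```python
import shutil
import subprocess
import sys
from pathlib import Path
from typing import List

OUTPUT_FORMATS = ("markdown", "json", "html")


def build_command(pdf: Path, output_dir: Path, output_format: str = "markdown") -> List[str]:
    """Build the argument list for one conversion (flag names are assumed)."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    return [
        "marker_single",  # assumed CLI entry point
        str(pdf),
        "--output_format", output_format,
        "--output_dir", str(output_dir),
    ]


def convert(pdf: Path, output_dir: Path, output_format: str = "markdown") -> None:
    """Run the conversion, failing early if the prerequisites are missing."""
    if sys.version_info < (3, 10):
        raise RuntimeError("Python 3.10 or newer is required")
    if shutil.which("marker_single") is None:
        raise RuntimeError("marker_single was not found on PATH")
    subprocess.run(build_command(pdf, output_dir, output_format), check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.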

Limitations and Future Developments of the PDF Conversion Tool

Marker has known limitations. Complex layouts with nested tables and forms may pose challenges, forms are not converted optimally, and tables and equations are sometimes misinterpreted. Further development is intended to address these limitations. The roadmap is discussed on the Discord server.

Known limitations: complex layouts, forms, tables, equations.
Roadmap for improvements available on Discord.

License and Commercial Use of the PDF Converter and API

Research and personal use are freely permitted, while commercial use is subject to certain restrictions. Model weights are licensed under `cc-by-nc-sa-4.0`, with exceptions for organizations meeting specific revenue and funding criteria. For details on commercial licensing and removing GPL requirements, consult Datalab's website. A hosted API is also available through Datalab.

Open-source with restrictions on commercial use.
Hosted API available via Datalab.
Licensing details available on the Datalab website.
