deepdoctection/deepdoctection



NEW

Version v.1.0 includes a major refactoring. Key changes include:

  • PyTorch-only support for all deep learning models.
  • Support for many more fine-tuned models from the Hugging Face Hub (BERT, RoBERTa, LayoutLM, LiLT, ...)
  • Decomposition into small sub-packages: dd-core, dd-datasets and deepdoctection
  • Type validations of core data structures
  • New test suite

A Package for Document Understanding

deepdoctection is a Python library that orchestrates layout analysis, OCR, and document and token classification for scanned and PDF documents. Build and run a pipeline for your document extraction tasks, develop your own document extraction workflow, fine-tune pre-trained models and use them seamlessly for inference.

Overview

  • Document layout analysis and table recognition in PyTorch with Detectron2 and Transformers.
  • OCR with support for Tesseract, docTR and AWS Textract.
  • Document and token classification with the LayoutLM family, LiLT and many BERT-style models, including features like sliding windows.
  • Text mining for native PDFs with pdfplumber.
  • Language detection with the transformer-based papluca/xlm-roberta-base-language-detection.
  • Deskewing and rotating images with jdeskew or Tesseract.
  • Fine-tuning object detection, document or token classification models and evaluating whole pipelines.
  • Lots of tutorials.

Have a look at the introduction notebook for an easy start.

Check the release notes for recent updates.


Hugging Face Space Demo

Try the demo of a document layout analysis pipeline with OCR on 🤗 Hugging Face Spaces.


Example

The following example shows how to use the built-in analyzer to decompose a PDF document into its layout structures.

import deepdoctection as dd
from IPython.display import HTML
from matplotlib import pyplot as plt

analyzer = dd.get_dd_analyzer()  # instantiate the built-in analyzer, similar to the Hugging Face Space demo

df = analyzer.analyze(path="/path/to/your/doc.pdf")  # set up the pipeline
df.reset_state()  # trigger some initialization

doc = iter(df)
page = next(doc)  # fetch the first page

image = page.viz(show_figures=True, show_residual_layouts=True)
plt.figure(figsize=(25, 17))
plt.axis("off")
plt.imshow(image)

HTML(page.tables[0].html)

print(page.text)

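The example inspects only the first page. As a minimal sketch, every page can be walked in the same way, assuming only the attributes already used in the example (`page.text`, `page.tables` and `table.html`). The helper name `collect_pages` and the layout of the returned dictionaries are illustrative and not part of deepdoctection.

```python
from typing import Any, Dict, Iterable, List


def collect_pages(pages: Iterable[Any]) -> List[Dict[str, Any]]:
    """Gather text and table HTML from every page of an analyzed document.

    `pages` is any iterable of page-like objects exposing `.text` and
    `.tables` (each table exposing `.html`). This helper is hypothetical
    and not part of the deepdoctection API.
    """
    results = []
    for number, page in enumerate(pages, start=1):
        results.append(
            {
                "page": number,  # 1-based position in iteration order
                "text": page.text,
                "tables": [table.html for table in page.tables],
            }
        )
    return results


# Intended usage with the built-in analyzer (requires deepdoctection):
#
#     import deepdoctection as dd
#     analyzer = dd.get_dd_analyzer()
#     df = analyzer.analyze(path="/path/to/your/doc.pdf")
#     df.reset_state()
#     for entry in collect_pages(df):
#         print(entry["page"], len(entry["tables"]), entry["text"][:80])
```

Because the helper relies only on duck typing, it can be tried out with simple stand-in objects before running a full pipeline.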


Requirements

  • Python >= 3.10
  • PyTorch >= 2.6
  • To fine-tune models, a GPU is recommended.
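The version requirements above can be checked before installing anything else. The following is a rough sketch using only the Python standard library; the function names are illustrative, and it assumes PyTorch is installed under its usual distribution name `torch`. It does not check for a GPU.

```python
import re
import sys
from importlib import metadata
from typing import List, Optional, Tuple

MIN_PYTHON = (3, 10)  # taken from the requirements list
MIN_TORCH = (2, 6)    # taken from the requirements list


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.6.0+cu124'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"cannot parse version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def installed_torch_version() -> Optional[str]:
    """Return the installed PyTorch version string, or None if it is absent."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


def check_requirements() -> List[str]:
    """Return human-readable problems; an empty list means both checks pass."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        found = ".".join(str(part) for part in sys.version_info[:2])
        problems.append(f"Python {found} found, >= 3.10 required")
    torch_version = installed_torch_version()
    if torch_version is None:
        problems.append("PyTorch is not installed")
    elif major_minor(torch_version) < MIN_TORCH:
        problems.append(f"PyTorch {torch_version} found, >= 2.6 required")
    return problems
```

Calling `check_requirements()` in the target environment returns an empty list when both minimum versions are met.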

Installation

We recommend using a virtual environment.

Get started installation

For a simple setup that is sufficient to parse documents with the default settings, install the following:

uv pip install timm  # needed for the default setup
uv pip install transformers
uv pip install python-doctr
uv pip install deepdoctection

This setup is sufficient to run the introduction notebook.
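To confirm that the packages from the commands above are visible to the interpreter, a short check along these lines may help. It is only a sketch: the helper `missing_modules` is illustrative, and `torch` is included because PyTorch is listed under the requirements. Note that the PyPI package `python-doctr` is imported as `doctr`.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Import names for the "Get started" packages, plus PyTorch.
GET_STARTED_MODULES = ("torch", "timm", "transformers", "doctr", "deepdoctection")


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be located for import."""
    return [name for name in names if find_spec(name) is None]
```

`missing_modules(GET_STARTED_MODULES)` should return an empty list once the setup is complete. The check only locates the modules and does not import them, so it stays fast even for large packages.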

Full installation

The following installation gives you a general setup so that you can experiment with various configurations. Remember that you always have to install PyTorch separately.

First, install Detectron2 separately, as it is not distributed via PyPI. Check the instructions here or try:

uv pip install --no-build-isolation detectron2@git+https://github.com/deepdoctection/detectron2.git

Then install deepdoctection with all its dependencies:

uv pip install "deepdoctection[full]"  # quotes keep shells such as zsh from expanding the brackets

For further information, please consult the full installation instructions.

Installation from source

Download the repository or clone via

git clone https://github.com/deepdoctection/deepdoctection.git

The easiest way is to install with make. A virtual environment is required:

make install-dd

Running a Docker container from Docker Hub

Pre-built Docker images can be downloaded from Docker Hub.

Additionally, specify a working directory to mount files to be processed into the container.

docker compose up -d

will start the container. Note that no endpoint is exposed.


Credits

We thank all libraries that provide high-quality code and pre-trained models. Without them, it would have been impossible to develop this framework.

If you like deepdoctection ...

...you can easily support the project by making it more visible. Leaving a star or a recommendation will help.

License

Distributed under the Apache 2.0 License. Check LICENSE for additional information.