NEW

Version v.1.0 includes a major refactoring. Key changes include:

PyTorch-only support for all deep learning models.
Support for many more fine-tuned models from the Hugging Face Hub (BERT, RoBERTa, LayoutLM, LiLT, ...)
Decomposition into small sub-packages: dd-core, dd-datasets and deepdoctection
Type validations of core data structures
New test suite

A Package for Document Understanding

deepdoctection is a Python library that orchestrates layout analysis for scanned and PDF documents, OCR, and document and token classification. Build and run a pipeline for your document extraction tasks, develop your own document extraction workflow, fine-tune pre-trained models and use them seamlessly for inference.

Overview

Document layout analysis and table recognition in PyTorch with Detectron2 and Transformers,
OCR with support for Tesseract, docTR and AWS Textract,
Document and token classification with the LayoutLM family, LiLT and many BERT-style models, including features like sliding windows,
Text mining for native PDFs with pdfplumber,
Language detection with the transformer-based papluca/xlm-roberta-base-language-detection,
Deskewing and rotating images with jdeskew or Tesseract,
Fine-tuning object detection, document or token classification models and evaluating whole pipelines,
Lots of tutorials.

Have a look at the introduction notebook for an easy start.

Check the release notes for recent updates.

Hugging Face Space Demo

Check the demo of a document layout analysis pipeline with OCR on 🤗 Hugging Face Spaces.

Example

The following example shows how to use the built-in analyzer to decompose a PDF document into its layout structures.

import deepdoctection as dd
from IPython.display import HTML
from matplotlib import pyplot as plt

analyzer = dd.get_dd_analyzer()  # instantiate the built-in analyzer, similar to the Hugging Face Space demo

df = analyzer.analyze(path="/path/to/your/doc.pdf")  # set up the pipeline
df.reset_state()  # trigger some initialization

doc = iter(df)
page = next(doc)  # process the first page

image = page.viz(show_figures=True, show_residual_layouts=True)  # render the detected layout
plt.figure(figsize=(25, 17))
plt.axis("off")
plt.imshow(image)

HTML(page.tables[0].html)  # show the first table as HTML (in a notebook)

print(page.text)  # print the page text
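
The example above inspects only the first page. As a minimal sketch, assuming every page object exposes the `text` and `tables` attributes used above (and every table an `html` attribute), the helper below gathers text and table HTML for all pages. The name `collect_pages` and the dictionary layout are hypothetical and not part of deepdoctection.

```python
from typing import Any, Dict, Iterable, List


def collect_pages(pages: Iterable[Any]) -> List[Dict[str, Any]]:
    """Collect text and table HTML from every page of an analyzed document.

    Hypothetical helper, not part of deepdoctection. It only assumes that
    each page offers `.text` and `.tables`, and that each table offers
    `.html`, as in the example above.
    """
    results = []
    for number, page in enumerate(pages, start=1):
        results.append(
            {
                "page": number,  # 1-based position in the iteration order
                "text": page.text,
                "tables": [table.html for table in page.tables],
            }
        )
    return results
```

With the `df` from the example, calling `collect_pages(df)` after `df.reset_state()` would be the intended usage. Pages without tables simply yield an empty `tables` list, so no index error can occur as it might with `page.tables[0]`.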

Requirements

Python >= 3.10
PyTorch >= 2.6
To fine-tune models, a GPU is recommended.
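
To check an environment against these minimum versions, something like the following could be used. It is only a sketch: the function names are hypothetical, and it assumes PyTorch is installed under the distribution name `torch`. Only the standard library is needed.

```python
import re
import sys
from importlib import metadata
from typing import List, Optional, Tuple

MIN_PYTHON = (3, 10)  # from the requirements above
MIN_TORCH = (2, 6)    # from the requirements above


def version_tuple(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '2.6.0+cu124'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def installed_torch_version() -> Optional[str]:
    """Return the installed PyTorch version, or None if torch is missing."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


def check_requirements() -> List[str]:
    """Return a list of problems; an empty list means the minimums are met."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python >= 3.10 is required")
    torch_version = installed_torch_version()
    if torch_version is None:
        problems.append("PyTorch is not installed")
    elif version_tuple(torch_version) < MIN_TORCH:
        problems.append(f"PyTorch >= 2.6 is required, found {torch_version}")
    return problems


if __name__ == "__main__":
    for line in check_requirements() or ["All minimum versions satisfied"]:
        print(line)
```

Comparing `(major, minor)` tuples rather than raw strings avoids the pitfall that `"2.10"` sorts before `"2.6"` as text.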

Installation

We recommend using a virtual environment.

Get started installation

For a simple setup that is enough to parse documents with the default settings, install the following:

uv pip install timm  # needed for the default setup
uv pip install transformers
uv pip install python-doctr
uv pip install deepdoctection

This setup is sufficient to run the introduction notebook.
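
To confirm that the packages from the commands above are visible to the interpreter, a quick check along these lines may help. This is a sketch, not an official tool: `missing_modules` is a hypothetical helper, and the list uses import names, which differ from the distribution name in one case (`python-doctr` is imported as `doctr`).

```python
from importlib.util import find_spec
from typing import Iterable, List

# Import names of the packages installed above (python-doctr -> doctr).
GET_STARTED_MODULES = ("timm", "transformers", "doctr", "deepdoctection")


def missing_modules(names: Iterable[str] = GET_STARTED_MODULES) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All get-started modules were found")
```

`find_spec` only locates the modules without importing them, so the check stays fast even though some of these libraries are slow to import.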

Full installation

The following installation will give you a general setup so that you can experiment with various configurations. Remember that you always have to install PyTorch separately.

First install Detectron2 separately, as it is not distributed via PyPI. Check the instructions here or try:

uv pip install --no-build-isolation detectron2@git+https://github.com/deepdoctection/detectron2.git

Then install deepdoctection with all its dependencies:

uv pip install "deepdoctection[full]"  # quotes keep shells such as zsh from expanding the brackets

For further information, please consult the full installation instructions.

Installation from source

Download the repository or clone via

git clone https://github.com/deepdoctection/deepdoctection.git

The easiest way is to install with make. A virtual environment is required:

make install-dd

Running a Docker container from Docker hub

Pre-built Docker images can be downloaded from Docker Hub.

Additionally, specify a working directory to mount files to be processed into the container.

docker compose up -d

This starts the container. Note that no endpoint is exposed.

Credits

We thank all libraries that provide high-quality code and pre-trained models. Without them, it would have been impossible to develop this framework.

If you like deepdoctection ...

...you can easily support the project by making it more visible. Leaving a star or a recommendation will help.

License

Distributed under the Apache 2.0 License. Check LICENSE for additional information.
