Document Layout Analysis Datasets

The table below lists Document Layout Analysis data sets and the labels each one contains.

Data Set	Labels
DocBank	Abstract, Author, Caption, Equation, Figure, Footer, List, Paragraph, Reference, Section, Table, Title
PubLayNet	Text, Title, List, Table, Figure
DocLayNet	Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title
GROTOAP2	Abstract, Acknowledgements, Affiliation, Author, Bib_info, Body_content, Conflict_statement, Correspondence, Dates, Editors, Figure, Glossary, Keywords, Page_number, References, Table, Title, Title_author, Type, Unknown

These data sets can be used to either train a model from scratch or to fine-tune a pre-trained model.

A pre-trained model could be a general object detection model like YOLOv5, which was trained to detect everyday objects in pictures (e.g. cats, dogs, cars), or a more specialized model trained specifically on documents, such as the LayoutLM family of models.

Converting GROTOAP2 to COCO

The GROTOAP2 dataset is distributed in an older format called TrueViz XML and must be converted to a modern format such as COCO before it can be used to fine-tune YOLOv5 or LayoutLM models.
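
As a rough illustration of that conversion, the sketch below reads zones from a TrueViz-style XML string and emits COCO-style annotation dictionaries with `[x, y, width, height]` boxes. The element names used here (`Page`, `Zone`, `ZoneCorners`, `Vertex`, `Classification/Category`) and the function name `zones_to_coco` are assumptions made for illustration and should be checked against the actual GROTOAP2 files. The sketch also assumes the vertex coordinates are already in the units of the page images; if they are not, they would need rescaling first.

```python
import xml.etree.ElementTree as ET


def zones_to_coco(xml_text, categories):
    """Convert TrueViz-style zones into COCO-style annotation dicts.

    `categories` maps a label name to an integer category id.
    The tag names below are assumed, not taken from a specification.
    """
    root = ET.fromstring(xml_text.strip())
    annotations = []
    for page_index, page in enumerate(root.iter("Page")):
        for zone in page.iter("Zone"):
            category = zone.find("Classification/Category")
            corners = zone.find("ZoneCorners")
            if category is None or corners is None:
                continue  # skip zones without a label or a bounding box
            label = category.get("Value")
            if label not in categories:
                continue  # skip labels we were not asked to export
            vertices = list(corners.iter("Vertex"))
            if not vertices:
                continue
            xs = [float(v.get("x")) for v in vertices]
            ys = [float(v.get("y")) for v in vertices]
            x_min, y_min = min(xs), min(ys)
            width, height = max(xs) - x_min, max(ys) - y_min
            annotations.append({
                "id": len(annotations) + 1,
                "image_id": page_index,  # one image per page (assumption)
                "category_id": categories[label],
                "bbox": [x_min, y_min, width, height],  # COCO box layout
                "area": width * height,
                "iscrowd": 0,
            })
    return annotations


# Hypothetical one-zone document used only to exercise the function.
SAMPLE = """
<Document>
  <Page>
    <Zone>
      <ZoneCorners>
        <Vertex x="10.0" y="20.0"/>
        <Vertex x="110.0" y="70.0"/>
      </ZoneCorners>
      <Classification><Category Value="TITLE"/></Classification>
    </Zone>
  </Page>
</Document>
"""

annotations = zones_to_coco(SAMPLE, {"TITLE": 1})
print(annotations[0]["bbox"])
```

A complete COCO file would also need `images` and `categories` sections alongside these annotations; they are left out here to keep the sketch short.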

Training YOLOv5 with Converted GROTOAP2 Data

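First, fine-tune the pre-trained yolov5s.pt weights on the converted data:

python train.py --img 1100 --batch 16 --epochs 3 --data grotoap2_full.yaml --weights yolov5s.pt

Then run the resulting weights on the test images (exp7 is the run directory from this particular training run):

python detect.py --weights runs/train/exp7/weights/best.pt --source ../grotoap2_test_images/ --imgsz 1100 1100 --line-thickness 1 --save-txt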
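
YOLOv5 reads its training labels from one text file per image, where each line has the form `class x_center y_center width height` with all four box values normalised to the range 0 to 1. COCO-style boxes (`[x_min, y_min, width, height]` in pixels) therefore need one more small conversion before training. The following is a minimal sketch of that step; the function names and the example numbers are made up for illustration.

```python
def coco_bbox_to_yolo(bbox, image_width, image_height):
    """Turn a COCO [x_min, y_min, width, height] pixel box into
    YOLO (x_center, y_center, width, height) normalised to 0..1."""
    x_min, y_min, width, height = bbox
    x_center = (x_min + width / 2) / image_width
    y_center = (y_min + height / 2) / image_height
    return (x_center, y_center, width / image_width, height / image_height)


def yolo_label_line(class_index, bbox, image_width, image_height):
    """Format one line of a YOLO label file for a single box."""
    values = coco_bbox_to_yolo(bbox, image_width, image_height)
    return f"{class_index} " + " ".join(f"{v:.6f}" for v in values)


# Hypothetical box on a 1000 x 1000 pixel page image.
line = yolo_label_line(2, [100, 200, 300, 100], 1000, 1000)
print(line)
```

Note that YOLO class indices start at 0, so category ids from a COCO file that start at 1 would need to be shifted accordingly.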

Nicholi Shiell - Notes