Document Layout Analysis Datasets

Here is a list of Document Layout Analysis data sets and the labels each one contains.

Data Set Labels
DocBank
    Abstract, Author, Caption, Equation, Figure,
    Footer, List, Paragraph, Reference, Section,
    Table, Title
PubLayNet
    Text, Title, List, Table, Figure
DocLayNet
    Caption, Footnote, Formula, List-Item, Page-footer,
    Page-header, Picture, Section-header, Table, Text,
    Title
GROTOAP2
    Abstract, Acknowledgements, Affiliation, Author, Bib_info,
    Body_content, Conflict_statement, Correspondence, Dates, Editors,
    Figure, Glossary, Keywords, Page_number,
    References, Table, Title, Title_author, Type,
    Unknown

These data sets can be used either to train a model from scratch or to fine-tune a pre-trained model.

A pre-trained model could be a general object detection model like YOLOv5, which was trained to detect objects in pictures (e.g. cats, dogs, cars), or a more specialized model trained specifically on documents, such as the LayoutLM family of models.

Converting GROTOAP2 to COCO

The GROTOAP2 dataset is stored in an older format called TrueViz XML and needs to be converted to a modern format before it can be used to fine-tune YOLOv5 or LayoutLM models.
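As a rough illustration of what such a conversion involves, here is a minimal sketch in Python. It rests on several assumptions that the post does not confirm: that each TrueViz `Page` contains `Zone` elements, that a zone stores two corner points under `ZoneCorners/Vertex` with `x` and `y` attributes, and that the label sits in `Classification/Category` as a `Value` attribute. The function name `trueviz_to_coco`, the `page_N.png` file names, and the page size are hypothetical. If the coordinates are in PDF points and the page images are rendered at another resolution, the boxes would also need to be scaled.

```python
import xml.etree.ElementTree as ET

# Assumed TrueViz layout (not confirmed by the post): <Page> holds <Zone>
# elements, each with two <Vertex x= y=> corners and a <Category Value=...>.
SAMPLE = """<Document>
  <Page>
    <Zone>
      <ZoneCorners>
        <Vertex x="50.0" y="40.0"/>
        <Vertex x="250.0" y="70.0"/>
      </ZoneCorners>
      <Classification><Category Value="TITLE"/></Classification>
    </Zone>
    <Zone>
      <ZoneCorners>
        <Vertex x="50.0" y="100.0"/>
        <Vertex x="300.0" y="400.0"/>
      </ZoneCorners>
      <Classification><Category Value="BODY_CONTENT"/></Classification>
    </Zone>
  </Page>
</Document>"""


def trueviz_to_coco(xml_text, page_width, page_height):
    """Build a COCO-style dict from one TrueViz document (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    coco = {"images": [], "annotations": [], "categories": []}
    category_ids = {}  # label name -> COCO category id (1-based)

    for page_index, page in enumerate(root.iter("Page")):
        image_id = page_index + 1
        coco["images"].append({
            "id": image_id,
            "file_name": f"page_{page_index}.png",  # hypothetical naming
            "width": page_width,
            "height": page_height,
        })
        for zone in page.findall("Zone"):
            vertices = zone.findall("ZoneCorners/Vertex")
            category = zone.find("Classification/Category")
            if len(vertices) < 2 or category is None:
                continue  # skip zones without a usable box or label
            xs = [float(v.get("x")) for v in vertices]
            ys = [float(v.get("y")) for v in vertices]
            x_min, y_min = min(xs), min(ys)
            width, height = max(xs) - x_min, max(ys) - y_min

            label = category.get("Value")
            if label not in category_ids:
                category_ids[label] = len(category_ids) + 1
                coco["categories"].append(
                    {"id": category_ids[label], "name": label}
                )
            coco["annotations"].append({
                "id": len(coco["annotations"]) + 1,
                "image_id": image_id,
                "category_id": category_ids[label],
                "bbox": [x_min, y_min, width, height],  # COCO: x, y, w, h
                "area": width * height,
                "iscrowd": 0,
            })
    return coco


coco = trueviz_to_coco(SAMPLE, page_width=612, page_height=792)
```

The resulting dictionary can be written out with `json.dump` to get a COCO annotation file.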

 

Training YOLOv5 with Converted GROTOAP2 Data


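Train, starting from the pre-trained yolov5s.pt weights:

python train.py --img 1100 --batch 16 --epochs 3 --data grotoap2_full.yaml --weights yolov5s.pt

Then run detection on the test images with the trained weights:

python detect.py --weights runs/train/exp7/weights/best.pt --source ../grotoap2_test_images/ --imgsz 1100 1100 --line-thickness 1 --save-txt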
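YOLOv5 reads its labels from one text file per image, where each line holds a zero-based class index followed by the box center, width, and height, all normalized to the image size. COCO boxes are instead stored as absolute `[x_min, y_min, width, height]`, so a small conversion step is needed between the two. The sketch below shows that arithmetic; the function names `coco_bbox_to_yolo` and `yolo_label_line` are hypothetical, and the example box and image size are made up for illustration.

```python
def coco_bbox_to_yolo(bbox, image_width, image_height):
    """Convert a COCO [x_min, y_min, w, h] box to normalized YOLO values."""
    x_min, y_min, width, height = bbox
    x_center = (x_min + width / 2) / image_width
    y_center = (y_min + height / 2) / image_height
    return x_center, y_center, width / image_width, height / image_height


def yolo_label_line(class_index, bbox, image_width, image_height):
    """Format one line of a YOLOv5 label file (class index is zero-based)."""
    values = coco_bbox_to_yolo(bbox, image_width, image_height)
    return f"{class_index} " + " ".join(f"{v:.6f}" for v in values)


# Illustrative values only: a 300x400 box at (100, 200) on a 1000x1000 image.
line = yolo_label_line(0, [100, 200, 300, 400], 1000, 1000)
print(line)  # 0 0.250000 0.400000 0.300000 0.400000
```

If the COCO category ids start at 1, subtract 1 to get the zero-based class index that YOLOv5 expects.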


