Document Layout Analysis Datasets
Here is a list of Document Layout Analysis data sets and the labels which they contain.
Data Set | Labels |
---|---|
DocBank | Footer, List, Paragraph, Reference, Section, Table, Title |
PubLayNet | Figure, List, Table, Text, Title |
DocLayNet | Page-header, Picture, Section-header, Table, Text, Title |
GROTOAP2 | Body_content, Conflict_statement, Correspondence, Dates, Editors, Figure, Glossary, Keywords, Page_number, References, Table, Title, Title_author, Type, Unknown |
These data sets can be used either to train a model from scratch or to fine-tune a pre-trained model.
A pre-trained model could be a general object detection model like YOLOv5, which was trained to detect everyday objects in pictures (e.g. cats, dogs, cars), or a more specialized model that was trained specifically on documents, such as the LayoutLM family of models.
Converting GROTOAP2 to COCO
The GROTOAP2 dataset is stored in an older format called TrueViz XML, so it needs to be converted to a modern format such as COCO before it can be used to fine-tune YOLOv5 or LayoutLM models.
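The conversion could look roughly like the sketch below. It assumes each TrueViz `Zone` holds a `ZoneCorners` element with `Vertex` points (`x`, `y` attributes) and a `Classification/Category` element whose `Value` attribute is the label; check this against your own files before relying on it. The function name `trueviz_to_coco`, the inline sample XML, and the page size are all made up for illustration, and the coordinates are passed through unscaled, so they would still need rescaling if the page images were rendered at a different resolution.

```python
import json
import xml.etree.ElementTree as ET

# Tiny inline sample in the ASSUMED TrueViz layout (not real GROTOAP2 data).
SAMPLE_XML = """<Document>
  <Page>
    <Zone>
      <ZoneCorners>
        <Vertex x="50.0" y="40.0"/>
        <Vertex x="250.0" y="100.0"/>
      </ZoneCorners>
      <Classification><Category Value="TITLE"/></Classification>
    </Zone>
    <Zone>
      <ZoneCorners>
        <Vertex x="50.0" y="120.0"/>
        <Vertex x="550.0" y="700.0"/>
      </ZoneCorners>
      <Classification><Category Value="BODY_CONTENT"/></Classification>
    </Zone>
  </Page>
</Document>"""


def trueviz_to_coco(xml_text, page_width, page_height, file_stem="doc"):
    """Build a COCO-style dict from one TrueViz document (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    coco = {"images": [], "annotations": [], "categories": []}
    category_ids = {}  # label name -> COCO category id

    for page_index, page in enumerate(root.iter("Page")):
        image_id = len(coco["images"]) + 1
        coco["images"].append({
            "id": image_id,
            "file_name": f"{file_stem}_{page_index}.png",  # assumed naming scheme
            "width": page_width,
            "height": page_height,
        })
        for zone in page.iter("Zone"):
            corners = zone.find("ZoneCorners")
            category = zone.find("Classification/Category")
            if corners is None or category is None:
                continue  # skip zones without a box or a label
            xs = [float(v.get("x")) for v in corners.findall("Vertex")]
            ys = [float(v.get("y")) for v in corners.findall("Vertex")]
            if not xs:
                continue
            label = category.get("Value")
            if label not in category_ids:
                category_ids[label] = len(category_ids) + 1
                coco["categories"].append({"id": category_ids[label], "name": label})
            # COCO boxes are [x_min, y_min, width, height].
            x_min, y_min = min(xs), min(ys)
            width, height = max(xs) - x_min, max(ys) - y_min
            coco["annotations"].append({
                "id": len(coco["annotations"]) + 1,
                "image_id": image_id,
                "category_id": category_ids[label],
                "bbox": [x_min, y_min, width, height],
                "area": width * height,
                "iscrowd": 0,
            })
    return coco


coco = trueviz_to_coco(SAMPLE_XML, page_width=612, page_height=792)
print(json.dumps(coco["annotations"][0]))
```

Category ids are assigned in the order the labels are first seen, so converting several files with one shared mapping would need the `category_ids` dictionary to be kept across calls.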
Training YOLOv5 with Converted GROTOAP2 Data
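First, fine-tune the pre-trained yolov5s.pt weights on the converted data:
python train.py --img 1100 --batch 16 --epochs 3 --data grotoap2_full.yaml --weights yolov5s.pt
Then run detection on the test images using the trained weights:
python detect.py --weights runs/train/exp7/weights/best.pt --source ../grotoap2_test_images/ --imgsz 1100 1100 --line-thickness 1 --save-txt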
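YOLOv5's train.py reads one text label file per image rather than COCO JSON. Each row is `class x_center y_center width height`, with coordinates normalized to the 0-1 range and class indices starting at zero. A minimal sketch of that last conversion step, assuming COCO-style `[x_min, y_min, width, height]` boxes in pixels; the helper name `coco_bbox_to_yolo_line` and the example numbers are made up for illustration:

```python
def coco_bbox_to_yolo_line(bbox, image_width, image_height, class_index):
    """Turn one COCO pixel box into one row of a YOLO label file."""
    x_min, y_min, width, height = bbox
    # YOLO stores the box center, not the top-left corner.
    x_center = (x_min + width / 2) / image_width
    y_center = (y_min + height / 2) / image_height
    return (
        f"{class_index} {x_center:.6f} {y_center:.6f} "
        f"{width / image_width:.6f} {height / image_height:.6f}"
    )


line = coco_bbox_to_yolo_line([50.0, 40.0, 200.0, 60.0], 600, 800, class_index=0)
print(line)  # 0 0.250000 0.087500 0.333333 0.075000
```

COCO category ids usually start at 1, so they would need to be shifted down by one (or remapped) to get the zero-based class index that YOLOv5 expects.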