Conference papers

GREYC@FinTOC-2022: Handling Document Layout and Structure in Native PDF Bundle of Documents

Emmanuel Giguet 1 Nadine Lucas 1 
1 Equipe SAFE - Laboratoire GREYC - UMR6072
GREYC - Groupe de Recherche en Informatique, Image et Instrumentation de Caen
Abstract : In this paper, we present our contribution to the FinTOC-2022 Shared Task "Financial Document Structure Extraction". We participated in the three tracks dedicated to English, French and Spanish document processing. Our main contribution consists in treating a financial prospectus as a bundle of documents, i.e., a set of merged documents, each with its own layout and structure. Document Layout and Structure Analysis (DLSA) therefore starts by detecting the boundaries of each document using general layout features. The analysis is then applied within each individual document, taking advantage of its local properties. DLSA simultaneously considers the text content, vector shapes and images embedded in the native PDF document. For the Title Detection task in English and French, we observed a significant improvement in F-measure compared with the scores obtained during our previous participation.

https://hal.archives-ouvertes.fr/hal-03741656
Contributor : Giguet Emmanuel
Submitted on : Monday, August 1, 2022 - 4:33:22 PM
Last modification on : Wednesday, August 3, 2022 - 9:45:06 AM

File

Giguet-Lucas-Fintoc-Bundle-202...
Publisher files allowed on an open archive

Identifiers

  • HAL Id : hal-03741656, version 1

Citation

Emmanuel Giguet, Nadine Lucas. GREYC@FinTOC-2022: Handling Document Layout and Structure in Native PDF Bundle of Documents. 4th Financial Narrative Processing Workshop (FNP 2022), Jun 2022, Marseille, France. pp.100-104. ⟨hal-03741656⟩
