
GREYC@FinTOC-2022: Handling Document Layout and Structure in Native PDF Bundle of Documents

Emmanuel Giguet (1), Nadine Lucas (1)
(1) Equipe SAFE - Laboratoire GREYC - UMR6072
GREYC - Groupe de Recherche en Informatique, Image et Instrumentation de Caen
Abstract: In this paper, we present our contribution to the FinTOC-2022 Shared Task "Financial Document Structure Extraction". We participated in the three tracks dedicated to English, French and Spanish document processing. Our main contribution consists in treating a financial prospectus as a bundle of documents, i.e., a set of merged documents, each with its own layout and structure. Document Layout and Structure Analysis (DLSA) therefore starts by detecting the boundaries of each document using general layout features. The analysis is then applied within each individual document, taking advantage of its local properties. DLSA simultaneously considers the text content, vector shapes and images embedded in the native PDF document. For the Title Detection task in English and French, we observed a significant improvement in F-measures compared with those obtained during our previous participation.
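To make the "bundle of documents" idea concrete, the following is a minimal sketch of boundary detection from general layout features. It is an illustration under stated assumptions, not the system described in the paper: the `PageLayout` record, the choice of features (page size, dominant body font and size), the tolerance value, and the `split_bundle` function are all hypothetical, and the per-page features are assumed to have already been extracted from the PDF by some other means.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageLayout:
    """Hypothetical per-page layout summary (an assumption, not from the paper)."""
    width: float      # page width in points
    height: float     # page height in points
    body_font: str    # dominant body font name on the page
    body_size: float  # dominant body font size in points


def split_bundle(pages, tol=0.5):
    """Group consecutive pages whose layout summary stays stable.

    Returns a list of (start, end) page-index pairs, end inclusive.
    A boundary is placed wherever any feature changes by more than `tol`
    (or the font name differs) between two adjacent pages.
    """
    if not pages:
        return []
    segments = []
    start = 0
    for i in range(1, len(pages)):
        prev, cur = pages[i - 1], pages[i]
        same_layout = (
            abs(prev.width - cur.width) <= tol
            and abs(prev.height - cur.height) <= tol
            and prev.body_font == cur.body_font
            and abs(prev.body_size - cur.body_size) <= tol
        )
        if not same_layout:
            segments.append((start, i - 1))
            start = i
    segments.append((start, len(pages) - 1))
    return segments


# Toy bundle: three pages in one layout followed by two pages in another.
pages = (
    [PageLayout(595.0, 842.0, "Arial", 10.0)] * 3
    + [PageLayout(595.0, 842.0, "Times", 9.0)] * 2
)
segments = split_bundle(pages)
print(segments)  # [(0, 2), (3, 4)]
```

Once such segments are available, structure analysis (for example, title detection) could be run separately on each one, so that thresholds and style statistics are estimated from a single document rather than from the whole bundle.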
Contributor : Giguet Emmanuel
Submitted on : Monday, August 1, 2022 - 4:33:22 PM
Last modification on : Wednesday, August 3, 2022 - 9:45:06 AM


Publisher files allowed on an open archive


  • HAL Id : hal-03741656, version 1


Emmanuel Giguet, Nadine Lucas. GREYC@FinTOC-2022: Handling Document Layout and Structure in Native PDF Bundle of Documents. 4th Financial Narrative Processing Workshop (FNP 2022), Jun 2022, Marseille, France. pp.100-104. ⟨hal-03741656⟩


