How OCR Performance can Impact on the Automatic Extraction of Dictionary Content Structures


Over the last decade, progress in OCR has triggered a massive trend towards the digitisation of legacy documents, with several Digital Humanities projects exploring ways of structuring retro-digitised dictionaries. However, there is a lack of awareness of the impact of OCR quality on the information extraction process. In this work, we shed light on the relationship between these two steps through experiments carried out with a TEI-based system for automatic parsing of dictionaries.

19th Annual Conference and Members’ Meeting of the Text Encoding Initiative Consortium (TEI) - What is text, really? TEI and beyond
Pedro Ortiz Suarez
Senior Researcher

I am a senior researcher at the Common Crawl Foundation.