GoTriple - How to Do Lexical Quality Estimation of a Large OCRed Historical Finnish Newspaper Collection with Scarce Resources

Abstract

The National Library of Finland has digitised and made available the historical newspapers published in Finland between 1771 and 1910 (Bremer-Laamanen 2014; Kettunen et al. 2014). This collection contains approximately 1.95 million pages in Finnish and Swedish. The Finnish part of the collection consists of about 2.40 billion words. The National Library's Digital Collections are offered via the digi.kansalliskirjasto.fi web service, also known as Digi. An open data package of the whole collection was released in early 2017 (Pääkkönen et al. 2016). The quality of OCRed collections is an important topic in digital humanities, as it affects the general usability and searchability of collections. There is no single available method for assessing the quality of large collections, but different methods can be used to approximate it. This paper discusses different corpus-analysis-style methods for approximating the overall lexical quality of the Finnish part of the Digi collection. The methods include the use of parallel samples and word error rates, the use of morphological analysers, frequency analysis of words, and comparisons with comparable lexical data. Our aim in the quality analysis is twofold: first, to analyse the present state of the data and, second, to establish a set of assessment methods that form a compact procedure for quality assessment after, e.g., re-OCRing or post-correction of the material.

Summary

The National Library of Finland has digitised and made available historical newspapers published in Finland between 1771 and 1910 (Bremer-Laamanen 2014; Kettunen et al. 2014). This collection contains approximately 1.95 million pages in Finnish and Swedish. The Finnish part of the collection consists of around 2.40 billion words. The National Library's digitised collections are offered on the digi.kansalliskirjasto.fi web service, also known as Digi. An open data package of the entire collection was released in early 2017 (Pääkkönen et al. 2016).
The quality of OCRed collections is an important theme for digital humanities, as it affects the usefulness and searchability of collections. There is no single method for assessing the quality of large collections, but different methods can be used to estimate it. This article discusses different corpus analysis methods for estimating the overall lexical quality of the Finnish part of the Digi collection. The methods include the use of parallel samples and word error rates, the use of morphological analysers, word frequency analysis, and comparisons with comparable written lexical data. Our objective in the quality analysis is twofold: first, to analyse the current state of the lexical data and, second, to establish a set of evaluation methods that constitute a compact procedure for quality assessment after, for example, re-OCRing or post-correction of the material.

Keywords: OCR quality; lexical quality estimation; 19th-century Finnish newspaper collection