Introducing NT2Lex: A Machine-readable CEFR-graded Lexical Resource for Dutch as a Foreign Language

Contributors

UCL - SSH/ILC - Institut Langage et Communication

UCL - SSH/ILC/PLIN - Pôle de recherche en linguistique

UCL - SSH/TALN - Centre de traitement automatique du langage

Abstract

Recent years have seen the emergence of several corpus-based graded lexical resources, whose entries are graded along a particular learning or difficulty scale. We argue that, unlike traditional single-level frequency-based lexicons, these graded lexicons help make the inherent complexity of words more apparent, and could thus prove useful in fields such as automatic text simplification. Until now, however, this type of resource has been available for only a few languages, including French as a first and second (L2) language (Lété et al., 2004; François et al., 2014) and Swedish L2 (François et al., 2016). The goal of our current work is therefore to build on these previous developments by presenting a similar resource for Dutch as a foreign language. Our presentation will be twofold. First, we will present the alpha version of the NT2Lex resource and describe the common methodology used to collect a corpus of readers and textbooks graded per level of the Common European Framework of Reference (CEFR) and to extract and refine the per-level lexical frequencies from the collected texts. Second, we will present a concrete application of the resource (and of graded lexicons in general) to the task of complex word identification and prediction. In particular, we will address the concrete challenges we face when using a predictive model of vocabulary knowledge at a given CEFR level.
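To make the idea of per-level lexical frequencies concrete, the following is a minimal sketch of how a graded lexicon could be derived from a corpus labelled by CEFR level. It is not the NT2Lex pipeline, whose details the abstract does not give: the function names, the whitespace tokenisation, the per-million normalisation, and the toy Dutch sentences are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Standard CEFR levels, ordered from beginner to proficient.
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def per_level_frequencies(corpus):
    """Build a toy graded lexicon (hypothetical helper).

    corpus: iterable of (cefr_level, text) pairs.
    Returns {word: {level: frequency per million tokens at that level}}.
    Tokenisation is naive lowercased whitespace splitting (an assumption);
    a real resource would rely on tagging and lemmatisation.
    """
    counts = {level: Counter() for level in CEFR_LEVELS}
    for level, text in corpus:
        counts[level].update(text.lower().split())

    lexicon = defaultdict(dict)
    for level in CEFR_LEVELS:
        total = sum(counts[level].values())
        for word, n in counts[level].items():
            # Normalise so that levels with different corpus sizes are comparable.
            lexicon[word][level] = n / total * 1_000_000
    return dict(lexicon)


def first_level(entry):
    """Return the lowest CEFR level at which a word is attested, or None."""
    for level in CEFR_LEVELS:
        if entry.get(level, 0) > 0:
            return level
    return None


# Toy corpus, invented for this example.
toy_corpus = [
    ("A1", "de kat slaapt"),
    ("A2", "de kat eet de vis"),
    ("B1", "de overheid besluit"),
]
lexicon = per_level_frequencies(toy_corpus)
print(first_level(lexicon["overheid"]))
```

Unlike a single-level frequency list, each entry here keeps one frequency per level, so a word that first appears in B1 material can be told apart from one that is already frequent at A1.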