Abstract
In recent years, there has been a real effort to build and promote corpora of children's writing, especially in French. Early research on writing acquisition relied on small corpora that were not widely distributed. Longitudinal corpora, which follow the productions of a cohort of children collected under similar conditions from one year to the next, do not yet exist for French.

Moreover, although natural language processing (NLP) has provided tools for a wide variety of corpora, few studies have addressed corpora of children's writing. This new field of application is a challenge for NLP because of the specific features of children's writing, particularly its deviation from the written norm. Currently available tools are therefore not suited to exploiting these corpora, and specific methods need to be developed for these written productions.

This thesis makes two main contributions. First, it has led to the creation of a large, digitized longitudinal corpus of writing by children aged 6 to 11, named the Scoledit corpus. Building it involved collecting, digitizing and transcribing the productions, annotating the linguistic data, and disseminating the resulting resource. Second, it has led to the development of a method for exploiting this corpus, called the comparison approach, which is based on comparing the transcription of children's productions with their standardized version.

To create a first level of alignment, this method compares transcribed forms with their normalized counterparts using the aligner AliScol. It also makes it possible to explore various linguistic analyses (lexical, morphographic, graphical). Finally, in order to analyse graphemes, an aligner of transcribed and normalized graphemes, called AliScol_Graph, was created.
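
The idea of aligning transcribed forms with their normalized counterparts can be illustrated with a minimal sketch. It does not reproduce AliScol, whose algorithm is not described here; it swaps in Python's standard `difflib.SequenceMatcher` as a generic stand-in. The function name `align_forms` and the sample sentence are hypothetical and are not taken from the Scoledit corpus.

```python
from difflib import SequenceMatcher


def align_forms(transcribed, normalized):
    """Pair each transcribed form with a normalized form.

    Illustrative only: a generic sequence alignment on whitespace
    tokens, not the AliScol algorithm.
    """
    t_tokens = transcribed.split()
    n_tokens = normalized.split()
    matcher = SequenceMatcher(a=t_tokens, b=n_tokens, autojunk=False)

    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        t_span = t_tokens[i1:i2]
        n_span = n_tokens[j1:j2]
        if len(t_span) == len(n_span):
            # Same number of forms on both sides: pair them one to one.
            pairs.extend((t, n, tag) for t, n in zip(t_span, n_span))
        else:
            # Unequal spans (insertions, deletions, merged or split
            # forms): keep the whole span as a single aligned block.
            pairs.append((" ".join(t_span), " ".join(n_span), tag))
    return pairs


if __name__ == "__main__":
    # Invented example of a child's spelling and its normalized version.
    for pair in align_forms("le cha manje la souri", "le chat mange la souris"):
        print(pair)
```

Each resulting triple holds a transcribed form, its normalized counterpart, and a tag (`equal`, `replace`, `insert` or `delete`). Pairs tagged `replace` are the ones where the child's spelling deviates from the norm, which is where lexical or morphographic analysis would start.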