Digital Humanities - Meet our people - Vincent Buntinx

Chronocloud on a journalistic corpus: A tool for visualizing the whole corpus independently of its language, based on the temporal frequency profile of words or n-grams. © Vincent Buntinx / 2017 EPFL


Vincent Buntinx is passionate about science and mathematics. After earning a licence (equivalent to a Master's degree) in Physics and a Master's degree in Actuarial Science, both in Belgium, he moved to Lausanne to work as an actuary in health insurance. Four years later he returned to academia and is now finalizing his thesis in Digital Humanities at the DHLAB at EPFL.

The evolution of the French Language

The recent availability of a large body of digitized texts spanning several centuries opens the way to new kinds of studies on the evolution of languages. This thesis studies a corpus of 4 million press articles covering a period of 200 years. It attempts to measure the evolution of French over this period, both at the level of individual words and expressions and more globally, by defining integrated measures of linguistic evolution.

The methodological choice is to introduce almost no linguistic hypotheses and to build new measures around the simple notion of the n-gram, a sequence of n consecutive words. On this basis the thesis explores the potential of known concepts such as temporal frequency profiles and their diachronic correlations. It also introduces new abstractions, such as the notion of a resilient linguistic kernel and the decomposition of profiles into solidified expressions according to simple statistical models. Using distributed computational techniques, it develops methods to test the relevance of these concepts on a large number of linguistic events, and thus offers a virtual observatory of the diachronic evolution associated with a given corpus.
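To make the notions of n-gram, temporal frequency profile and diachronic correlation concrete, here is a minimal sketch in Python. It is not the thesis code: the toy corpus, the function names (ngrams, frequency_profile, pearson), the whitespace tokenization and the choice of Pearson correlation are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

# Toy corpus (hypothetical): a few short "articles" grouped by year.
corpus = {
    1900: ["le chemin de fer arrive", "le chemin de fer part"],
    1950: ["le chemin de fer ferme", "la voiture arrive"],
    2000: ["la voiture part", "la voiture arrive"],
}

def ngrams(tokens, n):
    """Return all sequences of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def frequency_profile(corpus_by_year, target, n):
    """Relative frequency of `target` among all n-grams of each year."""
    profile = {}
    for year in sorted(corpus_by_year):
        counts = Counter()
        for text in corpus_by_year[year]:
            # Naive tokenization: lowercase and split on whitespace.
            counts.update(ngrams(text.lower().split(), n))
        total = sum(counts.values())
        profile[year] = counts[target] / total if total else 0.0
    return profile

def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if one is constant)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

rail = frequency_profile(corpus, ("chemin", "de", "fer"), 3)
car = frequency_profile(corpus, ("voiture",), 1)
print(rail)
print(car)
print(pearson(list(rail.values()), list(car.values())))
```

In this toy corpus the 3-gram "chemin de fer" declines over time while the 1-gram "voiture" rises, so the correlation between the two profiles is negative. On a real corpus the counting step would be distributed across many machines, and tokenization would need far more care than a whitespace split.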

On this basis, the thesis explores more precisely the multi-scale dimension of linguistic phenomena by considering how standardized measures evolve when applied to increasingly long n-grams. The scale from isolated entities (n=1) to complex and structured expressions (n=9) offers an axis of study that cuts across the classical divisions which ordinarily structure linguistics: syntax, semantics, pragmatics, and so on. The thesis explores the quantitative and qualitative diversity of phenomena at these different scales of language and develops a novel approach by proposing multi-scale measurements and formalizations, with the aim of characterizing more fundamental structural aspects of the phenomena studied.

Corpus Analysis Tools

See the DHLAB Corpus Analysis Tools.




Images to download

Chronocloud on the corpus of Google Books in Hebrew © Vincent Buntinx / 2017 EPFL
A tool for decomposing a word into a sum of n-grams. © Vincent Buntinx / 2017 EPFL
A tool for visualizing the temporal frequency profile of an n-gram. © Vincent Buntinx / 2017 EPFL
