One of the projects that have been keeping me busy in the past few years (funded by NSF grant BCS-1829290) is the curation of the XPF Corpus (/zɪf kɔɹpəs/). The main motivation is to enable cross-linguistic study of non-categorical universals in human sound systems. For instance, we know that final obstruent devoicing is common cross-linguistically, but do languages without categorical final devoicing still show it as a lexical dispreference, with fewer voiced obstruents in word-final position than expected? Several existing databases provide excellent cross-linguistic coverage of categorical phenomena (UPSID, WALS, P-base, and PHOIBLE, to name a few), but very few make it possible to ask whether a statistical trend exists in the lexicons of more than a handful of languages.

The XPF Corpus aims to bridge this gap. There are hundreds of languages whose written forms can be translated back to their phonemic forms. When these translations are applied to existing text corpora, such as the Crúbadán Corpus, the resulting phonemic representations can be used to answer questions that require information about the fine-grained distribution of sounds and their environments. The corpus comprises a set of rules that specify how alphabets can be translated to phonemic representations, language documentation summaries that justify particular correspondences, and Python and JavaScript code that can read a particular rule set and use it to translate written input to its corresponding phonemic representation.
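
To give a rough sense of what rule-based translation from spelling to phonemes involves, here is a minimal Python sketch. It is not the XPF code and does not use the XPF rule format: the rule table `TOY_RULES`, the function `translate`, and the `@` placeholder for unmapped characters are all hypothetical names made up for this illustration, and the toy orthography does not belong to any real language. The sketch simply tries the longest matching grapheme sequence first, so that a digraph like "sh" wins over its individual letters.

```python
from typing import Dict, List

# Hypothetical toy rule set: grapheme sequence -> phoneme.
# This is NOT the XPF rule format; it only illustrates the general idea.
TOY_RULES: Dict[str, str] = {
    "sh": "ʃ",
    "ch": "tʃ",
    "ng": "ŋ",
    "a": "a",
    "i": "i",
    "u": "u",
    "k": "k",
    "n": "n",
    "p": "p",
    "s": "s",
    "t": "t",
}

# Assumed placeholder for characters that no rule covers.
UNKNOWN = "@"


def translate(word: str, rules: Dict[str, str]) -> List[str]:
    """Translate a written word into a list of phonemes.

    At each position, the longest grapheme sequence found in `rules`
    is consumed; characters with no matching rule yield UNKNOWN.
    """
    word = word.lower()
    max_len = max(len(grapheme) for grapheme in rules)
    phonemes: List[str] = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in rules:
                phonemes.append(rules[chunk])
                i += size
                break
        else:
            # No rule matched even a single character.
            phonemes.append(UNKNOWN)
            i += 1
    return phonemes


if __name__ == "__main__":
    for w in ["shating", "chip", "kax"]:
        print(w, "->", " ".join(translate(w, TOY_RULES)))
```

Real orthographies usually need more than this, for example rules that depend on neighboring letters or on position in the word, which is part of why each rule set is accompanied by documentation justifying its correspondences.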

The corpus makes it possible to ask numerous questions that were difficult to address in the past. We immediately embarked on answering a number of them, and we hope many others will join us:

A preliminary manual of the corpus is available here.