One of the projects that has kept me busy over the past few years (supported by NSF, BCS-1829290) is the curation of the XPF Corpus (/zɪf kɔɹpəs/). The main motivation is to enable cross-linguistic study of non-categorical universals in human sound systems. For instance, we know that final obstruent devoicing is common cross-linguistically, but do languages that lack categorical final devoicing still show a lexical dispreference for voiced obstruents in final position? Several existing corpora provide excellent cross-linguistic coverage of categorical phenomena (UPSID, WALS, P-base, and PHOIBLE, to name a few), but very few make it possible to ask whether a statistical trend exists in the lexicon of more than a handful of languages.
The corpus makes it possible to ask numerous questions that were difficult to address in the past. We immediately embarked on answering a number of them, and we hope many others will join us:
Are sound systems of different languages similar to one another in the amount of information similar segments provide? (yes!)
Do categorical trends such as final devoicing manifest as soft constraints in languages that do not obey the categorical restriction? (yes!)
Are human languages designed to avoid information overload? (it seems so!)
Can distributional data shed light on outlier languages such as Somali (which seems to allow only voiced obstruents in final position) and Georgian (which is argued not to have a plain stop series)? (it’s complicated…)
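To make the idea of a soft constraint concrete, here is a minimal sketch of how one might compare the share of voiced obstruents in word-final position with the share elsewhere in a lexicon. Everything in it is an assumption made for illustration: the segment classes, the function names, and the toy lexicon are invented, and none of it reflects the corpus's actual format or the analysis we ran.

```python
from collections import Counter

# Hypothetical segment classes for illustration only; a real study would
# derive voicing from a proper feature chart for each language.
VOICED = {"b", "d", "g", "v", "z"}
VOICELESS = {"p", "t", "k", "f", "s"}


def voiced_rate(segments):
    """Return the share of obstruents in `segments` that are voiced.

    Non-obstruents are ignored. Returns None if there are no obstruents.
    """
    counts = Counter(
        "voiced" if seg in VOICED else "voiceless"
        for seg in segments
        if seg in VOICED or seg in VOICELESS
    )
    total = counts["voiced"] + counts["voiceless"]
    return counts["voiced"] / total if total else None


def final_vs_nonfinal(lexicon):
    """Compare voicing rates of word-final obstruents against all others."""
    final = [word[-1] for word in lexicon if word]
    nonfinal = [seg for word in lexicon for seg in word[:-1]]
    return voiced_rate(final), voiced_rate(nonfinal)


# Invented toy lexicon (each word is a list of segments), NOT corpus data.
toy_lexicon = [
    ["b", "a", "t"],
    ["d", "o", "g"],
    ["z", "i", "p"],
    ["g", "a", "s"],
    ["v", "e", "k"],
]

final_rate, nonfinal_rate = final_vs_nonfinal(toy_lexicon)
print(f"voiced share, final: {final_rate}, non-final: {nonfinal_rate}")
```

A final-position rate well below the non-final rate would be consistent with a lexical dispreference, though a real analysis would need many more words, a statistical test, and controls for things like morphology and frequency.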
A preliminary manual of the corpus is available here.