XPF Corpus

One of the projects that has kept me busy over the past few years (supported by the NSF, BCS-1829290) is the curation of the XPF Corpus (/zɪf kɔɹpəs/). The main motivation is to enable cross-linguistic study of non-categorical universals in human sound systems. For instance, we know that final obstruent devoicing is common cross-linguistically, but do languages that lack categorical final devoicing still show it as a lexical dispreference, that is, as a statistical tendency to avoid voiced obstruents word-finally? Several existing corpora provide excellent cross-linguistic coverage of categorical phenomena (UPSID, WALS, P-base, and PHOIBLE, to name a few), but very few make it possible to ask whether a statistical trend holds in the lexicons of more than a handful of languages.

The XPF Corpus aims to bridge this gap. There are hundreds of languages whose written forms can be translated back to their phonemic forms. When combined with existing corpora, such as the Crúbadán Corpus, the resulting phonemic representations can be used to answer questions that require information about the fine-grained distribution of sounds and their environments. The corpus comprises a set of rules that specify how alphabets can be translated to phonemic representations, language documentation summaries that justify particular correspondences, and Python and JavaScript code that can read a particular rule set and use it to translate written input to its corresponding phonemic representation.
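To make the idea of rule-based translation concrete, here is a minimal sketch of how a table of written-to-phonemic correspondences might be applied with greedy longest-match. The `TOY_RULES` table and the `transcribe` function are hypothetical and written only for illustration; they are not the XPF rule format or the corpus code, which handles context-sensitive rules and much more.

```python
# Hypothetical toy correspondences for illustration only.
# This is NOT the XPF rule format and covers just a few letters.
TOY_RULES = {
    "ch": "tʃ",  # a two-letter grapheme must win over plain "c"
    "c": "k",
    "e": "e",
    "i": "i",
    "j": "x",
    "l": "l",
    "m": "m",
    "o": "o",
    "p": "p",
}


def transcribe(word, rules):
    """Translate a written word to a list of phonemes.

    At each position, try the longest grapheme that has a rule,
    then fall back to shorter ones (greedy longest-match).
    """
    max_len = max(len(grapheme) for grapheme in rules)
    phonemes = []
    i = 0
    while i < len(word):
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in rules:
                phonemes.append(rules[chunk])
                i += size
                break
        else:
            # No rule matched at this position.
            raise ValueError(f"no rule for {word[i]!r} in {word!r}")
    return phonemes


print(transcribe("ejemplo", TOY_RULES))
print(transcribe("chico", TOY_RULES))
```

Longest-match is only one possible design choice here; it keeps digraphs such as "ch" from being split into two separate letters.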

The corpus makes it possible to ask numerous questions that were difficult to address in the past. We immediately embarked on answering a number of them, and we hope many others will join us:

Are sound systems of different languages similar to one another in the amount of information similar segments provide? (yes!)
Do categorical trends such as final-devoicing manifest as soft constraints in languages that do not obey the categorical restriction? (yes!)
Are human languages designed to avoid information overload? (it seems so!)
Can we bring distributional data to bear on outlier languages such as Somali (which seems to allow only voiced obstruents word-finally) and Georgian (which is argued not to have a plain stop series)? (it’s complicated…)
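As a rough illustration of what a soft-constraint question looks like in practice, the sketch below compares how often obstruents are voiced in word-final position versus elsewhere in a list of transcribed words. The segment sets, the `voiced_share` function, and the toy lexicon are all assumptions made up for this example; this is a simple descriptive count, not the analysis used in our studies.

```python
# Hypothetical, deliberately tiny obstruent inventories for illustration.
VOICED = {"b", "d", "g", "v", "z"}
VOICELESS = {"p", "t", "k", "f", "s"}


def voiced_share(words, final):
    """Share of obstruents that are voiced.

    words: iterable of phoneme lists, e.g. [["b", "a", "t"], ...]
    final: if True, look only at word-final segments;
           if False, look at all non-final segments.
    """
    voiced = 0
    total = 0
    for phonemes in words:
        segments = phonemes[-1:] if final else phonemes[:-1]
        for seg in segments:
            if seg in VOICED:
                voiced += 1
                total += 1
            elif seg in VOICELESS:
                total += 1
    # Return NaN when no obstruents were found in the chosen position.
    return voiced / total if total else float("nan")


# Made-up toy lexicon, not real corpus data.
toy_lexicon = [
    ["b", "a", "t"],
    ["d", "o", "g"],
    ["z", "i", "p"],
    ["g", "a", "s"],
]

print(voiced_share(toy_lexicon, final=True))
print(voiced_share(toy_lexicon, final=False))
```

A markedly lower voiced share in final position than elsewhere would be the kind of pattern that suggests a lexical dispreference, though a real analysis would need proper statistics and a full lexicon.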

Resources

A manual of the corpus is available here.
The main website
The GitHub repository of the corpus
A new Python library, xpfcorpus, makes it easier to integrate the corpus into other workflows:
```
from xpfcorpus import Transcriber
es = Transcriber("es")
print(es.transcribe("ejemplo"))
```
['e', 'x', 'e', 'm', 'p', 'l', 'o']

Use pip install xpfcorpus to install.

The package is pure Python with no dependencies. Read the documentation or visit the GitHub repository.