100 Paiwan Texts
Overview
The 100 Paiwan Texts corpus contains 100 morphologically parsed and glossed texts from Early and Whitehorn (2003).
Corpus Processing
This corpus was originally published as a book. Prof. Early graciously provided the original Word document, which we scraped. We took the following steps:
The free translation was not always on its own line in the docx file. This was fixed manually, the result being Paiwan Ch2.docx.
The scripts in the Jupyter notebook script.ipynb were then used to create the XMLs.
The character encodings from the original Word document did not transfer correctly, and no automatic solution was found. They were most likely fixed using regular expressions; unfortunately, the exact process was not recorded.
The following lines were fixed by hand, following the authors' errata notes:
story 061, sentence 034
story 062, sentence 022
story 071, sentence 064
story 072, sentence 045
story 074, sentence 070
story 075, sentence 075
story 095, sentence 008
story 096, sentence 049
story 097, sentence 022
The FormosanBank QC scripts clean_xml.py and standardize.py were run, as per the usual procedure. This mostly standardizes punctuation.
Because the original used a slightly different orthography than the modern orthography, the local script convert.py was used to convert the orthography to the standard. A table showing the conversion of the original orthography to IPA is available in the book itself. We used a similar conversion table provided by ILRDF for the standard modern orthography we are using. Note that Early and Whitehorn describe a few of the phonemes slightly differently from ILRDF, perhaps reflecting dialectal differences.
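The orthography conversion step can be pictured as a table-driven substitution. The sketch below is only an illustration of that general technique and is not the contents of convert.py: the function name build_converter and the mapping EXAMPLE_TABLE are hypothetical, and the grapheme pairs are placeholders rather than the actual correspondences between the book's orthography and the ILRDF standard. It assumes that the conversion can be expressed as context-free grapheme replacements, which may not hold for every phoneme.

```python
import re

# Placeholder grapheme pairs for illustration only. These are NOT the
# actual correspondences between the book's orthography and the ILRDF one.
EXAMPLE_TABLE = {
    "tj": "C",
    "t": "T",
    "q": "Q",
}


def build_converter(table):
    """Return a function that rewrites text using `table` (hypothetical helper)."""
    if not table:
        # Nothing to convert: return the text unchanged.
        return lambda text: text

    # Sort longer graphemes first so that a digraph such as "tj"
    # is matched before the single letter "t".
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))

    def convert(text):
        # One left-to-right pass: replaced output is never re-scanned,
        # so one rule cannot feed into another.
        return pattern.sub(lambda m: table[m.group(0)], text)

    return convert


convert = build_converter(EXAMPLE_TABLE)
print(convert("tjatq"))
```

Doing all replacements in a single pass avoids the chaining problem that arises when rules are applied one after another with repeated string replacement, where the output of one rule can accidentally match the input of the next. The sketch is case-sensitive and leaves any character not in the table untouched.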
Corpus Notes
At present, the final XMLs have not been carefully checked against the published PDF. This was judged low-priority because we were working from the original docx file. Please report any errors you spot to the FormosanBank maintainers.
Access Details
Users of this corpus are encouraged to obtain a copy of the original text. A PDF is currently available for free from the Australian National University. This book has extensive information about the texts and the language, including a sketch grammar.
Citation
Early, R. J., and Whitehorn, J. (2003). One hundred Paiwan texts. Pacific Linguistics, Research School of Pacific and Asian Studies, The Australian National University.
This corpus is available under a CC BY-NC license, with permission of R. J. Early. The text itself is also freely available online from several sources.
Last updated