ePark

Overview

The ePark corpus is an interactive resource for the preservation, learning, and revitalization of the Indigenous languages of Taiwan. Developed by the Indigenous Languages Research and Development Foundation (ILRDF), the platform serves a wide audience, including preschoolers, students, adults, and language teachers, with resources and tools designed for various learning levels and linguistic goals. The corpus covers all 42 officially recognized dialects of the 16 Formosan languages: Amis, Atayal, Saisiyat, Thao, Seediq, Bunun, Paiwan, Rukai, Truku, Kavalan, Tsou, Kanakanavu, Saaroa, Puyuma, Yami, and Sakizaya. It includes text, audio recordings, and translations, which makes it valuable for documenting linguistic diversity and for research, education, and language revitalization.


Features of the online platform

  1. Teaching Modules:

    • Elementary, Intermediate, and Advanced Levels: Progressive exercises focusing on listening, speaking, and writing skills.

    • Interactive Components: Drag-and-drop activities, typing exercises, and dialogue recording.

  2. Vocabulary Analysis:

    • Analyze text or individual words for classification, level range, and teaching material usage.

  3. Situational Language:

    • Covers 12 practical topics, such as greetings and classroom terms, with dialogue comprehension and dictation exercises.

  4. Short Essays and Advanced Texts:

    • Includes rich topics with sentence structure exercises and translation practice.

  5. Resource Download:

    • Comprehensive access to teaching materials, including digital aids and printable resources.


Key Features of the Corpus in FormosanBank

  • Comprehensive Data

    • The corpus contains a total of 2,680,175 tokens and over 587 hours of audio, making it a rich resource for the study of Formosan languages. Its quality reflects the extensive effort and careful processes involved in its creation. Below is a breakdown of the token count and audio duration for each language:

  • Audio Integration

    • High-quality audio files accompany most example sentences, providing an invaluable resource for analyzing pronunciation, phonology, and spoken language contexts.

  • Translation Availability

    • Example sentences are translated into Chinese, ensuring accessibility for researchers and learners.


Significance

  1. Language Preservation

    • The corpus supports the documentation and revitalization of endangered Indigenous languages, helping them remain a living cultural asset.

  2. Accessibility

    • The online platform makes these resources available to a global audience, transcending geographical boundaries.

  3. Education and Research

    • With rich examples and structured content, the ePark platform is invaluable for both language learners and researchers.

    • The same content is translated across the different languages and across the dialects of each language. This parallel structure is extremely valuable for learning and studying the Formosan languages, and it makes the corpus an integral part of FormosanBank.


Access Details

  • Visit the ePark corpus online platform at https://web.klokah.tw/

  • The repo containing the ePark corpus in FormosanBank as well as the code to reconstruct the corpus can be found here.


Acknowledgments

The ePark corpus was developed through collaboration between ILRDF, educators, linguists, and Indigenous communities. Without their tremendous effort and collaboration, it would not have been possible to include such a valuable resource in FormosanBank.


Corpus Notes

This section describes any notes regarding the corpus (e.g. explaining the presence of characters that aren't part of the standard orthography).

  • In the Amis corpus of ePark, the letter b occurs even though it isn't part of the official orthography. It appears in only 4 sentences, and only in two words: “Coylienboy” and Coy-lien-boy (which seems to be a broken-down version of the first word). Both words are quoted, so this doesn't indicate an orthographic issue with the corpus.

  • Similarly, the letter z occurs 13 times in the Amis corpus even though it isn't part of the orthography. All of the occurrences are in the two words Sakizaya and E-ziday, which seem to be quoted as well.

  • In the southern Amis dialect specifically, there are roughly 3,000 sentences/words containing the letter v. It is the only Amis dialect in which v appears, so the letter is likely used only in this dialect.
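Checks like the ones in the notes above can be reproduced with a short script. The following is a minimal sketch in Python, not the code used to build the corpus. The function names are hypothetical, the allowed-letter set is a placeholder rather than the official Amis alphabet, and the sample strings are just the words mentioned in the notes, not real corpus sentences.

```python
from collections import Counter


def unexpected_letter_counts(sentences, allowed_letters):
    """Count alphabetic characters that are not in allowed_letters.

    Comparison is case-insensitive; digits, spaces, and punctuation
    (including hyphens) are ignored.
    """
    allowed = {ch.lower() for ch in allowed_letters}
    counts = Counter()
    for sentence in sentences:
        for ch in sentence.lower():
            if ch.isalpha() and ch not in allowed:
                counts[ch] += 1
    return counts


def sentences_with_letter(sentences, letter):
    """Return the sentences that contain the given letter (case-insensitive)."""
    letter = letter.lower()
    return [s for s in sentences if letter in s.lower()]


if __name__ == "__main__":
    # ASSUMPTION: placeholder letter set for illustration only,
    # NOT the official Amis orthography.
    allowed = "acdefhiklmnoprstuwxy"
    # ASSUMPTION: stand-in strings built from the words named in the notes,
    # not actual sentences from the corpus.
    samples = ["Coylienboy", "Sakizaya", "E-ziday"]

    print(unexpected_letter_counts(samples, allowed))
    print(sentences_with_letter(samples, "z"))
```

Listing the affected sentences, rather than only counting letters, makes it easy to inspect by hand whether an unexpected letter comes from a quoted foreign word or from a genuine orthographic inconsistency.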


The content of the ePark corpus is released under a Creative Commons license (CC BY-NC-SA 4.0). Further information can be found here: https://web.klokah.tw/creativeCommons/


Citation

In accordance with our Terms of Use, if you use this corpus or any product derived from this corpus in any publication, you must cite both FormosanBank and:

  • Indigenous Languages Research and Development Foundation. (2020). 族語E樂園. https://web.klokah.tw/
