ePark
Overview
The ePark corpus is a comprehensive, interactive resource for preserving, learning, and revitalizing the Indigenous languages of Taiwan. Developed by the Indigenous Languages Research and Development Foundation (ILRDF), the platform serves a wide audience, including preschoolers, students, adults, and language teachers, with resources and tools for a range of learning levels and linguistic goals. The corpus covers all 42 officially recognized dialects of the 16 Formosan languages. It includes text, audio recordings, and translations, which makes it valuable for documenting linguistic diversity and for research, education, and language revitalization.
Corpus Statistics
| Dialect | Word count | Total audio | Transcribed | Untranscribed | Translated sentences (English) | Translated sentences (Mandarin) | Morphologically segmented | Proportion glossed |
|---|---|---|---|---|---|---|---|---|
| Amis Coastal | 58,782 | 13.4h | 13.4h | 0 | 2,157 | 7,520 | 0 | 0% |
| Amis Hengchun | 36,750 | 10.2h | 10.2h | 0 | 2,155 | 6,325 | 0 | 0% |
| Amis Malan | 36,137 | 8.4h | 8.4h | 0 | 2,144 | 6,276 | 0 | 0% |
| Amis Southern | 36,328 | 9.2h | 9.2h | 0 | 2,158 | 6,076 | 0 | 0% |
| Amis Xiuguluan | 37,157 | 9.9h | 9.9h | 0 | 2,156 | 6,373 | 0 | 0% |
| Bunun Junqun | 52,951 | 16.2h | 16.2h | 0 | 2,159 | 7,196 | 0 | 0% |
| Bunun Kaqun | 29,654 | 8.0h | 8.0h | 0 | 2,154 | 6,109 | 0 | 0% |
| Bunun Luanqun | 30,595 | 7.9h | 7.9h | 0 | 2,163 | 6,142 | 0 | 0% |
| Bunun Tanqun | 29,721 | 8.6h | 8.6h | 0 | 2,154 | 5,932 | 0 | 0% |
| Bunun Zhuoqun | 28,834 | 8.1h | 8.1h | 0 | 2,156 | 5,946 | 0 | 0% |
| Kavalan | 55,427 | 13.1h | 13.1h | 0 | 2,129 | 7,999 | 0 | 0% |
| Rukai Dawu | 29,992 | 9.5h | 9.5h | 0 | 2,151 | 6,283 | 0 | 0% |
| Rukai Dona | 28,565 | 8.9h | 8.9h | 0 | 2,154 | 6,193 | 0 | 0% |
| Rukai Eastern | 30,151 | 9.4h | 9.4h | 0 | 2,149 | 6,031 | 0 | 0% |
| Rukai Maolin | 27,693 | 8.8h | 8.8h | 0 | 2,158 | 6,149 | 0 | 0% |
| Rukai Wanshan | 24,350 | 8.4h | 8.4h | 0 | 2,150 | 6,295 | 0 | 0% |
| Rukai Wutai | 54,316 | 15.9h | 15.9h | 0 | 2,157 | 7,512 | 0 | 0% |
| Paiwan Central | 34,761 | 9.4h | 9.4h | 0 | 2,184 | 6,467 | 0 | 0% |
| Paiwan Eastern | 35,999 | 10.0h | 10.0h | 0 | 2,141 | 6,241 | 0 | 0% |
| Paiwan Northern | 59,628 | 13.9h | 13.9h | 0 | 2,202 | 8,095 | 0 | 0% |
| Paiwan Southern | 38,127 | 10.5h | 10.5h | 0 | 2,154 | 6,639 | 0 | 0% |
| Puyuma Jianhe | 35,383 | 9.4h | 9.4h | 0 | 2,153 | 6,502 | 0 | 0% |
| Puyuma Nanwang | 65,251 | 19.2h | 19.2h | 0 | 2,150 | 8,545 | 0 | 0% |
| Puyuma Xiqun | 37,157 | 9.9h | 9.9h | 0 | 2,229 | 6,581 | 0 | 0% |
| Puyuma Zhiben | 39,643 | 10.8h | 10.8h | 0 | 2,155 | 6,648 | 0 | 0% |
| Thao | 55,061 | 11.5h | 11.5h | 0 | 2,122 | 7,451 | 0 | 0% |
| Saaroa | 42,794 | 16.0h | 16.0h | 0 | 2,150 | 7,754 | 0 | 0% |
| Sakizaya | 53,556 | 14.0h | 14.0h | 0 | 2,149 | 7,386 | 0 | 0% |
| Yami | 63,454 | 13.9h | 13.9h | 0 | 2,065 | 7,449 | 0 | 0% |
| Atayal FourSeasons | 38,172 | 9.2h | 9.2h | 0 | 2,159 | 6,714 | 0 | 0% |
| Atayal Sekolik | 59,304 | 14.0h | 14.0h | 0 | 2,309 | 7,776 | 0 | 0% |
| Atayal Wanda | 31,056 | 8.8h | 8.8h | 0 | 2,154 | 5,896 | 0 | 0% |
| Atayal Wenshui | 37,004 | 9.5h | 9.5h | 0 | 2,156 | 6,035 | 0 | 0% |
| Atayal YilanZeaol | 40,229 | 10.3h | 10.3h | 0 | 2,160 | 6,276 | 0 | 0% |
| Atayal Zeaol | 39,479 | 9.9h | 9.9h | 0 | 2,149 | 6,369 | 0 | 0% |
| Seediq DeluValley | 39,861 | 11.7h | 11.7h | 0 | 2,159 | 6,367 | 0 | 0% |
| Seediq Duda | 41,788 | 10.1h | 10.1h | 0 | 2,158 | 6,264 | 0 | 0% |
| Seediq Tegudaya | 57,748 | 12.4h | 12.4h | 0 | 2,158 | 7,393 | 0 | 0% |
| Truku | 56,374 | 13.7h | 13.7h | 0 | 2,154 | 7,280 | 0 | 0% |
| Tsou | 56,625 | 15.2h | 15.2h | 0 | 2,150 | 7,652 | 0 | 0% |
| Kanakanabu | 48,937 | 16.1h | 16.1h | 0 | 2,148 | 7,771 | 0 | 0% |
| Saisiyat | 50,597 | 12.5h | 12.5h | 0 | 2,155 | 7,364 | 0 | 0% |
Access Details
Visit the ePark corpus online platform at https://web.klokah.tw/
The repo containing the ePark corpus in FormosanBank as well as the code to reconstruct the corpus can be found here.
Acknowledgments
The ePark corpus was developed through collaboration between ILRDF, educators, linguists, and Indigenous communities. Without their tremendous effort and collaboration, it would not have been possible to include such a valuable resource in FormosanBank.
Corpus Notes
The corpus appears to use the standard orthography ("Ortho113" in FormosanBank nomenclature), with some exceptions:
Ortho113 specifies that only the Nanshi dialect uses "u", while the other dialects use "o". However, a fair number of both "o" and "u" appear throughout the corpus, irrespective of dialect. The "standard" tier normalizes these to match the standard orthography. For the IPA transcriptions of the "original" tier, we recognize either "o" or "u" as referring to the same phoneme.
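The o/u handling described above can be illustrated with a minimal Python sketch. This is not FormosanBank's actual code: the function names, the dialect label string, and the plain letter-for-letter substitution are all assumptions made for illustration, and the real normalization may be more context-sensitive. The example strings are invented, not corpus data.

```python
# Hypothetical sketch of the o/u handling; names and rules are assumptions,
# not FormosanBank's implementation.

def normalize_o_u(text: str, dialect: str) -> str:
    """Rewrite o/u to the one letter the standard orthography expects.

    Assumes "Nanshi" is the label for the only dialect that uses "u".
    """
    if dialect == "Nanshi":
        table = str.maketrans({"o": "u", "O": "U"})
    else:
        table = str.maketrans({"u": "o", "U": "O"})
    return text.translate(table)


def same_modulo_o_u(a: str, b: str) -> bool:
    """Compare two original-tier forms, treating o and u as one phoneme."""
    fold = str.maketrans({"u": "o", "U": "O"})
    return a.translate(fold) == b.translate(fold)


# Invented placeholder strings, used only to show the behavior.
standard_nanshi = normalize_o_u("tolu", "Nanshi")
standard_other = normalize_o_u("tolu", "Malan")
```

Folding both letters to a single key, as in `same_modulo_o_u`, leaves the original tier unmodified while still letting mixed spellings be matched to the same phoneme.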
Copyright
The content of the ePark corpus is released under a Creative Commons license (CC BY-NC-SA 4.0). Further information can be found here: https://web.klokah.tw/creativeCommons/
Citation
In accordance with our Terms of Use, if you use this corpus or any product derived from this corpus in any publication, you must cite both FormosanBank and:
Indigenous Languages Research and Development Foundation. (2020). 族語E樂園. https://web.klokah.tw/