ePark

Overview

The ePark corpus is an interactive resource for the preservation, learning, and revitalization of the Indigenous languages of Taiwan. Developed by the Indigenous Languages Research and Development Foundation (ILRDF), this digital platform serves a wide audience, including preschoolers, students, adults, and language teachers, with resources and tools designed to support various learning levels and linguistic goals. The corpus covers all 42 officially recognized dialects of the 16 Formosan languages. It includes text, audio recordings, and translations, which makes it valuable for documenting linguistic diversity and for research, education, and language revitalization.


Corpus Statistics

| Dialect | Word count | Total audio | Transcribed | Untranscribed | Translated sentences (English) | Translated sentences (Mandarin) | Morphologically segmented | Proportion glossed |
|---|---|---|---|---|---|---|---|---|
| Amis Coastal | 58,782 | 13.4h | 13.4h | 0 | 2,157 | 7,520 | 0 | 0% |
| Amis Hengchun | 36,750 | 10.2h | 10.2h | 0 | 2,155 | 6,325 | 0 | 0% |
| Amis Malan | 36,137 | 8.4h | 8.4h | 0 | 2,144 | 6,276 | 0 | 0% |
| Amis Southern | 36,328 | 9.2h | 9.2h | 0 | 2,158 | 6,076 | 0 | 0% |
| Amis Xiuguluan | 37,157 | 9.9h | 9.9h | 0 | 2,156 | 6,373 | 0 | 0% |
| Bunun Junqun | 52,951 | 16.2h | 16.2h | 0 | 2,159 | 7,196 | 0 | 0% |
| Bunun Kaqun | 29,654 | 8.0h | 8.0h | 0 | 2,154 | 6,109 | 0 | 0% |
| Bunun Luanqun | 30,595 | 7.9h | 7.9h | 0 | 2,163 | 6,142 | 0 | 0% |
| Bunun Tanqun | 29,721 | 8.6h | 8.6h | 0 | 2,154 | 5,932 | 0 | 0% |
| Bunun Zhuoqun | 28,834 | 8.1h | 8.1h | 0 | 2,156 | 5,946 | 0 | 0% |
| Kavalan | 55,427 | 13.1h | 13.1h | 0 | 2,129 | 7,999 | 0 | 0% |
| Rukai Dawu | 29,992 | 9.5h | 9.5h | 0 | 2,151 | 6,283 | 0 | 0% |
| Rukai Dona | 28,565 | 8.9h | 8.9h | 0 | 2,154 | 6,193 | 0 | 0% |
| Rukai Eastern | 30,151 | 9.4h | 9.4h | 0 | 2,149 | 6,031 | 0 | 0% |
| Rukai Maolin | 27,693 | 8.8h | 8.8h | 0 | 2,158 | 6,149 | 0 | 0% |
| Rukai Wanshan | 24,350 | 8.4h | 8.4h | 0 | 2,150 | 6,295 | 0 | 0% |
| Rukai Wutai | 54,316 | 15.9h | 15.9h | 0 | 2,157 | 7,512 | 0 | 0% |
| Paiwan Central | 34,761 | 9.4h | 9.4h | 0 | 2,184 | 6,467 | 0 | 0% |
| Paiwan Eastern | 35,999 | 10.0h | 10.0h | 0 | 2,141 | 6,241 | 0 | 0% |
| Paiwan Northern | 59,628 | 13.9h | 13.9h | 0 | 2,202 | 8,095 | 0 | 0% |
| Paiwan Southern | 38,127 | 10.5h | 10.5h | 0 | 2,154 | 6,639 | 0 | 0% |
| Puyuma Jianhe | 35,383 | 9.4h | 9.4h | 0 | 2,153 | 6,502 | 0 | 0% |
| Puyuma Nanwang | 65,251 | 19.2h | 19.2h | 0 | 2,150 | 8,545 | 0 | 0% |
| Puyuma Xiqun | 37,157 | 9.9h | 9.9h | 0 | 2,229 | 6,581 | 0 | 0% |
| Puyuma Zhiben | 39,643 | 10.8h | 10.8h | 0 | 2,155 | 6,648 | 0 | 0% |
| Thao | 55,061 | 11.5h | 11.5h | 0 | 2,122 | 7,451 | 0 | 0% |
| Saaroa | 42,794 | 16.0h | 16.0h | 0 | 2,150 | 7,754 | 0 | 0% |
| Sakizaya | 53,556 | 14.0h | 14.0h | 0 | 2,149 | 7,386 | 0 | 0% |
| Yami | 63,454 | 13.9h | 13.9h | 0 | 2,065 | 7,449 | 0 | 0% |
| Atayal FourSeasons | 38,172 | 9.2h | 9.2h | 0 | 2,159 | 6,714 | 0 | 0% |
| Atayal Sekolik | 59,304 | 14.0h | 14.0h | 0 | 2,309 | 7,776 | 0 | 0% |
| Atayal Wanda | 31,056 | 8.8h | 8.8h | 0 | 2,154 | 5,896 | 0 | 0% |
| Atayal Wenshui | 37,004 | 9.5h | 9.5h | 0 | 2,156 | 6,035 | 0 | 0% |
| Atayal YilanZeaol | 40,229 | 10.3h | 10.3h | 0 | 2,160 | 6,276 | 0 | 0% |
| Atayal Zeaol | 39,479 | 9.9h | 9.9h | 0 | 2,149 | 6,369 | 0 | 0% |
| Seediq DeluValley | 39,861 | 11.7h | 11.7h | 0 | 2,159 | 6,367 | 0 | 0% |
| Seediq Duda | 41,788 | 10.1h | 10.1h | 0 | 2,158 | 6,264 | 0 | 0% |
| Seediq Tegudaya | 57,748 | 12.4h | 12.4h | 0 | 2,158 | 7,393 | 0 | 0% |
| Truku | 56,374 | 13.7h | 13.7h | 0 | 2,154 | 7,280 | 0 | 0% |
| Tsou | 56,625 | 15.2h | 15.2h | 0 | 2,150 | 7,652 | 0 | 0% |
| Kanakanabu | 48,937 | 16.1h | 16.1h | 0 | 2,148 | 7,771 | 0 | 0% |
| Saisiyat | 50,597 | 12.5h | 12.5h | 0 | 2,155 | 7,364 | 0 | 0% |


Access Details


Acknowledgments

The ePark corpus was developed through collaboration between ILRDF, educators, linguists, and Indigenous communities. Without their tremendous effort and collaboration, this valuable resource would not be available in FormosanBank.


Corpus Notes

The corpus appears to use the standard orthography ("Ortho113" in FormosanBank nomenclature), with some exceptions:

  • Ortho113 specifies that only the Nanshi dialect uses u, while the other dialects use o. However, both o and u occur fairly often throughout the corpus, irrespective of dialect. The standard tier normalizes these to match the standard orthography. For the IPA transcriptions of the original tier, we treat o and u as representing the same phoneme.
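The o/u handling described in the note above can be illustrated with a minimal sketch. This is not FormosanBank's actual normalization code: the function names, the dialect label "Nanshi" as a plain string, and the character-by-character replacement are all assumptions made for illustration, and the sample strings are made-up letter sequences rather than real words.

```python
def normalize_o_u(text: str, dialect: str) -> str:
    """Rewrite o/u to the single letter the standard orthography prescribes.

    Assumption: "Nanshi" is the only dialect label that takes "u";
    every other label takes "o". The label strings are hypothetical.
    """
    if dialect == "Nanshi":
        table = str.maketrans({"o": "u", "O": "U"})
    else:
        table = str.maketrans({"u": "o", "U": "O"})
    return text.translate(table)


def same_ignoring_o_u(a: str, b: str) -> bool:
    """Compare two forms while treating o and u as the same phoneme."""
    fold = str.maketrans({"o": "u", "O": "U"})
    return a.translate(fold) == b.translate(fold)


# "tuno" and "tonu" are invented strings, not attested words.
nanshi_form = normalize_o_u("tuno", "Nanshi")
other_form = normalize_o_u("tuno", "Malan")
match = same_ignoring_o_u("tuno", "tonu")
```

The first function mirrors what the standard tier is described as doing, and the second mirrors how the original tier is read for IPA purposes. A real implementation would probably need to account for cases this sketch ignores, such as letters inside loanwords or proper names.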


The content of the ePark corpus is released under a Creative Commons license (CC BY-NC-SA 4.0). Further information can be found here: https://web.klokah.tw/creativeCommons/


Citation

In accordance with our Terms of Use, if you use this corpus or any product derived from this corpus in any publication, you must cite both FormosanBank and:

  • Indigenous Languages Research and Development Foundation. (2020). 族語E樂園. https://web.klokah.tw/
