Corpora

Welcome to the FormosanBank Corpora section! Here, you’ll find comprehensive documentation of the corpora used in FormosanBank. Each corpus in this collection represents a unique linguistic dataset, encompassing various types of text and audio recordings. Our corpora are designed to support linguistic research, language education, and revitalization efforts, making these endangered languages accessible and analyzable for researchers, educators, and community members alike. The corpora currently in FormosanBank are listed below:


Coming Soon

In addition to the wide range of corpora already incorporated into FormosanBank, permission has been obtained to include a large number of further corpora, which are currently being processed. Some of these corpora are listed below:

  • NTU Corpus (Various languages)

  • Glosbe (Amis, Truku, Atayal, Saisiyat)

  • Xuan's books (Paiwan)

  • Amis Texts - Montgomery (Amis)

  • Wakelin (1958) Yami texts (Yami)

  • Matthew's Gospel and John's Gospel (Siraya)

  • 100 Paiwan Texts (Paiwan)

  • The Sedik Language of Formosa by Erin Asai (Seediq)

  • Chang's Seediq Reference Grammar (Seediq)

  • Chang's Kavalan Reference Grammar (Kavalan)

  • Poinsot Amis Dictionary (Amis)

  • Moedict Amis (Amis)

  • Whitehorn Collection (Paiwan, Amis, Atayal)

  • Asai's Seediq Language of Formosan (Seediq)

  • Wilang Yutas videos (Atayal)

  • hala saku la (videos - Atayal)

  • hala saku la (text - Atayal)

  • Tung's Descriptive study of Tsou (Tsou)

  • Jeng (1992) Topic and focus in Bunun (Bunun)

  • Blust's Thao Dictionary (Thao)

  • Rau & Dong (2006) Yami texts with reference grammar and dictionary (Yami)


Current numbers

As of the latest update, FormosanBank contains over 8 million tokens (8,075,594, to be exact) as well as 731 hours and 40 minutes of recorded audio across the 16 Formosan languages. The tables below break down the token count by language and by corpus, and the recorded audio by language.

Token Count per Language

Language      Tokens
Amis          2,213,003
Atayal        907,763
Paiwan        492,056
Bunun         318,422
Puyuma        340,520
Rukai         358,879
Tsou          99,694
Saisiyat      109,512
Yami          128,404
Thao          121,970
Kavalan       132,412
Truku         115,948
Sakizaya      1,504,757
Seediq        1,044,350
Saaroa        79,458
Kanakanavu    108,446


Token Count per Corpus

Source                     Tokens
Virginia Fey Dictionary    9,078 (Amis only)
ILRDF Dictionaries         659,295
NTU Paiwan ASR             68,332 (Paiwan only)
Presidential Apologies     29,793
Paiwan Stories             556 (Paiwan only)
Wikipedias                 4,628,365
ePark                      2,680,175
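
The two token tables can be cross-checked against the stated total with simple arithmetic. The following is a minimal sketch in Python, with the figures copied by hand from the tables above. The variable names are illustrative assumptions and are not part of any FormosanBank tooling.

```python
# Token counts copied from the "Token Count per Language" table.
tokens_per_language = {
    "Amis": 2_213_003, "Atayal": 907_763, "Paiwan": 492_056,
    "Bunun": 318_422, "Puyuma": 340_520, "Rukai": 358_879,
    "Tsou": 99_694, "Saisiyat": 109_512, "Yami": 128_404,
    "Thao": 121_970, "Kavalan": 132_412, "Truku": 115_948,
    "Sakizaya": 1_504_757, "Seediq": 1_044_350, "Saaroa": 79_458,
    "Kanakanavu": 108_446,
}

# Token counts copied from the "Token Count per Corpus" table.
tokens_per_corpus = {
    "Virginia Fey Dictionary": 9_078,
    "ILRDF Dictionaries": 659_295,
    "NTU Paiwan ASR": 68_332,
    "Presidential Apologies": 29_793,
    "Paiwan Stories": 556,
    "Wikipedias": 4_628_365,
    "ePark": 2_680_175,
}

# Total stated in the "Current numbers" paragraph.
STATED_TOTAL = 8_075_594

language_total = sum(tokens_per_language.values())
corpus_total = sum(tokens_per_corpus.values())

print(f"Sum over languages: {language_total:,}")  # 8,075,594
print(f"Sum over corpora:   {corpus_total:,}")    # 8,075,594
print("Consistent:", language_total == corpus_total == STATED_TOTAL)
```

Both breakdowns add up to the same 8,075,594 tokens, so the per-language and per-corpus tables describe the same set of data.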

Recorded Audio per Language

Language      Duration
Amis          72 hours 32 minutes 18 seconds
Atayal        87 hours 4 minutes 2 seconds
Paiwan        72 hours 1 minute 24 seconds
Bunun         71 hours 13 minutes 28 seconds
Puyuma        71 hours 8 minutes 6 seconds
Rukai         88 hours 19 minutes 41 seconds
Tsou          21 hours 49 minutes 36 seconds
Saisiyat      23 hours 11 minutes 33 seconds
Yami          21 hours 41 minutes 46 seconds
Thao          21 hours 49 minutes 28 seconds
Kavalan       25 hours 41 minutes 1 second
Truku         22 hours 41 minutes 47 seconds
Sakizaya      24 hours 39 minutes 49 seconds
Seediq        49 hours 46 minutes 30 seconds
Saaroa        26 hours 23 minutes 26 seconds
Kanakanavu    31 hours 36 minutes 54 seconds
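
The overall audio figure can be reproduced by adding up the per-language durations. The sketch below does this in Python with the standard-library `timedelta` type. The durations are copied by hand from the table above, and the names used are illustrative assumptions rather than part of any FormosanBank tooling.

```python
from datetime import timedelta

# (hours, minutes, seconds) copied from the "Recorded Audio per Language" table.
audio_per_language = {
    "Amis": (72, 32, 18), "Atayal": (87, 4, 2), "Paiwan": (72, 1, 24),
    "Bunun": (71, 13, 28), "Puyuma": (71, 8, 6), "Rukai": (88, 19, 41),
    "Tsou": (21, 49, 36), "Saisiyat": (23, 11, 33), "Yami": (21, 41, 46),
    "Thao": (21, 49, 28), "Kavalan": (25, 41, 1), "Truku": (22, 41, 47),
    "Sakizaya": (24, 39, 49), "Seediq": (49, 46, 30), "Saaroa": (26, 23, 26),
    "Kanakanavu": (31, 36, 54),
}

# Add the durations, starting from a zero-length timedelta.
total = sum(
    (timedelta(hours=h, minutes=m, seconds=s)
     for h, m, s in audio_per_language.values()),
    timedelta(),
)

# Split the total back into hours, minutes, and seconds.
total_seconds = int(total.total_seconds())
hours, remainder = divmod(total_seconds, 3600)
minutes, seconds = divmod(remainder, 60)

print(f"{hours} hours {minutes} minutes {seconds} seconds")
# 731 hours 40 minutes 49 seconds
```

The result matches the 731 hours and 40 minutes quoted under "Current numbers", which drops the remaining seconds.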
