Corpora
Welcome to the FormosanBank Corpora section! Here, you’ll find comprehensive documentation of the corpora used in FormosanBank. Each corpus in this collection is a distinct linguistic dataset and may include text, audio recordings, or both. Our corpora are designed to support linguistic research, language education, and revitalization efforts, making these endangered languages accessible and analyzable for researchers, educators, and community members alike. Below is a list of the corpora that currently make up FormosanBank:
Coming Soon
In addition to the wide range of corpora already incorporated into FormosanBank, we have obtained permission to include many more, and these are currently being processed. Some of them are listed below:
NTU Corpus (Various languages)
Glosbe (Amis, Truku, Atayal, Saisiyat)
Xuan's books (Paiwan)
Amis Texts - Montgomery (Amis)
Wakelin (1958) Yami texts (Yami)
Matthew's Gospel and John's Gospel (Siraya)
100 Paiwan Texts (Paiwan)
The Sedik Language of Formosa by Erin Asai (Seediq)
Chang's Seediq Reference Grammar (Seediq)
Chang's Kavalan reference grammar (Kavalan)
Poinsot Amis Dictionary (Amis)
Moedict Amis (Amis)
Whitehorn Collection (Paiwan, Amis, Atayal)
Asai's Seediq Language of Formosan (Seediq)
Wilang Yutas videos (Atayal)
hala saku la (videos - Atayal)
hala saku la (text - Atayal)
Tung's Descriptive study of Tsou (Tsou)
Jeng (1992) Topic and focus in Bunun (Bunun)
Blust's Thao Dictionary (Thao)
Rau & Dong (2006) Yami texts with reference grammar and dictionary (Yami)
Current numbers
FormosanBank currently contains over 8 million tokens (8,075,594 to be exact) and 731 hours and 40 minutes of recorded audio across the 16 Formosan languages. The tables below break down the token count by language and by corpus, and the recorded audio by language.
Token Count per Language
| Language | Tokens |
| --- | --- |
| Amis | 2,213,003 |
| Atayal | 907,763 |
| Paiwan | 492,056 |
| Bunun | 318,422 |
| Puyuma | 340,520 |
| Rukai | 358,879 |
| Tsou | 99,694 |
| Saisiyat | 109,512 |
| Yami | 128,404 |
| Thao | 121,970 |
| Kavalan | 132,412 |
| Truku | 115,948 |
| Sakizaya | 1,504,757 |
| Seediq | 1,044,350 |
| Saaroa | 79,458 |
| Kanakanavu | 108,446 |
Token Count per Corpus
| Source | Tokens |
| --- | --- |
| Virginia Fey Dictionary | 9,078 (Amis only) |
| ILRDF Dictionaries | 659,295 |
| NTU Paiwan ASR | 68,332 (Paiwan only) |
| Presidential Apologies | 29,793 |
| Paiwan Stories | 556 (Paiwan only) |
| Wikipedias | 4,628,365 |
| ePark | 2,680,175 |
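The per-language and per-corpus breakdowns should both add up to the overall token count quoted above. The following is a minimal sketch of that cross-check; the figures are copied from the two token tables, and the variable names are illustrative rather than part of any FormosanBank tooling.

```python
# Token counts copied from the "Token Count per Language" table.
tokens_by_language = {
    "Amis": 2_213_003, "Atayal": 907_763, "Paiwan": 492_056,
    "Bunun": 318_422, "Puyuma": 340_520, "Rukai": 358_879,
    "Tsou": 99_694, "Saisiyat": 109_512, "Yami": 128_404,
    "Thao": 121_970, "Kavalan": 132_412, "Truku": 115_948,
    "Sakizaya": 1_504_757, "Seediq": 1_044_350, "Saaroa": 79_458,
    "Kanakanavu": 108_446,
}

# Token counts copied from the "Token Count per Corpus" table.
tokens_by_corpus = {
    "Virginia Fey Dictionary": 9_078,
    "ILRDF Dictionaries": 659_295,
    "NTU Paiwan ASR": 68_332,
    "Presidential Apologies": 29_793,
    "Paiwan Stories": 556,
    "Wikipedias": 4_628_365,
    "ePark": 2_680_175,
}

total_by_language = sum(tokens_by_language.values())
total_by_corpus = sum(tokens_by_corpus.values())

print(f"Total by language: {total_by_language:,}")  # 8,075,594
print(f"Total by corpus:   {total_by_corpus:,}")    # 8,075,594
```

Both sums come to 8,075,594, which matches the total stated in the summary.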
Recorded Audio per Language
| Language | Recorded audio |
| --- | --- |
| Amis | 72 hours 32 minutes 18 seconds |
| Atayal | 87 hours 4 minutes 2 seconds |
| Paiwan | 72 hours 1 minute 24 seconds |
| Bunun | 71 hours 13 minutes 28 seconds |
| Puyuma | 71 hours 8 minutes 6 seconds |
| Rukai | 88 hours 19 minutes 41 seconds |
| Tsou | 21 hours 49 minutes 36 seconds |
| Saisiyat | 23 hours 11 minutes 33 seconds |
| Yami | 21 hours 41 minutes 46 seconds |
| Thao | 21 hours 49 minutes 28 seconds |
| Kavalan | 25 hours 41 minutes 1 second |
| Truku | 22 hours 41 minutes 47 seconds |
| Sakizaya | 24 hours 39 minutes 49 seconds |
| Seediq | 49 hours 46 minutes 30 seconds |
| Saaroa | 26 hours 23 minutes 26 seconds |
| Kanakanavu | 31 hours 36 minutes 54 seconds |
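The overall audio figure can be reproduced by adding up the per-language durations. The sketch below converts each duration to seconds, sums them, and converts back; the durations are copied from the audio breakdown above, and the names used are illustrative only.

```python
# Recorded audio per language as (hours, minutes, seconds),
# copied from the "Recorded Audio per Language" breakdown.
audio_by_language = {
    "Amis": (72, 32, 18), "Atayal": (87, 4, 2), "Paiwan": (72, 1, 24),
    "Bunun": (71, 13, 28), "Puyuma": (71, 8, 6), "Rukai": (88, 19, 41),
    "Tsou": (21, 49, 36), "Saisiyat": (23, 11, 33), "Yami": (21, 41, 46),
    "Thao": (21, 49, 28), "Kavalan": (25, 41, 1), "Truku": (22, 41, 47),
    "Sakizaya": (24, 39, 49), "Seediq": (49, 46, 30), "Saaroa": (26, 23, 26),
    "Kanakanavu": (31, 36, 54),
}

# Convert every duration to seconds and add them up.
total_seconds = sum(h * 3600 + m * 60 + s for h, m, s in audio_by_language.values())

# Convert the total back to hours, minutes, and seconds.
hours, remainder = divmod(total_seconds, 3600)
minutes, seconds = divmod(remainder, 60)

print(f"{hours} hours {minutes} minutes {seconds} seconds")
# 731 hours 40 minutes 49 seconds
```

The result, 731 hours 40 minutes 49 seconds, agrees with the 731 hours and 40 minutes quoted in the summary.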
Last updated