Corpora

Welcome to the FormosanBank Corpora section! Here, you’ll find documentation for each corpus used in FormosanBank. Each corpus is a distinct linguistic dataset containing text, audio recordings, or both. Our corpora are designed to support linguistic research, language education, and revitalization efforts, making these endangered languages accessible and analyzable for researchers, educators, and community members alike. FormosanBank currently consists of the following corpora:

ePark (text, audio, English, Mandarin)
ILRDF Dictionaries (text, audio, Mandarin)
Wikipedias (text)
Presidential Apologies (text, English, Mandarin)
NTU Paiwan ASR (text, audio)
Virginia Fey's Amis Dictionary (text, English, Mandarin)
Paiwan Stories (text, audio, Mandarin)
Rau-Dong (text, glossing, English, Mandarin)
The Montgomery Texts (text, English)
100 Paiwan Texts (text, glossing, English)

Coming Soon

In addition to the corpora already incorporated into FormosanBank, we have obtained permission to include many further corpora, which are currently being processed. Some of these corpora are listed below:

NTU Corpus (Various languages)
Glosbe (Amis, Truku, Atayal, Saisiyat)
Wakelin (1958) Yami texts (Yami)
Matthew's Gospel and John's Gospel (Siraya)
The Sedik Language of Formosa by Erin Asai (Seediq)
Chang's Seediq Reference Grammar (Seediq)
Chang's Kavalan reference grammar (Kavalan)
Poinsot Amis Dictionary (Amis)
Moedict Amis (Amis)
Whitehorn Collection (Paiwan, Amis, Atayal)
Asai's Seediq Language of Formosan (Seediq)
Wilang Yutas videos (Atayal)
hala saku la (videos - Atayal)
hala saku la (text - Atayal)
Tung's Descriptive study of Tsou (Tsou)
Jeng (1992) Topic and focus in Bunun (Bunun)
Blust's Thao Dictionary (Thao)

Current numbers

FormosanBank currently contains over 8 million tokens (8,075,594, to be precise) as well as 731 hours and 40 minutes of recorded audio across the 16 Formosan languages. The tables below break down the most up-to-date token counts by language and by corpus, and the recorded audio by language.

Token Count per Language

Language      Tokens
Amis          2,184,946
Atayal        870,844
Paiwan        480,574
Bunun         317,946
Puyuma        330,571
Rukai         358,879
Tsou          99,694
Saisiyat      109,512
Yami          128,404
Thao          121,970
Kavalan       132,412
Truku         105,872
Sakizaya      1,473,448
Seediq        1,035,877
Saaroa        79,458
Kanakanavu    108,446

Token Count per Corpus

Source                            Tokens
Virginia Fey's Amis Dictionary    9,078 (Amis only)
ILRDF Dictionaries                659,295
NTU Paiwan ASR                    68,332 (Paiwan only)
Presidential Apologies            29,793
Paiwan Stories                    556 (Paiwan only)
Wikipedias                        4,628,365
ePark                             2,680,175
Rau-Dong                          13,274 (Yami only)
100 Paiwan Texts                  24,469 (Paiwan only)

Recorded Audio per Language

Language      Duration
Amis          72 hours 32 minutes 18 seconds
Atayal        87 hours 4 minutes 2 seconds
Paiwan        72 hours 1 minute 24 seconds
Bunun         71 hours 13 minutes 28 seconds
Puyuma        71 hours 8 minutes 6 seconds
Rukai         88 hours 19 minutes 41 seconds
Tsou          21 hours 49 minutes 36 seconds
Saisiyat      23 hours 11 minutes 33 seconds
Yami          21 hours 41 minutes 46 seconds
Thao          21 hours 49 minutes 28 seconds
Kavalan       25 hours 41 minutes 1 second
Truku         22 hours 41 minutes 47 seconds
Sakizaya      24 hours 39 minutes 49 seconds
Seediq        49 hours 46 minutes 30 seconds
Saaroa        26 hours 23 minutes 26 seconds
Kanakanavu    31 hours 36 minutes 54 seconds
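
The per-language durations can be added up to check them against the overall audio figure quoted under Current numbers. The following is a minimal Python sketch of that arithmetic, not part of FormosanBank's own tooling; the names AUDIO, to_seconds, and format_duration are hypothetical helpers introduced here for illustration, and the data is copied by hand from the table above.

```python
import re

# Durations copied by hand from the "Recorded Audio per Language" table.
AUDIO = {
    "Amis": "72 hours 32 minutes 18 seconds",
    "Atayal": "87 hours 4 minutes 2 seconds",
    "Paiwan": "72 hours 1 minute 24 seconds",
    "Bunun": "71 hours 13 minutes 28 seconds",
    "Puyuma": "71 hours 8 minutes 6 seconds",
    "Rukai": "88 hours 19 minutes 41 seconds",
    "Tsou": "21 hours 49 minutes 36 seconds",
    "Saisiyat": "23 hours 11 minutes 33 seconds",
    "Yami": "21 hours 41 minutes 46 seconds",
    "Thao": "21 hours 49 minutes 28 seconds",
    "Kavalan": "25 hours 41 minutes 1 second",
    "Truku": "22 hours 41 minutes 47 seconds",
    "Sakizaya": "24 hours 39 minutes 49 seconds",
    "Seediq": "49 hours 46 minutes 30 seconds",
    "Saaroa": "26 hours 23 minutes 26 seconds",
    "Kanakanavu": "31 hours 36 minutes 54 seconds",
}

# Seconds per unit name (singular form; a plural "s" is handled by the regex).
UNIT_SECONDS = {"hour": 3600, "minute": 60, "second": 1}


def to_seconds(text):
    """Convert a string such as '25 hours 41 minutes 1 second' to seconds."""
    total = 0
    for value, unit in re.findall(r"(\d+)\s+(hour|minute|second)s?", text):
        total += int(value) * UNIT_SECONDS[unit]
    return total


def format_duration(seconds):
    """Render a number of seconds as 'H hours M minutes S seconds'."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours} hours {minutes} minutes {secs} seconds"


total_seconds = sum(to_seconds(d) for d in AUDIO.values())
print(format_duration(total_seconds))  # 731 hours 40 minutes 49 seconds
```

The result agrees with the 731 hours and 40 minutes given in the summary. The same pattern (a dictionary of table rows plus a sum) could be applied to the token tables.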
