# Corpora

Welcome to the FormosanBank Corpora section! Here you'll find documentation for each corpus included in FormosanBank. Each corpus is a distinct linguistic dataset made up of text, audio recordings, or both. The corpora are intended to support linguistic research, language education, and revitalization efforts by making these endangered languages accessible and analyzable for researchers, educators, and community members. FormosanBank currently consists of the following corpora; the notes in parentheses indicate the data types and translation languages available for each:

* [ePark](https://ai4commsci.gitbook.io/formosanbank/the-bank-architecture/corpora/epark) (text, audio, English, Mandarin)
* [ILRDF Dictionaries](https://ai4commsci.gitbook.io/formosanbank/the-bank-architecture/corpora/ilrdf-dictionaries) (text, audio, Mandarin)
* [Wikipedias](https://ai4commsci.gitbook.io/formosanbank/the-bank-architecture/corpora/wikipedias) (text)
* [Presidential Apologies](https://ai4commsci.gitbook.io/formosanbank/the-bank-architecture/corpora/presidential-apologies) (text, English, Mandarin)
* [NTU Paiwan ASR](https://ai4commsci.gitbook.io/formosanbank/the-bank-architecture/corpora/ntu-paiwan-asr) (text, audio)
* [Virginia Fey's Amis Dictionary](https://ai4commsci.gitbook.io/formosanbank/the-bank-architecture/corpora/virginia-feys-amis-dictionary) (text, English, Mandarin)
* [Paiwan Stories](https://ai4commsci.gitbook.io/formosanbank/the-bank-architecture/corpora/paiwan-stories) (text, audio, Mandarin)
* [Rau-Dong](https://ai4commsci.gitbook.io/formosanbank/the-bank-architecture/corpora/raudong) (text, glossing, English, Mandarin)
* [The Montgomery Texts](https://ai4commsci.gitbook.io/formosanbank/the-bank-architecture/corpora/montgomerytexts) (text, English)
* [The Wakelin Texts](https://ai4commsci.gitbook.io/formosanbank/the-bank-architecture/corpora/wakelintexts) (text, English, segmented)
* [100 Paiwan Texts](https://ai4commsci.gitbook.io/formosanbank/the-bank-architecture/corpora/hundredpaiwantexts) (text, glossing, English)
* [FormosanBank GitBook](https://ai4commsci.gitbook.io/formosanbank/the-bank-architecture/corpora/formosanbankgitbook)
* [SEALS 33](https://ai4commsci.gitbook.io/formosanbank/the-bank-architecture/corpora/seals33) (text, Mandarin, English)
* [Glosbe](https://ai4commsci.gitbook.io/formosanbank/the-bank-architecture/corpora/glosbe) (text, Mandarin (Traditional), Mandarin (Simplified))
* [Whitehorn Collection](https://ai4commsci.gitbook.io/formosanbank/the-bank-architecture/corpora/whitehorn-collection) (audio)
* [Yedda Palemeq Blog](https://ai4commsci.gitbook.io/formosanbank/the-bank-architecture/corpora/yeddapalemeqblog) (text, English, segmentation, audio)
* [Siraya Gospels](https://ai4commsci.gitbook.io/formosanbank/the-bank-architecture/corpora/siraya-gospels) (text, English, Mandarin, Dutch)
* [Wilang Yutas Videos](https://ai4commsci.gitbook.io/formosanbank/the-bank-architecture/corpora/wilangyutasvideos) (text, Mandarin, audio)
* [Tang Recordings of Taroko](https://ai4commsci.gitbook.io/formosanbank/the-bank-architecture/corpora/tangrecordingsoftaroko) (audio)
* [NTU Corpus of Formosan Languages](https://github.com/FormosanBank/FormosanBankGitbook/blob/en-us/en-us/the-bank-architecture/corpora/NTUFormosanCorpus.md) (text, audio, glossing)

***

### Corpus Statistics

#### By language

|                           | Truku  | Amis      | Bunun   | Kavalan | Rukai   | Siraya | Paiwan  | Puyuma  | Thao    | Saaroa | Sakizaya  | Yami    | Atayal  | Seediq  | Tsou   | Kanakanabu | Saisiyat |
| ------------------------- | ------ | --------- | ------- | ------- | ------- | ------ | ------- | ------- | ------- | ------ | --------- | ------- | ------- | ------- | ------ | ---------- | -------- |
| Word count                | 86,274 | 2,122,310 | 248,279 | 127,162 | 273,600 | 42,107 | 398,682 | 229,064 | 101,856 | 63,097 | 1,445,857 | 119,240 | 747,479 | 970,779 | 88,102 | 108,814    | 101,203  |
| Total audio               | 42.2h  | 60.8h     | 62.7h   | 26.9h   | 77.1h   | 0      | 74.9h   | 59.9h   | 20.7h   | 23.0h  | 24.5h     | 20.3h   | 83.5h   | 46.1h   | 20.5h  | 32.6h      | 23.5h    |
| Transcribed               | 20.8h  | 60.6h     | 62.7h   | 26.9h   | 77.1h   | 0      | 62.4h   | 59.9h   | 20.7h   | 23.0h  | 24.5h     | 20.3h   | 77.3h   | 46.0h   | 20.5h  | 32.6h      | 23.5h    |
| Untranscribed             | 21.4h  | 0.2h      | 0       | 0       | 0       | 0      | 12.5h   | 0       | 0       | 0      | 0         | 0       | 6.2h    | 0.1h    | 0      | 0          | 0        |
| Translated sentences      |        |           |         |         |         |        |         |         |         |        |           |         |         |         |        |            |          |
| English                   | 2,187  | 13,870    | 13,759  | 4,544   | 15,175  | 1,951  | 12,357  | 8,687   | 2,155   | 2,183  | 3,105     | 3,062   | 13,575  | 8,135   | 3,132  | 4,598      | 3,497    |
| Mandarin                  | 11,995 | 47,146    | 43,009  | 18,666  | 51,172  | 1,947  | 32,953  | 35,349  | 13,902  | 11,902 | 14,850    | 14,712  | 46,182  | 28,402  | 11,518 | 17,402     | 14,820   |
| Morphologically segmented | 0      | 756       | 2,975   | 2,394   | 2,252   | 0      | 3,567   | 0       | 0       | 0      | 1,530     | 960     | 489     | 2,098   | 949    | 3,013      | 1,294    |
| Proportion glossed        | 0%     | 100%      | 100%    | 100%    | 100%    | 0%     | 82%     | 0%      | 0%      | 0%     | 100%      | 100%    | 100%    | 100%    | 100%   | 100%       | 100%     |
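
The audio rows in the table above appear to be related by simple arithmetic: transcribed plus untranscribed hours add up to total audio. The page does not state this explicitly, so the following is only a minimal sketch under that assumption. It uses figures copied from three columns of the table, and the helper names `is_consistent` and `transcribed_share` are illustrative, not part of any FormosanBank tooling.

```python
# Hours copied from the "By language" table for three languages
# that have some untranscribed audio.
audio_hours = {
    "Truku": {"total": 42.2, "transcribed": 20.8, "untranscribed": 21.4},
    "Paiwan": {"total": 74.9, "transcribed": 62.4, "untranscribed": 12.5},
    "Atayal": {"total": 83.5, "transcribed": 77.3, "untranscribed": 6.2},
}


def is_consistent(row, tolerance=0.1):
    """Check that transcribed + untranscribed matches total audio.

    Assumption: the table rounds to one decimal place, so a difference
    of up to 0.1h is treated as rounding noise.
    """
    difference = row["transcribed"] + row["untranscribed"] - row["total"]
    return abs(difference) <= tolerance + 1e-9


def transcribed_share(row):
    """Return the fraction of the total audio that has a transcription."""
    return row["transcribed"] / row["total"]


for language, row in audio_hours.items():
    status = "consistent" if is_consistent(row) else "inconsistent"
    print(f"{language}: {transcribed_share(row):.0%} transcribed ({status})")
```

The same check can be applied to any other column by adding its three values to `audio_hours`.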

#### By dialect

|                           | <p>Amis<br>Coastal</p> | <p>Amis<br>Hengchun</p> | <p>Amis<br>Malan</p> | <p>Amis<br>Southern</p> | <p>Amis<br>UK</p> | <p>Amis<br>Xiuguluan</p> | <p>Bunun<br>Junqun</p> | <p>Bunun<br>Kaqun</p> | <p>Bunun<br>Luanqun</p> | <p>Bunun<br>Tanqun</p> | <p>Bunun<br>Zhuoqun</p> | <p>Rukai<br>Dawu</p> | <p>Rukai<br>Dona</p> | <p>Rukai<br>Eastern</p> | <p>Rukai<br>Maolin</p> | <p>Rukai<br>Wanshan</p> | <p>Rukai<br>Wutai</p> | <p>Paiwan<br>Central</p> | <p>Paiwan<br>Eastern</p> | <p>Paiwan<br>North Western</p> | <p>Paiwan<br>Northern</p> | <p>Paiwan<br>Sothern</p> | <p>Paiwan<br>Southern</p> | <p>Paiwan<br>UK</p> | <p>Puyuma<br>Jianhe</p> | <p>Puyuma<br>Nanwang</p> | <p>Puyuma<br>Xiqun</p> | <p>Puyuma<br>Zhiben</p> | <p>Atayal<br>FourSeasons</p> | <p>Atayal<br>Sekolik</p> | <p>Atayal<br>UK</p> | <p>Atayal<br>Wanda</p> | <p>Atayal<br>Wenshui</p> | <p>Atayal<br>YilanZeaol</p> | <p>Atayal<br>Zeaol</p> | <p>Seediq<br>DeluValley</p> | <p>Seediq<br>Duda</p> | <p>Seediq<br>Tegudaya</p> | <p>Seediq<br>UK</p> |
| ------------------------- | ---------------------- | ----------------------- | -------------------- | ----------------------- | ----------------- | ------------------------ | ---------------------- | --------------------- | ----------------------- | ---------------------- | ----------------------- | -------------------- | -------------------- | ----------------------- | ---------------------- | ----------------------- | --------------------- | ------------------------ | ------------------------ | ------------------------------ | ------------------------- | ------------------------ | ------------------------- | ------------------- | ----------------------- | ------------------------ | ---------------------- | ----------------------- | ---------------------------- | ------------------------ | ------------------- | ---------------------- | ------------------------ | --------------------------- | ---------------------- | --------------------------- | --------------------- | ------------------------- | ------------------- |
| Word count                | 66,343                 | 36,750                  | 36,137               | 36,328                  | 1,863,357         | 83,395                   | 129,475                | 29,654                | 30,595                  | 29,721                 | 28,834                  | 29,992               | 28,565               | 30,151                  | 27,693                 | 24,350                  | 132,849               | 41,624                   | 50,297                   | 0                              | 127,754                   | 0                        | 53,423                    | 125,584             | 35,383                  | 116,881                  | 37,157                 | 39,643                  | 38,172                       | 138,146                  | 417,534             | 31,056                 | 42,863                   | 40,229                      | 39,479                 | 39,861                      | 41,788                | 134,300                   | 754,830             |
| Total audio               | 14.5h                  | 10.2h                   | 8.4h                 | 9.2h                    | 0.2h              | 18.3h                    | 30.2h                  | 8.0h                  | 7.9h                    | 8.6h                   | 8.1h                    | 9.5h                 | 8.9h                 | 9.4h                    | 8.8h                   | 8.4h                    | 32.2h                 | 10.2h                    | 12.8h                    | 7.9h                           | 27.7h                     | 0.0h                     | 11.8h                     | 4.5h                | 9.4h                    | 29.8h                    | 9.9h                   | 10.8h                   | 9.2h                         | 34.2h                    | 0.5h                | 8.8h                   | 10.6h                    | 10.3h                       | 9.9h                   | 11.7h                       | 10.1h                 | 24.3h                     | 0.1h                |
| Transcribed               | 14.5h                  | 10.2h                   | 8.4h                 | 9.2h                    | 0                 | 18.3h                    | 30.2h                  | 8.0h                  | 7.9h                    | 8.6h                   | 8.1h                    | 9.5h                 | 8.9h                 | 9.4h                    | 8.8h                   | 8.4h                    | 32.2h                 | 10.2h                    | 12.8h                    | 0                              | 27.7h                     | 0                        | 11.7h                     | 0                   | 9.4h                    | 29.8h                    | 9.9h                   | 10.8h                   | 9.2h                         | 28.5h                    | 0                   | 8.8h                   | 10.6h                    | 10.3h                       | 9.9h                   | 11.7h                       | 10.1h                 | 24.3h                     | 0                   |
| Untranscribed             | 0                      | 0                       | 0                    | 0                       | 0.2h              | 0                        | 0                      | 0                     | 0                       | 0                      | 0                       | 0                    | 0                    | 0                       | 0                      | 0                       | 0                     | 0                        | 0                        | 7.9h                           | 0                         | 0.0h                     | 0.1h                      | 4.5h                | 0                       | 0                        | 0                      | 0                       | 0                            | 5.7h                     | 0.5h                | 0                      | 0                        | 0                           | 0                      | 0                           | 0                     | 0                         | 0.1h                |
| Translated sentences      |                        |                         |                      |                         |                   |                          |                        |                       |                         |                        |                         |                      |                      |                         |                        |                         |                       |                          |                          |                                |                           |                          |                           |                     |                         |                          |                        |                         |                              |                          |                     |                        |                          |                             |                        |                             |                       |                           |                     |
| English                   | 2,910                  | 2,155                   | 2,144                | 2,158                   | 40                | 4,463                    | 5,132                  | 2,154                 | 2,163                   | 2,154                  | 2,156                   | 2,151                | 2,154                | 2,149                   | 2,158                  | 2,150                   | 4,413                 | 2,721                    | 2,400                    | 0                              | 3,197                     | 0                        | 3,968                     | 71                  | 2,153                   | 2,150                    | 2,229                  | 2,155                   | 2,159                        | 2,309                    | 0                   | 2,154                  | 2,644                    | 2,160                       | 2,149                  | 2,159                       | 2,158                 | 3,802                     | 16                  |
| Mandarin                  | 8,276                  | 6,325                   | 6,276                | 6,076                   | 5,860             | 14,333                   | 18,880                 | 6,109                 | 6,142                   | 5,932                  | 5,946                   | 6,283                | 6,193                | 6,031                   | 6,149                  | 6,295                   | 20,221                | 6,467                    | 6,387                    | 0                              | 13,460                    | 0                        | 6,639                     | 0                   | 6,502                   | 15,618                   | 6,581                  | 6,648                   | 6,714                        | 14,404                   | 0                   | 5,896                  | 6,523                    | 6,276                       | 6,369                  | 6,367                       | 6,264                 | 15,743                    | 28                  |
| Morphologically segmented | 756                    | 0                       | 0                    | 0                       | 0                 | 0                        | 2,975                  | 0                     | 0                       | 0                      | 0                       | 0                    | 0                    | 0                       | 0                      | 0                       | 2,252                 | 537                      | 157                      | 0                              | 995                       | 0                        | 1,807                     | 71                  | 0                       | 0                        | 0                      | 0                       | 0                            | 0                        | 0                   | 0                      | 489                      | 0                           | 0                      | 0                           | 0                     | 2,098                     | 0                   |
| Proportion glossed        | 100%                   | 0%                      | 0%                   | 0%                      | 0%                | 0%                       | 100%                   | 0%                    | 0%                      | 0%                     | 0%                      | 0%                   | 0%                   | 0%                      | 0%                     | 0%                      | 100%                  | 100%                     | 100%                     | 0%                             | 100%                      | 0%                       | 64%                       | 100%                | 0%                      | 0%                       | 0%                     | 0%                      | 0%                           | 0%                       | 0%                  | 0%                     | 100%                     | 0%                          | 0%                     | 0%                          | 0%                    | 100%                      | 0%                  |

***

### Coming Soon

In addition to the corpora already incorporated into FormosanBank, we have obtained permission to include many more, which are currently being processed. These include:

* NTU Corpus (Various languages)
* Matthew's Gospel and John's Gospel (Siraya)
* The Sedik Language of Formosa by Erin Asai (Seediq)
* Chang's Seediq Reference Grammar (Seediq)
* Chang's Kavalan Reference Grammar (Kavalan)
* Poinsot Amis Dictionary (Amis)
* Moedict Amis (Amis)
* Wilang Yutas videos (Atayal)
* hala saku la (videos - Atayal)
* hala saku la (text - Atayal)
* Tung's Descriptive Study of Tsou (Tsou)
* Jeng (1992) Topic and focus in Bunun (Bunun)
* Blust's Thao Dictionary (Thao)
