# Corpora

Welcome to the FormosanBank Corpora section! This section documents the corpora that make up FormosanBank. Each corpus is a distinct linguistic dataset and may include text, audio recordings, or both. The corpora are designed to support linguistic research, language education, and revitalization efforts, making these endangered languages accessible and analyzable for researchers, educators, and community members alike. The list below shows the corpora currently in FormosanBank, with the kinds of data each one provides:

* [ePark](/formosanbank/the-bank-architecture/corpora/epark.md) (text, audio, English, Mandarin)
* [ILRDF Dictionaries](/formosanbank/the-bank-architecture/corpora/ilrdf-dictionaries.md) (text, audio, Mandarin)
* [Wikipedias](/formosanbank/the-bank-architecture/corpora/wikipedias.md) (text)
* [Presidential Apologies](/formosanbank/the-bank-architecture/corpora/presidential-apologies.md) (text, English, Mandarin)
* [NTU Paiwan ASR](/formosanbank/the-bank-architecture/corpora/ntu-paiwan-asr.md) (text, audio)
* [Virginia Fey's Amis Dictionary](/formosanbank/the-bank-architecture/corpora/virginia-feys-amis-dictionary.md) (text, English, Mandarin)
* [Paiwan Stories](/formosanbank/the-bank-architecture/corpora/paiwan-stories.md) (text, audio, Mandarin)
* [Rau-Dong](/formosanbank/the-bank-architecture/corpora/raudong.md) (text, glossing, English, Mandarin)
* [The Montgomery Texts](/formosanbank/the-bank-architecture/corpora/montgomerytexts.md) (text, English)
* [The Wakelin Texts](/formosanbank/the-bank-architecture/corpora/wakelintexts.md) (text, English, segmentation)
* [100 Paiwan Texts](/formosanbank/the-bank-architecture/corpora/hundredpaiwantexts.md) (text, glossing, English)
* [FormosanBank GitBook](/formosanbank/the-bank-architecture/corpora/formosanbankgitbook.md)
* [SEALS 33](/formosanbank/the-bank-architecture/corpora/seals33.md) (text, Mandarin, English)
* [Glosbe](/formosanbank/the-bank-architecture/corpora/glosbe.md) (text, Mandarin (Traditional), Mandarin (Simplified))
* [Whitehorn Collection](/formosanbank/the-bank-architecture/corpora/whitehorn-collection.md) (audio)
* [Yedda Palemeq Blog](/formosanbank/the-bank-architecture/corpora/yeddapalemeqblog.md) (text, English, segmentation, audio)
* [Siraya Gospels](/formosanbank/the-bank-architecture/corpora/siraya-gospels.md) (text, English, Mandarin, Dutch)
* [Wilang Yutas Videos](/formosanbank/the-bank-architecture/corpora/wilangyutasvideos.md) (text, Mandarin, audio)
* [Tang Recordings of Taroko](/formosanbank/the-bank-architecture/corpora/tangrecordingsoftaroko.md) (audio)
* [NTU Corpus of Formosan Languages](/formosanbank/the-bank-architecture/corpora/ntuformosancorpus.md) (text, audio, glossing)

***

### Corpus Statistics

#### By language

|                           | Truku  | Amis      | Bunun   | Kavalan | Rukai   | Siraya | Paiwan  | Puyuma  | Thao    | Saaroa | Sakizaya  | Yami    | Atayal  | Seediq  | Tsou   | Kanakanabu | Saisiyat |
| ------------------------- | ------ | --------- | ------- | ------- | ------- | ------ | ------- | ------- | ------- | ------ | --------- | ------- | ------- | ------- | ------ | ---------- | -------- |
| Word count                | 86,274 | 2,122,310 | 248,279 | 127,162 | 273,600 | 42,107 | 398,682 | 229,064 | 101,856 | 63,097 | 1,445,857 | 119,240 | 747,479 | 970,779 | 88,102 | 108,814    | 101,203  |
| Total audio               | 42.2h  | 60.8h     | 62.7h   | 26.9h   | 77.1h   | 0      | 74.9h   | 59.9h   | 20.7h   | 23.0h  | 24.5h     | 20.3h   | 77.8h   | 46.1h   | 20.5h  | 32.6h      | 23.5h    |
| Transcribed               | 20.8h  | 60.6h     | 62.7h   | 26.9h   | 77.1h   | 0      | 62.4h   | 59.9h   | 20.7h   | 23.0h  | 24.5h     | 20.3h   | 77.3h   | 46.0h   | 20.5h  | 32.6h      | 23.5h    |
| Untranscribed             | 21.4h  | 0.2h      | 0       | 0       | 0       | 0      | 12.5h   | 0       | 0       | 0      | 0         | 0       | 0.5h    | 0.1h    | 0      | 0          | 0        |
| Translated words          |        |           |         |         |         |        |         |         |         |        |           |         |         |         |        |            |          |
| English                   | 11,266 | 66,155    | 64,527  | 27,376  | 63,016  | 42,107 | 71,621  | 38,691  | 11,158  | 8,763  | 19,793    | 25,503  | 64,498  | 48,343  | 21,440 | 24,222     | 21,594   |
| Mandarin                  | 86,274 | 350,450   | 248,165 | 127,147 | 273,000 | 42,006 | 208,926 | 228,962 | 101,853 | 63,097 | 101,322   | 118,393 | 306,079 | 216,156 | 88,098 | 108,796    | 101,019  |
| Morphologically segmented | 0      | 7,561     | 22,867  | 16,247  | 17,893  | 0      | 30,096  | 0       | 0       | 0      | 12,975    | 14,145  | 5,855   | 21,937  | 10,117 | 18,687     | 10,816   |
| Glossed words             | 0      | 7,561     | 22,867  | 16,247  | 17,893  | 0      | 24,483  | 0       | 0       | 0      | 12,975    | 14,145  | 5,855   | 21,937  | 10,117 | 18,687     | 10,816   |
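The audio rows in the table above appear to be related: for each language, the transcribed and untranscribed hours add up to the total. The page does not state this relationship explicitly, so the following is a minimal sketch that checks it for a few languages, with figures copied by hand from the table. The names `AUDIO_HOURS`, `transcribed_share`, and `is_consistent` are hypothetical and are not part of any FormosanBank tooling.

```python
# Hours of audio per language, copied by hand from the "By language" table.
# Tuple layout: (total, transcribed, untranscribed)
AUDIO_HOURS = {
    "Truku": (42.2, 20.8, 21.4),
    "Amis": (60.8, 60.6, 0.2),
    "Paiwan": (74.9, 62.4, 12.5),
    "Atayal": (77.8, 77.3, 0.5),
    "Seediq": (46.1, 46.0, 0.1),
}


def transcribed_share(total, transcribed):
    """Return the fraction of audio that is transcribed (0.0 if there is no audio)."""
    return transcribed / total if total else 0.0


def is_consistent(total, transcribed, untranscribed, tolerance=0.15):
    """Check that the two parts add up to the total.

    Assumption: each published figure is rounded to 0.1h, so the sum may
    drift slightly; the tolerance allows for that rounding.
    """
    return abs(total - (transcribed + untranscribed)) <= tolerance


for language, (total, transcribed, untranscribed) in AUDIO_HOURS.items():
    share = transcribed_share(total, transcribed)
    ok = is_consistent(total, transcribed, untranscribed)
    print(f"{language}: {share:.1%} transcribed, rows consistent: {ok}")
```

The same check could be applied to the "By dialect" table, where several dialect columns have audio but no transcription.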

#### By dialect

|                           | <p>Amis<br>Coastal</p> | <p>Amis<br>Hengchun</p> | <p>Amis<br>Malan</p> | <p>Amis<br>NA</p> | <p>Amis<br>Southern</p> | <p>Amis<br>UK</p> | <p>Amis<br>Xiuguluan</p> | <p>Bunun<br>Junqun</p> | <p>Bunun<br>Kaqun</p> | <p>Bunun<br>Luanqun</p> | <p>Bunun<br>Tanqun</p> | <p>Bunun<br>Zhuoqun</p> | <p>Rukai<br>Dawu</p> | <p>Rukai<br>Dona</p> | <p>Rukai<br>Eastern</p> | <p>Rukai<br>Maolin</p> | <p>Rukai<br>Wanshan</p> | <p>Rukai<br>Wutai</p> | <p>Paiwan<br>Central</p> | <p>Paiwan<br>Eastern</p> | <p>Paiwan<br>NA</p> | <p>Paiwan<br>North Western</p> | <p>Paiwan<br>Northern</p> | <p>Paiwan<br>Sothern</p> | <p>Paiwan<br>Southern</p> | <p>Paiwan<br>UK</p> | <p>Puyuma<br>Jianhe</p> | <p>Puyuma<br>Nanwang</p> | <p>Puyuma<br>Xiqun</p> | <p>Puyuma<br>Zhiben</p> | <p>Atayal<br>FourSeasons</p> | <p>Atayal<br>NA</p> | <p>Atayal<br>Sekolik</p> | <p>Atayal<br>UK</p> | <p>Atayal<br>Wanda</p> | <p>Atayal<br>Wenshui</p> | <p>Atayal<br>YilanZeaol</p> | <p>Atayal<br>Zeaol</p> | <p>Seediq<br>DeluValley</p> | <p>Seediq<br>Duda</p> | <p>Seediq<br>NA</p> | <p>Seediq<br>Tegudaya</p> | <p>Seediq<br>UK</p> |
| ------------------------- | ---------------------- | ----------------------- | -------------------- | ----------------- | ----------------------- | ----------------- | ------------------------ | ---------------------- | --------------------- | ----------------------- | ---------------------- | ----------------------- | -------------------- | -------------------- | ----------------------- | ---------------------- | ----------------------- | --------------------- | ------------------------ | ------------------------ | ------------------- | ------------------------------ | ------------------------- | ------------------------ | ------------------------- | ------------------- | ----------------------- | ------------------------ | ---------------------- | ----------------------- | ---------------------------- | ------------------- | ------------------------ | ------------------- | ---------------------- | ------------------------ | --------------------------- | ---------------------- | --------------------------- | --------------------- | ------------------- | ------------------------- | ------------------- |
| Word count                | 66,343                 | 36,750                  | 36,137               | 1,863,357         | 36,328                  | 0                 | 83,395                   | 129,475                | 29,654                | 30,595                  | 29,721                 | 28,834                  | 29,992               | 28,565               | 30,151                  | 27,693                 | 24,350                  | 132,849               | 41,624                   | 50,297                   | 124,479             | 0                              | 127,754                   | 0                        | 53,423                    | 1,105               | 35,383                  | 116,881                  | 37,157                 | 39,643                  | 38,172                       | 417,534             | 138,146                  | 0                   | 31,056                 | 42,863                   | 40,229                      | 39,479                 | 39,861                      | 41,788                | 754,830             | 134,300                   | 0                   |
| Total audio               | 14.5h                  | 10.2h                   | 8.4h                 | 0                 | 9.2h                    | 0.2h              | 18.3h                    | 30.2h                  | 8.0h                  | 7.9h                    | 8.6h                   | 8.1h                    | 9.5h                 | 8.9h                 | 9.4h                    | 8.8h                   | 8.4h                    | 32.2h                 | 10.2h                    | 12.8h                    | 0                   | 7.9h                           | 27.7h                     | 0.0h                     | 11.8h                     | 4.5h                | 9.4h                    | 29.8h                    | 9.9h                   | 10.8h                   | 9.2h                         | 0                   | 28.5h                    | 0.5h                | 8.8h                   | 10.6h                    | 10.3h                       | 9.9h                   | 11.7h                       | 10.1h                 | 0                   | 24.3h                     | 0.1h                |
| Transcribed               | 14.5h                  | 10.2h                   | 8.4h                 | 0                 | 9.2h                    | 0                 | 18.3h                    | 30.2h                  | 8.0h                  | 7.9h                    | 8.6h                   | 8.1h                    | 9.5h                 | 8.9h                 | 9.4h                    | 8.8h                   | 8.4h                    | 32.2h                 | 10.2h                    | 12.8h                    | 0                   | 0                              | 27.7h                     | 0                        | 11.7h                     | 0                   | 9.4h                    | 29.8h                    | 9.9h                   | 10.8h                   | 9.2h                         | 0                   | 28.5h                    | 0                   | 8.8h                   | 10.6h                    | 10.3h                       | 9.9h                   | 11.7h                       | 10.1h                 | 0                   | 24.3h                     | 0                   |
| Untranscribed             | 0                      | 0                       | 0                    | 0                 | 0                       | 0.2h              | 0                        | 0                      | 0                     | 0                       | 0                      | 0                       | 0                    | 0                    | 0                       | 0                      | 0                       | 0                     | 0                        | 0                        | 0                   | 7.9h                           | 0                         | 0.0h                     | 0.1h                      | 4.5h                | 0                       | 0                        | 0                      | 0                       | 0                            | 0                   | 0                        | 0.5h                | 0                      | 0                        | 0                           | 0                      | 0                           | 0                     | 0                   | 0                         | 0.1h                |
| Translated words          |                        |                         |                      |                   |                         |                   |                          |                        |                       |                         |                        |                         |                      |                      |                         |                        |                         |                       |                          |                          |                     |                                |                           |                          |                           |                     |                         |                          |                        |                         |                              |                     |                          |                     |                        |                          |                             |                        |                             |                       |                     |                           |                     |
| English                   | 17,245                 | 9,440                   | 9,306                | 358               | 10,325                  | 0                 | 19,481                   | 30,929                 | 8,113                 | 8,824                   | 8,121                  | 8,540                   | 7,573                | 7,949                | 8,185                   | 7,318                  | 6,526                   | 25,465                | 14,034                   | 12,357                   | 0                   | 0                              | 19,011                    | 0                        | 25,114                    | 1,105               | 9,224                   | 9,436                    | 9,889                  | 10,142                  | 9,457                        | 0                   | 10,216                   | 0                   | 9,144                  | 16,467                   | 9,497                       | 9,717                  | 9,343                       | 10,020                | 136                 | 28,844                    | 0                   |
| Mandarin                  | 66,343                 | 36,750                  | 36,137               | 90,631            | 35,995                  | 0                 | 84,594                   | 129,383                | 29,654                | 30,595                  | 29,721                 | 28,812                  | 29,992               | 28,247               | 30,115                  | 27,529                 | 24,299                  | 132,818               | 34,761                   | 38,228                   | 0                   | 0                              | 97,864                    | 0                        | 38,073                    | 0                   | 35,365                  | 116,805                  | 37,149                 | 39,643                  | 38,138                       | 0                   | 114,367                  | 0                   | 31,008                 | 42,858                   | 40,229                      | 39,479                 | 39,740                      | 41,723                | 402                 | 134,291                   | 0                   |
| Morphologically segmented | 7,561                  | 0                       | 0                    | 0                 | 0                       | 0                 | 0                        | 22,867                 | 0                     | 0                       | 0                      | 0                       | 0                    | 0                    | 0                       | 0                      | 0                       | 17,893                | 3,727                    | 1,198                    | 0                   | 0                              | 9,007                     | 0                        | 15,059                    | 1,105               | 0                       | 0                        | 0                      | 0                       | 0                            | 0                   | 0                        | 0                   | 0                      | 5,855                    | 0                           | 0                      | 0                           | 0                     | 0                   | 21,937                    | 0                   |
| Glossed words             | 7,561                  | 0                       | 0                    | 0                 | 0                       | 0                 | 0                        | 22,867                 | 0                     | 0                       | 0                      | 0                       | 0                    | 0                    | 0                       | 0                      | 0                       | 17,893                | 3,727                    | 1,198                    | 0                   | 0                              | 9,007                     | 0                        | 9,446                     | 1,105               | 0                       | 0                        | 0                      | 0                       | 0                            | 0                   | 0                        | 0                   | 0                      | 5,855                    | 0                           | 0                      | 0                           | 0                     | 0                   | 21,937                    | 0                   |

***

### Coming Soon

In addition to the wide range of corpora already incorporated into FormosanBank, a large number of further corpora are currently being processed. Permission to include them in FormosanBank has already been obtained. Some of these corpora are listed below:

* NTU Corpus (Various languages)
* Matthew's Gospel and John's Gospel (Siraya)
* The Sedik Language of Formosa by Erin Asai (Seediq)
* Chang's Seediq Reference Grammar (Seediq)
* Chang's Kavalan Reference Grammar (Kavalan)
* Poinsot Amis Dictionary (Amis)
* Moedict Amis (Amis)
* Asai's Seediq Language of Formosan (Seediq)
* Wilang Yutas videos (Atayal)
* hala saku la (videos - Atayal)
* hala saku la (text - Atayal)
* Tung's Descriptive study of Tsou (Tsou)
* Jeng (1992) Topic and focus in Bunun (Bunun)
* Blust's Thao Dictionary (Thao)


