# Corpora

Welcome to the FormosanBank Corpora section! Here, you’ll find comprehensive documentation of the corpora used in FormosanBank. Each corpus in this collection is a distinct linguistic dataset, encompassing various types of text and audio recordings. Our corpora are designed to support linguistic research, language education, and revitalization efforts, making these endangered languages accessible and analyzable for researchers, educators, and community members alike. The corpora currently included in FormosanBank are listed below:

* [ePark](/formosanbank/paiwan/the-bank-architecture/corpora/epark.md)
* [ILRDF Dictionaries](/formosanbank/paiwan/the-bank-architecture/corpora/ilrdf-dictionaries.md)
* [Wikipedias](/formosanbank/paiwan/the-bank-architecture/corpora/wikipedias.md)
* [Presidential Apologies](/formosanbank/paiwan/the-bank-architecture/corpora/presidential-apologies.md)
* [NTU Paiwan ASR](/formosanbank/paiwan/the-bank-architecture/corpora/ntu-paiwan-asr.md)
* [Virginia Fey's Amis Dictionary](/formosanbank/paiwan/the-bank-architecture/corpora/virginia-feys-amis-dictionary.md)
* [Paiwan Stories](/formosanbank/paiwan/the-bank-architecture/corpora/paiwan-stories.md)

***

### Coming Soon

In addition to the wide range of corpora already incorporated into FormosanBank, we have obtained permission to include many further corpora, which are currently being processed. Some of them are listed below:

* NTU Corpus (Various languages)
* Glosbe (Amis, Truku, Atayal, Saisiyat)
* Xuan's books (Paiwan)
* Amis Texts - Montgomery (Amis)
* Wakelin (1958) Yami texts (Yami)
* Matthew's Gospel and John's Gospel (Siraya)
* 100 Paiwan Texts (Paiwan)
* The Sedik Language of Formosa by Erin Asai (Seediq)
* Chang's Seediq Reference Grammar (Seediq)
* Chang's Kavalan reference grammar (Kavalan)
* Poinsot Amis Dictionary (Amis)
* Moedict Amis (Amis)
* Whitehorn Collection (Paiwan, Amis, Atayal)
* Asai's Seediq Language of Formosan (Seediq)
* Wilang Yutas videos (Atayal)
* hala saku la (videos - Atayal)
* hala saku la (text - Atayal)
* Tung's Descriptive study of Tsou (Tsou)
* Jeng (1992) Topic and focus in Bunun (Bunun)
* Blust's Thao Dictionary (Thao)
* Rau & Dong (2006) Yami texts with reference grammar and dictionary (Yami)

***

### Current Numbers

FormosanBank currently contains over 8 million tokens (precisely **8,075,594**) as well as **731 hours and 40 minutes** of recorded audio across the 16 Formosan languages. The tables below break down the most up-to-date token count by language and by corpus, and the recorded audio by language.

#### Token Count per Language

| **Language** | **Tokens** |
| ------------ | ---------- |
| Amis         | 2,213,003  |
| Atayal       | 907,763    |
| Paiwan       | 492,056    |
| Bunun        | 318,422    |
| Puyuma       | 340,520    |
| Rukai        | 358,879    |
| Tsou         | 99,694     |
| Saisiyat     | 109,512    |
| Yami         | 128,404    |
| Thao         | 121,970    |
| Kavalan      | 132,412    |
| Truku        | 115,948    |
| Sakizaya     | 1,504,757  |
| Seediq       | 1,044,350  |
| Saaroa       | 79,458     |
| Kanakanavu   | 108,446    |

***

#### Token Count per Corpus

| **Source**              | **Tokens**           |
| ----------------------- | -------------------- |
| Virginia Fey Dictionary | 9,078 (Amis only)    |
| ILRDF Dictionaries      | 659,295              |
| NTU Paiwan ASR          | 68,332 (Paiwan only) |
| Presidential Apologies  | 29,793               |
| Paiwan Stories          | 556 (Paiwan only)    |
| Wikipedias              | 4,628,365            |
| ePark                   | 2,680,175            |
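
The two token tables above describe the same data from different angles, so their totals should agree with the overall figure of 8,075,594. The following is a minimal Python sketch of that cross-check. The numbers are copied from the tables. The variable names are illustrative assumptions and are not part of any FormosanBank tooling.

```python
# Token counts copied from the "Token Count per Language" table.
tokens_by_language = {
    "Amis": 2_213_003, "Atayal": 907_763, "Paiwan": 492_056,
    "Bunun": 318_422, "Puyuma": 340_520, "Rukai": 358_879,
    "Tsou": 99_694, "Saisiyat": 109_512, "Yami": 128_404,
    "Thao": 121_970, "Kavalan": 132_412, "Truku": 115_948,
    "Sakizaya": 1_504_757, "Seediq": 1_044_350, "Saaroa": 79_458,
    "Kanakanavu": 108_446,
}

# Token counts copied from the "Token Count per Corpus" table.
tokens_by_corpus = {
    "Virginia Fey Dictionary": 9_078,
    "ILRDF Dictionaries": 659_295,
    "NTU Paiwan ASR": 68_332,
    "Presidential Apologies": 29_793,
    "Paiwan Stories": 556,
    "Wikipedias": 4_628_365,
    "ePark": 2_680_175,
}

total_by_language = sum(tokens_by_language.values())
total_by_corpus = sum(tokens_by_corpus.values())

print(f"Total by language: {total_by_language:,}")  # 8,075,594
print(f"Total by corpus:   {total_by_corpus:,}")    # 8,075,594

# Share of the overall token count held by the largest language.
largest = max(tokens_by_language, key=tokens_by_language.get)
share = tokens_by_language[largest] / total_by_language
print(f"Largest language: {largest} ({share:.1%})")  # Amis (27.4%)
```

Both sums come to 8,075,594, which matches the headline figure.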

#### Recorded Audio per Language

| Language   | Duration                       |
| ---------- | ------------------------------ |
| Amis       | 72 hours 32 minutes 18 seconds |
| Atayal     | 87 hours 4 minutes 2 seconds   |
| Paiwan     | 72 hours 1 minute 24 seconds   |
| Bunun      | 71 hours 13 minutes 28 seconds |
| Puyuma     | 71 hours 8 minutes 6 seconds   |
| Rukai      | 88 hours 19 minutes 41 seconds |
| Tsou       | 21 hours 49 minutes 36 seconds |
| Saisiyat   | 23 hours 11 minutes 33 seconds |
| Yami       | 21 hours 41 minutes 46 seconds |
| Thao       | 21 hours 49 minutes 28 seconds |
| Kavalan    | 25 hours 41 minutes 1 second   |
| Truku      | 22 hours 41 minutes 47 seconds |
| Sakizaya   | 24 hours 39 minutes 49 seconds |
| Seediq     | 49 hours 46 minutes 30 seconds |
| Saaroa     | 26 hours 23 minutes 26 seconds |
| Kanakanavu | 31 hours 36 minutes 54 seconds |
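
The per-language durations above can be added up to confirm the headline figure of 731 hours and 40 minutes. The sketch below uses Python's standard `datetime.timedelta`. The durations are copied from the table. The dictionary name and layout are illustrative assumptions.

```python
from datetime import timedelta

# (hours, minutes, seconds) copied from the "Recorded Audio per Language" table.
audio_by_language = {
    "Amis": (72, 32, 18), "Atayal": (87, 4, 2), "Paiwan": (72, 1, 24),
    "Bunun": (71, 13, 28), "Puyuma": (71, 8, 6), "Rukai": (88, 19, 41),
    "Tsou": (21, 49, 36), "Saisiyat": (23, 11, 33), "Yami": (21, 41, 46),
    "Thao": (21, 49, 28), "Kavalan": (25, 41, 1), "Truku": (22, 41, 47),
    "Sakizaya": (24, 39, 49), "Seediq": (49, 46, 30), "Saaroa": (26, 23, 26),
    "Kanakanavu": (31, 36, 54),
}

# Sum all durations, starting from a zero-length timedelta.
total = sum(
    (timedelta(hours=h, minutes=m, seconds=s)
     for h, m, s in audio_by_language.values()),
    timedelta(),
)

# Convert back to hours, minutes, and seconds for display.
total_seconds = int(total.total_seconds())
hours, remainder = divmod(total_seconds, 3600)
minutes, seconds = divmod(remainder, 60)

print(f"{hours} hours {minutes} minutes {seconds} seconds")
# 731 hours 40 minutes 49 seconds
```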

