FormosanBank
FormosanBank serves as a centralized repository for linguistic data across the 16 extant Formosan languages. The project aims to make these resources easily accessible for researchers, educators, and indigenous community members to facilitate research, education, and language preservation efforts. The materials in the corpus include:
Texts and Transcriptions: Digitized texts, transcribed audio, dictionaries, and reference grammars.
Audio Recordings: A range of spoken language materials, such as Indigenous news, talk shows, traditional stories, and interviews, covering diverse dialects and speaker demographics.
Annotated Corpora: Detailed linguistic annotations including word-level glosses, morpheme-level segmentation, phonological transcriptions, and translations in multiple languages.
Data collection is ongoing and encompasses a wide array of sources, such as dictionaries, historical documents, Indigenous media, and academic publications. The corpora currently being processed include materials like Amis and Paiwan YouTube videos, Indigenous talk shows, eBooks, and academic texts (see the current list of corpora).
FormosanBank uses a standardized XML format developed by the Pangloss Collection to ensure consistency across the corpus. This format organizes data hierarchically, representing texts, sentences, words, and morphemes, with each level annotated as needed. The structure allows for optional elements like translations and audio references, making the data suitable for both linguistic analysis and computational processing.
The format includes attributes for essential metadata, such as citations, copyright information, and language codes, ensuring transparency and proper attribution. By adopting a consistent format, FormosanBank facilitates data sharing, analysis, and integration with other linguistic tools.
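To make the hierarchy concrete, the following is a minimal sketch of how a file in such a format might be read with Python's standard library. The element names (TEXT, S, W, M, FORM, TRANSL, AUDIO) and the attribute names (citation, copyright, file, start, end) are assumptions made for illustration and may not match the actual FormosanBank schema; the sample content is placeholder text, not real corpus data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. Element and attribute names are assumptions,
# and the text content is placeholder material rather than real corpus data.
SAMPLE = """<TEXT id="demo_text" xml:lang="ami" citation="Example citation" copyright="CC BY 4.0">
  <S id="s1">
    <FORM>sentence form</FORM>
    <TRANSL xml:lang="en">sentence translation</TRANSL>
    <AUDIO file="demo.wav" start="0.0" end="2.5"/>
    <W id="s1w1">
      <FORM>word</FORM>
      <M id="s1w1m1"><FORM>morph</FORM><TRANSL xml:lang="en">gloss</TRANSL></M>
    </W>
  </S>
  <S id="s2">
    <FORM>second sentence</FORM>
  </S>
</TEXT>"""

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_text(xml_string):
    """Parse one text into a dict: text-level metadata plus a list of sentences."""
    root = ET.fromstring(xml_string)
    sentences = []
    for s in root.findall("S"):
        audio = s.find("AUDIO")  # optional element
        sentences.append({
            "id": s.get("id"),
            "form": s.findtext("FORM"),
            # Optional translations, keyed by language code.
            "translations": {t.get(XML_LANG): t.text for t in s.findall("TRANSL")},
            "audio": dict(audio.attrib) if audio is not None else None,
            # Optional word and morpheme levels.
            "words": [
                {
                    "form": w.findtext("FORM"),
                    "morphemes": [m.findtext("FORM") for m in w.findall("M")],
                }
                for w in s.findall("W")
            ],
        })
    return {
        "lang": root.get(XML_LANG),
        "citation": root.get("citation"),
        "copyright": root.get("copyright"),
        "sentences": sentences,
    }


doc = read_text(SAMPLE)
for sentence in doc["sentences"]:
    print(sentence["id"], sentence["form"], sentence["translations"])
```

Because translations, audio references, and the word and morpheme levels are all optional, the reader treats each of them as possibly absent, so a sentence carrying only a FORM still parses cleanly.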
Each resource in FormosanBank has its own copyright and licensing terms, with most materials made available under Creative Commons licenses to encourage reuse and sharing. Licensing details, including copyright holders and usage restrictions, are provided on the individual resource pages. This ensures transparency and respect for intellectual property rights while making the data as accessible as possible.
FormosanBank’s resources are organized to support both linguistic analysis and community use. The data is structured with:
Metadata for easy search and retrieval, including language, dialect, speaker information, and source details.
APIs and downloadable datasets to facilitate computational research and integration with other tools.
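As an illustration of metadata-based retrieval, the sketch below filters a small in-memory catalogue by the fields listed above. The Record class, its field names, the search function, and all sample values are hypothetical; this is not the FormosanBank API, only one simple way such a lookup could work over a downloaded dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical metadata record; field names are assumptions."""
    language: str
    dialect: str
    speaker: str
    source: str


def search(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]


# Placeholder catalogue entries, not real corpus metadata.
catalogue = [
    Record("Amis", "dialect-A", "speaker-01", "talk show"),
    Record("Amis", "dialect-B", "speaker-02", "eBook"),
    Record("Paiwan", "dialect-C", "speaker-03", "talk show"),
]

amis_talk_shows = search(catalogue, language="Amis", source="talk show")
print(amis_talk_shows)
```

Exact-match filtering keeps the example short; a real catalogue would likely also need normalized language codes and partial matching.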
The format's flexibility also allows researchers to annotate data at different levels of granularity, making FormosanBank a valuable resource for projects ranging from basic language documentation to advanced natural language processing.
FormosanBank is more than a research tool; it plays a crucial role in revitalizing endangered Formosan languages. By providing a well-organized and accessible collection of linguistic resources, FormosanBank supports:
Educational initiatives: Creating teaching materials and learning resources for use in schools and community programs.
Community-driven documentation projects: Assisting local communities in recording and preserving their languages.
Digital tool development: Enabling the creation of language technologies like speech recognition, machine translation, and language learning apps.
The project is continually evolving, with plans to expand the corpus, improve data quality, and incorporate new technologies for linguistic research. Collaboration with Indigenous communities, linguists, and technology experts remains a key component of FormosanBank’s growth, ensuring that the project remains community-oriented and culturally sensitive.