
NTU Paiwan ASR


Last updated 3 months ago

Overview

The NTU Paiwan ASR Corpus is a linguistically rich dataset created as part of the effort to enrich FormosanBank. It was developed in collaboration with Dr. Sung of National Taiwan University, one of FormosanBank's main collaborators, and represents a pioneering contribution.

This corpus includes mostly read speech aligned with its corresponding text in addition to a small collection of spontaneous speech recordings. It is designed to support language preservation, linguistic research, and, notably, the development of automated speech recognition (ASR) tools tailored specifically for Paiwan.

Content Summary

  • Language Covered: Paiwan

  • Corpus Content:

    • Texts with aligned audio for read speech

    • Spontaneous speech recordings

  • Data Contributors: Six speakers (pseudonyms used to maintain anonymity)

Data Summary

| Metric | Read Speech | Spontaneous Speech | Total |
| --- | --- | --- | --- |
| Topics | 11 | 51 | 62 |
| Recordings | 16 | 98 | 114 |
| Duration (h:mm:ss) | 1:24:10 | 9:10:33 | 10:34:43 |

Data Breakdown by Participant

The table below lists, for each speaker, the number of recordings available and the total duration of their speech. As mentioned above, participants' names are pseudonyms used to protect their privacy.

| Speaker | Number of Recordings (N) | Total Duration (h:mm:ss) |
| --- | --- | --- |
| Loris | 23 | 2:05:16 |
| Zendar | 35 | 3:01:01 |
| Nira | 30 | 2:29:31 |
| Belmira | 8 | 1:00:57 |
| Falin | 7 | 0:51:48 |
| Sarnix | 11 | 1:06:10 |
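
The per-speaker figures can be checked against the totals in the Data Summary with a little arithmetic. The sketch below is only an illustration: the numbers are copied from the table above, while the helper names `parse_hms` and `format_hms` are hypothetical and not part of FormosanBank.

```python
from datetime import timedelta


def parse_hms(text: str) -> timedelta:
    """Parse an h:mm:ss string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def format_hms(delta: timedelta) -> str:
    """Format a timedelta back into an h:mm:ss string."""
    total = int(delta.total_seconds())
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Per-speaker figures copied from the table: (recordings, duration).
SPEAKERS = {
    "Loris": (23, "2:05:16"),
    "Zendar": (35, "3:01:01"),
    "Nira": (30, "2:29:31"),
    "Belmira": (8, "1:00:57"),
    "Falin": (7, "0:51:48"),
    "Sarnix": (11, "1:06:10"),
}

total_recordings = sum(count for count, _ in SPEAKERS.values())
total_duration = sum(
    (parse_hms(duration) for _, duration in SPEAKERS.values()), timedelta()
)

print(total_recordings)            # 114
print(format_hms(total_duration))  # 10:34:43
```

Both results match the Total column of the Data Summary (114 recordings, 10:34:43).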


Corpus Processing

  1. XML Conversion

    • The transcribed text files were converted into XML files, ensuring compatibility with other corpora in FormosanBank.

    • Each topic for each participant is represented by one file, and each file contains the following elements:

      • <FORM>: Containing the original sentence in Paiwan

      • <AUDIO>: Name of the audio file associated with the sentence

  2. Cleaning and Standardization

    • XML files underwent further cleaning to remove empty elements and standardize punctuation and orthography.

    • HTML escape codes were replaced with the corresponding characters, and a kindOf attribute was added to <FORM> elements. The original text is kept in <FORM kindOf="original">, and any text standardized from this point on is stored in <FORM kindOf="standard">.

  3. Quality Control

    • Several QC steps were performed, including:

      • Cross-referencing word lists and API results to ensure all entries were successfully processed.

      • Validation of XML structure to confirm compliance with the FormosanBank schema.

      • Manual checks of random samples to verify data integrity.
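
The conversion and cleaning steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the `<FORM>` and `<AUDIO>` elements and the `kindOf` values come from the description above, but the `TEXT` root, the `S` sentence wrapper, the `build_text` and `standardize` helpers, and the placeholder input are all hypothetical. The real standardization rules are not given here, so `standardize` only normalizes Unicode and collapses whitespace.

```python
import html
import unicodedata
import xml.etree.ElementTree as ET


def standardize(text: str) -> str:
    """Placeholder standardization: NFC-normalize and collapse whitespace.

    The real orthography rules are not described in this document.
    """
    return " ".join(unicodedata.normalize("NFC", text).split())


def build_text(sentences):
    """Build one XML tree from (sentence, audio_file_name) pairs."""
    root = ET.Element("TEXT")  # assumed root element name
    for form_text, audio_name in sentences:
        # Replace HTML escape codes with the corresponding characters.
        form_text = html.unescape(form_text).strip()
        if not form_text:
            continue  # drop entries that would produce empty elements
        sent = ET.SubElement(root, "S")  # assumed per-sentence wrapper
        original = ET.SubElement(sent, "FORM", {"kindOf": "original"})
        original.text = form_text
        standard = ET.SubElement(sent, "FORM", {"kindOf": "standard"})
        standard.text = standardize(form_text)
        audio = ET.SubElement(sent, "AUDIO")
        audio.text = audio_name
    return root


# Placeholder strings, not real Paiwan data.
root = build_text([
    ("aaa &quot;bbb&quot;  ccc", "clip_001.wav"),
    ("   ", "clip_002.wav"),  # empty after stripping, so it is skipped
])
print(ET.tostring(root, encoding="unicode"))
```

Keeping both `kindOf="original"` and `kindOf="standard"` forms side by side means later standardization passes never overwrite the transcription as it was received.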


Applications

The NTU Paiwan ASR Corpus is a crucial resource for advancing research and technology in several key areas:

  • Automated Speech Recognition (ASR): The aligned text and audio make this corpus particularly valuable for developing ASR systems tailored to Paiwan, enabling future tools like transcription software and voice interfaces in the language.

  • Language Revitalization: Supports the creation of educational materials and language-learning resources for Paiwan speakers and learners.

  • Speech Technology Development: Facilitates tools such as text-to-speech systems and pronunciation modeling.

  • Linguistic Analysis: Enables studies on syntax, phonology, discourse structures, and other linguistic phenomena.

  • Comparative Studies: Supports research on shared features and differences among Formosan languages.
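
For the ASR use case, a typical first step is to pair each audio file with its transcript. The sketch below shows one way this might be done, assuming a file layout in which each sentence is an `S` element containing `<FORM>` and `<AUDIO>` children with the file name as the element text. The `S` and `TEXT` names, the `iter_pairs` helper, and the sample content are assumptions for illustration only; the FormosanBank XML Format documentation describes the real structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with placeholder text, not real corpus data.
SAMPLE = """<TEXT>
  <S>
    <FORM kindOf="original">aaa  bbb</FORM>
    <FORM kindOf="standard">aaa bbb</FORM>
    <AUDIO>clip_001.wav</AUDIO>
  </S>
  <S>
    <FORM kindOf="original">ccc</FORM>
    <AUDIO>clip_002.wav</AUDIO>
  </S>
</TEXT>"""


def iter_pairs(root):
    """Yield (audio_file_name, transcript) for sentences that have both."""
    for sent in root.iter("S"):
        audio = sent.findtext("AUDIO")
        # Prefer the standardized form and fall back to the original one.
        text = (sent.findtext("FORM[@kindOf='standard']")
                or sent.findtext("FORM[@kindOf='original']"))
        if audio and text:
            yield audio.strip(), text.strip()


pairs = list(iter_pairs(ET.fromstring(SAMPLE)))
for audio_name, transcript in pairs:
    print(f"{audio_name}\t{transcript}")
```

The resulting tab-separated lines resemble the manifest files that many ASR toolkits accept, although the exact format depends on the toolkit.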


Access Details


Copyright

Because this corpus was collected as part of FormosanBank, it is subject to the same copyright restrictions that govern FormosanBank (see below).


Acknowledgments

This corpus was developed through a collaborative effort led by Dr. Sung at the NTU Graduate Institute of Linguistics. The project is part of the broader FormosanBank initiative and would not have been possible without the contributions of Paiwan-speaking participants. Special thanks to all collaborators and the broader Paiwan community for their support and engagement in this vital preservation effort.


Citation

  • Le Ferrand, É., Prud'hommeaux, E., Hartshorne, J. K., & Sung, L.-M. (2024). NTU Paiwan ASR Corpus. Electronic Resource.

The data was processed thoroughly to ensure compatibility with FormosanBank standards while preserving accuracy and usability. The corpus collected by Dr. Sung was structured by associating each topic with its corresponding audio recordings and transcriptions (for read speech) for each speaker. This data was then converted into the FormosanBank XML Format. The Corpus Processing section above describes the process in detail.

Quality control also included a character frequency analysis to spot any abnormalities in the orthography, using the official orthography and the ILRDF Dictionaries Corpus as references.
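
A character frequency analysis of this kind can be sketched in a few lines. The example below is a hypothetical illustration: the function names, the toy `allowed` character set, and the placeholder sentences are assumptions, and a real check would take the allowed inventory from the official orthography and the ILRDF Dictionaries Corpus.

```python
import unicodedata
from collections import Counter


def char_frequencies(sentences):
    """Count non-whitespace characters across all sentences (NFC-normalized)."""
    counts = Counter()
    for sentence in sentences:
        normalized = unicodedata.normalize("NFC", sentence)
        counts.update(ch for ch in normalized if not ch.isspace())
    return counts


def unexpected_chars(counts, allowed):
    """Return the characters, with counts, that fall outside the allowed set."""
    return {ch: n for ch, n in counts.items() if ch not in allowed}


# Toy data: placeholder strings and a toy character inventory.
sentences = ["abc cab", "abx"]
allowed = set("abc")

counts = char_frequencies(sentences)
flagged = unexpected_chars(counts, allowed)
print(flagged)  # {'x': 1}
```

Characters that occur only a handful of times, or that fall outside the reference inventory, are good candidates for manual review.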

The corpus, as part of FormosanBank, can be accessed here.

The repository containing this corpus in FormosanBank, together with the code to reconstruct it, can be found here.

In accordance with our Terms of Use, if you use this corpus or any product derived from it in any publication, you must cite both FormosanBank and the reference listed under Citation above.
