Glosbe
Overview
This corpus is drawn from the crowdsourced online dictionary Glosbe. So far only Amis is included, since at the time of checking it was the only dictionary of reasonable size.
Corpus Statistics
Amis UK
Word count: 90,631
Total audio: 0
Transcribed: 0
Untranscribed: 0
Translated sentences (English): 0
Translated sentences (Mandarin): 5,860
Morphologically segmented: 0
Proportion glossed: 0%
Corpus Processing
The process involves several steps to extract, scrape, clean and format the data:
Extract Common Words:
Output: Generates a list of the most frequent Amis words from a reference corpus
Scrape Translations:
Output: Raw translation pairs saved to JSON
Deduplicate Translations:
Output: Cleaned JSON file with unique translation pairs
Convert to XML Format:
Output: FormosanBank XML format in Final_XML/amis_glosbe.xml
Validate XML (Optional):
Output: Validation report of XML structure
Clean XML (Optional):
Output: Standardized punctuation and cleaned XML
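As an illustration of the first and third steps above, here is a minimal Python sketch of frequency extraction and deduplication. It is not the repository's actual code: the function names, the whitespace tokenization, and the "amis" / "zh" dictionary keys are all assumptions made for the example.

```python
from collections import Counter


def most_frequent_words(sentences, top_n):
    """Return the top_n most frequent whitespace-delimited tokens.

    Assumption: tokens are split on whitespace only; the real pipeline
    may tokenize differently or strip punctuation first.
    """
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return [word for word, _ in counts.most_common(top_n)]


def deduplicate_pairs(pairs):
    """Drop repeated translation pairs, keeping first-seen order.

    Assumption: each pair is a dict with hypothetical keys "amis" and "zh".
    Pairs are compared after trimming surrounding whitespace.
    """
    seen = set()
    unique = []
    for pair in pairs:
        key = (pair["amis"].strip(), pair["zh"].strip())
        if key not in seen:
            seen.add(key)
            unique.append({"amis": key[0], "zh": key[1]})
    return unique
```

Keeping first-seen order makes the cleaned output stable across runs, which helps when comparing successive scrapes.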
Update XML
Replace-all:
->
Add Traditional Chinese
This was done semi-automatically and is not easily reproducible.
Standardize
The text appears to be written in Ortho94 (mostly), but conversion would not change anything relevant.
Remove some colons
Colons are used to introduce quotes in this text. However, the colon has a specific meaning in the standard orthography, so these colons were replaced with commas.
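A minimal Python sketch of this kind of replacement follows. It rests on an assumption not stated in the source: that a quote-introducing colon is always followed by whitespace, while an orthographic colon sits inside a word. The function name is hypothetical, and the actual cleanup may have used different criteria.

```python
import re

# Assumption: a colon followed by whitespace introduces a quote;
# a colon followed directly by a letter is orthographic and is kept.
QUOTE_COLON = re.compile(r":(?=\s)")


def replace_quote_colons(text):
    """Replace colons that precede whitespace with commas."""
    return QUOTE_COLON.sub(",", text)
```

Under this rule, a placeholder string such as "xx yy: zz" would become "xx yy, zz", while a word-internal colon as in "x:y" would be left untouched.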
Add IPA
The IPA for Ortho94 differs from that for Ortho113, so the Ortho94 IPA is used for the "original" tier.
Corpus Notes
Because this is a crowdsourced dictionary, its accuracy is uncertain.
Access Details
The repo containing the Glosbe corpus in FormosanBank as well as the code to reconstruct the corpus can be found here.
Copyright
CC-BY-SA
Citation
In accordance with our Terms of Use, if you use this corpus or any product derived from this corpus in any publication, you must cite both FormosanBank and:
Glosbe. (n.d.). Glosbe dictionary. https://glosbe.com. Accessed 2025.
Last updated