Glosbe

Overview

This corpus is drawn from Glosbe, a crowdsourced online dictionary. So far only Amis is included; at the time of checking, it was the only dictionary of reasonable size.


Corpus Statistics

Amis UK

Word count: 90,631

Total audio: 0

Transcribed: 0

Untranscribed: 0

Translated sentences:

  • English: 0

  • Mandarin: 5,860

Morphologically segmented: 0

Proportion glossed: 0%


Corpus Processing

The process involves several steps to extract, scrape, clean and format the data:

  1. Extract Common Words:

Output: A list of the most frequent Amis words in the reference corpus

  2. Scrape Translations:

Output: Raw translation pairs saved to JSON

  3. Deduplicate Translations:

Output: A cleaned JSON file containing only unique translation pairs

  4. Convert to XML Format:

Output: FormosanBank XML format in Final_XML/amis_glosbe.xml

  5. Validate XML (Optional):

Output: A validation report on the XML structure

  6. Clean XML (Optional):

Output: Cleaned XML with standardized punctuation

  7. Update XML

Replace-all:

->

  8. Add Traditional Chinese

This was done semi-automatically and is not easily reproducible.

  9. Standardize

The orthography appears to be Ortho94 (mostly), but conversion would not change anything relevant.

  10. Remove some colons

Colons are used to introduce quotes in this text. However, colons have a specific meaning in the standard orthography, so they were replaced with commas.

  11. Add IPA

The IPA for Ortho94 differs from the IPA for Ortho113, so the Ortho94 IPA is used for the "original" tier.
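The deduplication and colon-replacement steps above can be sketched in Python. This is only an illustration, not the code from the repo: the function names `deduplicate_pairs` and `replace_quote_colons`, the `"source"` and `"target"` keys, and the placeholder strings are all hypothetical. It also assumes that a colon introducing a quote is always followed by whitespace, while an orthographic colon sits directly before a letter; the real data may need a stricter rule.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(text.split())


def deduplicate_pairs(pairs):
    """Return the unique translation pairs, keeping first-seen order.

    `pairs` is assumed to be a list of dicts with hypothetical
    "source" and "target" keys; the real JSON layout may differ.
    """
    seen = set()
    unique = []
    for pair in pairs:
        key = (normalize(pair["source"]), normalize(pair["target"]))
        if key in seen:
            continue  # same pair up to whitespace: skip it
        seen.add(key)
        unique.append({"source": key[0], "target": key[1]})
    return unique


# Assumption: a quote-introducing colon is followed by whitespace,
# whereas an orthographic colon is followed directly by a letter.
QUOTE_COLON = re.compile(r":(?=\s)")


def replace_quote_colons(text: str) -> str:
    """Replace colons that are followed by whitespace with commas."""
    return QUOTE_COLON.sub(",", text)
```

Because the lookahead `(?=\s)` does not consume the whitespace, the spacing after the new comma is left unchanged, and a colon directly followed by a letter is not touched.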


Corpus Notes

Because this is a crowd-sourced dictionary, its accuracy is uncertain.


Access Details

  • The repo containing the Glosbe corpus in FormosanBank, as well as the code to reconstruct the corpus, can be found here.


CC-BY-SA


Citation

In accordance with our Terms of Use, if you use this corpus or any product derived from this corpus in any publication, you must cite both FormosanBank and:

  • Glosbe. (n.d.). Glosbe dictionary. https://glosbe.com. Accessed 2025.
