Literature Review Notes | TICO-19
20 September 2020
Translation Initiative for COVID-19
With a focus on low-resource languages, TICO-19 aims to make data available in 35 languages (26 lesser-resourced languages in addition to 9 "pivot" languages) for researchers, and to assist the development of tools that effectively disseminate critical information in disaster situations such as the current global pandemic.
Dataset development methodology
TICO-19 was created following this methodology:
- Sampling from a variety of public sources - such as PubMed, Wikipedia (and related applications), the CMU English-Haitian Creole dataset, etc. - covering information domains including but not restricted to news, testing, travel advisories, and scientific development.
- Grouping languages into three main categories:
- Pivots: resource-rich languages which sometimes serve as an intermediary between English and a target language
- Priority: 18 languages classified as high-priority by Translators without Borders (TWB) due to the overwhelming number of requests they are receiving and "the strategic location of their partners (e.g. the Red Cross)"
- Important: 8 languages spoken (by millions) in South and South-East Asia
- Ensuring the quality of translation through a two-step human quality control process:
- Each document is translated by language service providers (LSPs) and then edited
- A selected fraction of the data (and all content from PubMed, since it is the hardest to translate) goes through another round of inspection
Developer resources
Additional COVID-19 data are made available as monolingual and parallel corpora, gathered from Wikipedia, BBC, VOA, NGOs, and national/state government sources.
Discussion & future work
- Aside from the baseline results, note the lack of pre-trained MT systems for low-resource languages such as Dari, Myanmar, Oromo, and isiZulu. This, along with the gap in BLEU scores between pivot and non-pivot languages, highlights the dire need to collect more data for under-represented communities and languages.
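As a refresher on the metric behind those comparisons, here is a minimal sentence-level BLEU sketch: geometric mean of smoothed n-gram precisions times a brevity penalty. This is illustrative only (the paper's evaluations would use a corpus-level tool such as sacreBLEU, and the exact smoothing here is an assumption):

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU with uniform n-gram weights and a brevity
    penalty. Add-one smoothing is used so one empty n-gram order does not
    zero the whole score."""
    cand = candidate.split()
    ref = reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clipped overlap: each candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum((cand_ngrams & ref_ngrams).values())
        total = max(sum(cand_ngrams.values()), 1)
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    geo_mean = math.exp(sum(log_precisions) / max_n)
    # Penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean
```

A hypothesis identical to the reference scores 1.0, while unrelated output scores near 0; the gap between pivot and non-pivot languages in the paper is a difference in exactly this kind of score, aggregated over a test corpus.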
- While current MT systems are mostly trained on general-purpose data or particular out-of-domain data, domain adaptation techniques should improve performance. Incorporating terminology previously unseen in the vocabulary could also help ensure translation quality.
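One common way to enforce terminology, sketched below under the assumption of a glossary-plus-placeholder approach (this is a generic technique, not the method TICO-19 prescribes, and the glossary entries are illustrative): replace approved source terms with placeholders before translation, then restore the approved target-language terms afterwards.

```python
import re

# Hypothetical COVID-19 glossary: source terms -> approved target translations.
GLOSSARY = {
    "physical distancing": "distanciation physique",
    "contact tracing": "recherche des contacts",
}

def inject_terms(source, glossary=GLOSSARY):
    """Replace glossary terms with numbered placeholders so the MT system
    passes them through untouched. Returns the masked text and a mapping
    from placeholder to approved target term."""
    mapping = {}
    for i, (term, target) in enumerate(glossary.items()):
        placeholder = f"__TERM{i}__"
        pattern = re.compile(re.escape(term), re.IGNORECASE)
        if pattern.search(source):
            source = pattern.sub(placeholder, source)
            mapping[placeholder] = target
    return source, mapping

def restore_terms(translation, mapping):
    """After MT, swap each placeholder for its approved target-language term."""
    for placeholder, target in mapping.items():
        translation = translation.replace(placeholder, target)
    return translation
```

In a real pipeline the masked text would go through the MT system between the two calls; placeholder pass-through only works if the model has seen (or is constrained to copy) such tokens, which is itself an engineering choice.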
- Multilingual NMT systems trained on massive web-crawled corpora are also promising for improving translation quality for languages at the lower end of data availability.
- Further efforts are needed to increase the representation of low-resource languages in public-domain corpora, in order to prepare for future crises where translation technologies will be needed.
Read more about the TICO-19 effort here.