Synthetic data used in the experiments of the paper "Congolese Swahili Machine Translation for Humanitarian Response", published at the Africa NLP workshop organized within the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021).
Paper link: https://arxiv.org/abs/2103.10734
To cite:
@inproceedings{oktem2021congolese,
title={Congolese Swahili Machine Translation for Humanitarian Response},
author={Alp Öktem and Eric DeLuca and Rodrigue Bashizi and Eric Paquin and Grace Tang},
booktitle={Africa NLP workshop organized within the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL2021)},
year={2021},
month=apr,
url={https://arxiv.org/abs/2103.10734}
}
Pair-converted data
This portion of the synthetic data was originally Swahili-English. The English side was machine-translated into French to help build Congolese Swahili - French translation models.
Monolingual data
This corpus is built from two monolingual Swahili corpora. The sentences were machine-translated into French using our models.
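Both this corpus and the pair-converted data follow the same pattern: the Swahili side is kept as-is and the French side is produced by machine translation. The sketch below illustrates that pattern only; it is not the pipeline used for the paper. The function names `pair_convert` and `translate_monolingual`, and the translator callables passed to them, are hypothetical stand-ins for whatever English-French and Swahili-French systems are available.

```python
from typing import Callable, Iterable, List, Tuple

# A translator is assumed to be any callable mapping one sentence to one sentence.
Translator = Callable[[str], str]


def pair_convert(
    sw_en_pairs: Iterable[Tuple[str, str]], en_to_fr: Translator
) -> List[Tuple[str, str]]:
    """Turn (Swahili, English) pairs into (Swahili, synthetic French) pairs."""
    return [(sw, en_to_fr(en)) for sw, en in sw_en_pairs]


def translate_monolingual(
    sw_sentences: Iterable[str], sw_to_fr: Translator
) -> List[Tuple[str, str]]:
    """Pair each Swahili sentence with its machine-translated French output."""
    return [(sw, sw_to_fr(sw)) for sw in sw_sentences]


if __name__ == "__main__":
    # Dummy translator used only to make the sketch runnable; a real setup
    # would call an actual machine translation model here.
    def dummy(sentence: str) -> str:
        return "FR(" + sentence + ")"

    print(pair_convert([("sw sentence 1", "en sentence 1")], dummy))
    print(translate_monolingual(["sw sentence 2"], dummy))
```

In both cases only the French side is synthetic, so the Swahili text in the resulting pairs is unchanged from its source corpus.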
Data sources used:
GourMeT corpus - https://gourmet-project.eu/publications-deliverables/
ELRC part of OPUS, the open collection of parallel corpora - http://opus.nlpl.eu
TICO-19 - https://tico-19.github.io/
Gamayun Swahili minikit - https://gamayun.translatorswb.org/data/
SWWiki - http://kevindonnelly.org.uk/swahili/swwiki/