Back-translated Swahili-French 1M sentence parallel data

    • File Size 78.72 MB
    • File Count 1
    • Create Date April 15, 2021
    • Last Updated April 15, 2021

    Synthetic data used in the experiments of the paper "Congolese Swahili Machine Translation for Humanitarian Response", published at the Africa NLP workshop organized within the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021).

    Paper link: https://arxiv.org/abs/2103.10734

    To cite:

    @inproceedings{oktem2021congolese,
    title={Congolese Swahili Machine Translation for Humanitarian Response},
    author={Alp Öktem and Eric DeLuca and Rodrigue Bashizi and Eric Paquin and Grace Tang},
    booktitle={Africa NLP workshop organized within the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL2021)},
    year={2021},
    month=apr,
    url={https://arxiv.org/abs/2103.10734}
    }

    Pair-converted data

    This portion of the synthetic data was originally Swahili-English. The English side was machine translated into French to help build Congolese Swahili-French translation models.
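
The pair conversion described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used for the paper: `pair_convert` and `toy_en_fr` are hypothetical names, and the toy lookup only stands in for a real English-to-French MT system. The example sentences are illustrative and are not taken from the dataset.

```python
from typing import Callable, Iterable, List, Tuple


def pair_convert(
    sw_en_pairs: Iterable[Tuple[str, str]],
    translate_en_fr: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Turn Swahili-English pairs into Swahili-French pairs.

    The Swahili side is kept unchanged; only the English side is
    replaced by its machine translation into French.
    """
    converted = []
    for swahili, english in sw_en_pairs:
        french = translate_en_fr(english)
        converted.append((swahili, french))
    return converted


# Hypothetical stand-in for an English-to-French MT model (assumption).
def toy_en_fr(sentence: str) -> str:
    lookup = {"Thank you": "Merci", "Good morning": "Bonjour"}
    # Unknown sentences are passed through untouched in this toy version.
    return lookup.get(sentence, sentence)


pairs = [("Asante", "Thank you"), ("Habari ya asubuhi", "Good morning")]
sw_fr = pair_convert(pairs, toy_en_fr)
```

In practice `translate_en_fr` would wrap whichever MT system is available; the function only assumes it maps one English sentence to one French sentence.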

    Monolingual data

    This portion is built from two monolingual Swahili corpora. The Swahili sentences were machine translated into French using our models.
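
As a rough sketch of how such synthetic pairs can be produced from monolingual text: each authentic Swahili sentence is paired with a machine-generated French translation. The names `back_translate` and `toy_sw_fr` are hypothetical, the toy lookup stands in for a real Swahili-to-French model, and the example sentence is illustrative rather than taken from the corpus.

```python
from typing import Callable, Iterable, List, Tuple


def back_translate(
    swahili_sentences: Iterable[str],
    translate_sw_fr: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each Swahili sentence with a synthetic French translation."""
    pairs = []
    for line in swahili_sentences:
        sentence = line.strip()
        if not sentence:
            # Skip blank lines, which are common in raw monolingual dumps.
            continue
        pairs.append((sentence, translate_sw_fr(sentence)))
    return pairs


# Hypothetical stand-in for a Swahili-to-French MT model (assumption).
def toy_sw_fr(sentence: str) -> str:
    lookup = {"Maji ni uhai": "L'eau, c'est la vie"}
    return lookup.get(sentence, "<untranslated>")


monolingual = ["Maji ni uhai\n", "   \n"]
synthetic = back_translate(monolingual, toy_sw_fr)
```

Only the French side is synthetic here; the Swahili side remains the original human-written text.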

    Used data sources:

    GourMeT corpus - https://gourmet-project.eu/publications-deliverables/

    ELRC part of OPUS, the open collection of parallel corpora - http://opus.nlpl.eu

    TICO-19 - https://tico-19.github.io/

    Gamayun Swahili minikit - https://gamayun.translatorswb.org/data/

    SWWiki - http://kevindonnelly.org.uk/swahili/swwiki/
