Technical Documents
Language Specific Peculiarities Document for Sheng as Spoken in Kenya
Speech-to-text models
Congolese Swahili speech-to-text model
DeepSpeech models for the Congolese Swahili language.
Congolese Swahili speech-to-text model
Bengali speech-to-text model
DeepSpeech models for the Bengali language.
Bengali speech-to-text model
Hausa baseline speech-to-text model
Hausa baseline ASR model
Speech corpora
Congolese Swahili speech corpora
Audio mini-kit
5000 audio samples recorded by 5 speakers. Sentences are taken from the Congolese Swahili mini kit. Format: WAV
Size: 11 hours
Congolese Swahili audio mini-kit
Congolese Swahili TICO-19 audio datasets
TICO-19 Congolese Swahili development and test sets recorded by a male and a female speaker.
Congolese Swahili TICO-19 audio development set
Congolese Swahili TICO-19 audio test set
Audio commands corpus
An audio corpus of 5 speakers uttering the numbers 1 to 10 and yes/no in Congolese Swahili.
Congolese Swahili audio commands corpus
Coastal Swahili speech corpus
Audio samples recorded by a Kenyan male speaker. Sentences are taken from the Swahili mini kit. Format: WAV
No. of samples: 4700
Swahili audio mini-kit
Parallel text corpora
Tigrinya – English parallel text corpora
English sentences are sourced from the Tatoeba repository and then translated into Tigrinya.
No. of sentences: 5000
Gamayun Mini kit 5k Tigrinya – English
Lingala – French parallel text corpora
French sentences are sourced from the Tatoeba repository and then translated into Lingala.
No. of sentences: 5000
Gamayun Mini kit 5k Lingala – French
Congolese Swahili – French parallel text corpora
French sentences are sourced from the Tatoeba repository and then translated into Congolese Swahili.
No. of sentences: 25305
Gamayun Mini kit 5k Congolese Swahili – French
Gamayun Small kit 10k Congolese Swahili – French
Synthetically produced Swahili-French parallel text corpora
English-paired and monolingual data converted to a Swahili-French parallel corpus using machine translation.
No. of sentences: 928,065
Back-translated Swahili-French 1M sentence parallel data
French – Nande parallel text corpora
French sentences are sourced from the Tatoeba repository and then translated into Nande.
No. of sentences: 15000
Gamayun Mini Kit 5k Nande – French
Gamayun Small kit 10k Nande – French
Colloquial Levantine Arabic parallel corpus
Posts shared on the Khabrona.Info Facebook page: 5052 parallel sentences with English translations and 658 monolingual sentences.
TWB Khabrona.info dataset
English – Hausa parallel text corpora
English sentences are sourced from the Tatoeba repository and then translated into Hausa.
No. of sentences: 15000
Gamayun Mini kit 5k Hausa – English
Gamayun Small kit 10k Hausa – English
Gamayun Medium kit 15k Hausa – English
English – Rohingya parallel text corpus
English sentences are sourced from the Tatoeba repository and then translated into Rohingya.
No. of sentences: 5000
Gamayun Mini kit 5k Rohingya – English
English – Swahili parallel text corpus
English sentences are sourced from the Tatoeba repository and then translated into Swahili.
No. of sentences: 5000
Gamayun Mini kit 5k Swahili – English
English – Kanuri parallel text corpus
English sentences are sourced from the Tatoeba repository and then translated into Kanuri.
No. of sentences: 5000
Gamayun Mini kit 5k Kanuri – English
Plain text corpora
Kanuri books corpus
Shuffled sentences from books collected from four Kanuri authors