Technical Documents
Language Specific Peculiarities Document for Sheng as Spoken in Kenya
Speech-to-text models
Congolese Swahili speech-to-text model
DeepSpeech models for the Congolese Swahili language.
Bengali speech-to-text model
DeepSpeech models for the Bengali language.
Hausa baseline speech-to-text model
Speech corpora
Congolese Swahili speech corpora
Audio mini-kit
5000 audio samples recorded by 5 speakers. Sentences from the Congolese Swahili mini kit. Format: WAV
Size: 11 hours
Congolese Swahili TICO-19 audio datasets
TICO-19 Congolese Swahili development and test sets recorded by a male and a female speaker.
Audio commands corpus
An audio corpus that consists of 5 speakers uttering numbers 1 to 10 and yes/no in Congolese Swahili.
Coastal Swahili speech corpus
Audio samples recorded by a Kenyan male speaker. Sentences from the Swahili mini kit. Format: WAV
No. of samples: 4700
Parallel text corpora
Tigrinya – English parallel text corpora
English sentences are sourced from the Tatoeba repository and then translated into Tigrinya.
No. of sentences: 5000
Lingala – French parallel text corpora
French sentences are sourced from the Tatoeba repository and then translated into Lingala.
No. of sentences: 5000
Congolese Swahili – French parallel text corpora
French sentences are sourced from the Tatoeba repository and then translated into Congolese Swahili.
No. of sentences: 25305
Synthetically produced Swahili-French parallel text corpora
English-paired and monolingual data converted to a Swahili-French parallel corpus using machine translation.
No. of sentences: 928,065
Back-translated Swahili-French 1M sentence parallel data
French – Nande parallel text corpora
French sentences are sourced from the Tatoeba repository and then translated into Nande.
No. of sentences: 15000
Colloquial Levantine Arabic parallel corpus
Posts shared on the Khabrona.Info Facebook page: 5052 parallel sentences with English translations and 658 monolingual sentences.
English – Hausa parallel text corpora
English sentences are sourced from the Tatoeba repository and then translated into Hausa.
No. of sentences: 15000
English – Rohingya parallel text corpus
English sentences are sourced from the Tatoeba repository and then translated into Rohingya.
No. of sentences: 5000
English – Swahili parallel text corpus
English sentences are sourced from the Tatoeba repository and then translated into Swahili.
No. of sentences: 5000
English – Kanuri parallel text corpus
English sentences are sourced from the Tatoeba repository and then translated into Kanuri.
No. of sentences: 5000
Plain text corpora
Kanuri books corpus
Shuffled sentences from books collected from four Kanuri authors