
Mozilla gave users an early holiday gift in November 2017 when it introduced an initial release of its open-source speech recognition model. The company said in a blog post that the model's accuracy approaches what humans can perceive when listening to the same recordings. Perhaps more significantly, it also released the world's second largest publicly available voice dataset, called Common Voice, contributed to by nearly 20,000 people globally.

Mozilla began work on Common Voice in July 2017, calling for volunteers to submit samples of their speech or to validate recordings submitted by other speakers. By November, Mozilla had accumulated nearly 400,000 recordings, representing 500 hours of speech. More is coming, as this release is just the first tranche, Mozilla's Sean White wrote in the blog post. White explained why Common Voice is so significant: “One reason so few services are commercially available is a lack of data. Startups, researchers or anyone else who wants to build voice-enabled technologies need high quality, transcribed voice data on which to train machine learning algorithms. Right now, they can only access fairly limited data sets.”
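Common Voice releases ship as audio clips plus metadata tables that record each clip's transcript and community validation votes. Here is a minimal sketch of filtering clips by those votes, assuming a TSV with path, sentence, up_votes and down_votes columns; the exact column layout varies by release version, and the sample rows below are illustrative, not real Common Voice data:

```python
import csv
import io

# Illustrative stand-in for a Common Voice metadata TSV (e.g. validated.tsv);
# real releases have more columns and different clip paths.
sample_tsv = """client_id\tpath\tsentence\tup_votes\tdown_votes
abc123\tclip_0001.mp3\tThe quick brown fox jumps over the lazy dog.\t3\t0
def456\tclip_0002.mp3\tOpen source voice data helps everyone.\t2\t1
"""

def load_validated_clips(tsv_text, min_net_votes=1):
    """Keep clips whose up-votes exceed down-votes by at least min_net_votes."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    clips = []
    for row in reader:
        net = int(row["up_votes"]) - int(row["down_votes"])
        if net >= min_net_votes:
            clips.append((row["path"], row["sentence"]))
    return clips

clips = load_validated_clips(sample_tsv)
print(clips)
```

Vote-based filtering mirrors how the project itself separates validated from unvalidated clips, which matters when training data quality is the whole point of the dataset.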

This rings true: one oft-repeated complaint in the voice community is that there is not enough good-quality data to train these applications. Of course, Amazon and Google have been building datasets of different sounds and voices for years. Google makes some of its audio datasets publicly available, but as Steven Tateosian, director of secure Internet of Things (IoT) and industrial solutions at NXP Semiconductors, noted, market talk characterizes these datasets as an interesting place to start, but not adequate for developing a production-level product. “There is just not enough data, or maybe it is not of the highest quality or diversity within the dataset.” He added that he has heard similar complaints about other public datasets.

As a result, many companies, including NXP, are opting to build their own datasets, either in-house or by outsourcing the task to a third party, as NXP has done. Some companies use public datasets to complement their own in-house dataset development; others find the public datasets sufficient for the product niche they are targeting.

But this is not to say that publicly available voice datasets should be summarily dismissed. Common Voice, for example, has all the earmarks of a robust collection of sounds and voices. Here are other voice datasets, both public and private, that are worth exploring.


Google AudioSet

Google AudioSet is an expanding ontology of 635 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. Google collected data from human labelers to probe the presence of specific audio classes in 10-second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context and content analysis. “The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers,” Google wrote. “By releasing AudioSet, we hope to provide a common, realistic-scale evaluation task for audio event detection, as well as a starting point for a comprehensive vocabulary of sound events.”
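Because the clips live on YouTube, AudioSet is distributed as CSV files of segment references (video ID, start and end seconds, label IDs) alongside an ontology that maps label IDs to human-readable class names. A minimal sketch of parsing that segment format follows; the sample rows and the tiny inline ontology are illustrative stand-ins, not excerpts from the real files:

```python
import csv
import io

# Illustrative rows in AudioSet's segment-CSV shape: YouTube ID,
# start/end seconds, and a quoted, comma-separated list of label IDs.
segments_csv = '''-0RWZT-miFs,420.000,430.000,"/m/03v3yw,/m/09x0r"
0qZ3tI4nAZE,6.000,16.000,"/m/012xff"
'''

# Tiny stand-in for the ontology, which maps label IDs to class names.
ontology = {"/m/09x0r": "Speech", "/m/03v3yw": "Keys jangling", "/m/012xff": "Tools"}

def parse_segments(text):
    """Turn segment rows into dicts with resolved, human-readable labels."""
    rows = []
    for ytid, start, end, labels in csv.reader(io.StringIO(text)):
        rows.append({
            "ytid": ytid,
            "start": float(start),
            "end": float(end),
            "labels": [ontology.get(l, l) for l in labels.split(",")],
        })
    return rows

for seg in parse_segments(segments_csv):
    print(seg["ytid"], seg["labels"])
```

The indirection through label IDs is what lets the ontology grow without touching the millions of segment rows already labeled.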


VoxCeleb

VoxCeleb is a large-scale speaker identification dataset. It contains around 100,000 utterances from 1,251 celebrities, extracted from YouTube videos, spanning a diverse range of accents, professions and ages. “It’s an intriguing use case for isolating and identifying which superstar the voice belongs to,” according to VoxCeleb.


2000 HUB5 English Evaluation Transcripts

2000 HUB5 English Evaluation Transcripts was developed by the Linguistic Data Consortium (LDC) and consists of transcripts of 40 English telephone conversations. The HUB5 evaluation series focused on conversational speech over the telephone, with the task of transcribing conversational speech into text. Its goals were to explore promising new areas in the recognition of conversational speech, to develop advanced technology incorporating those ideas and to measure the performance of new technology. The release contains transcripts in .txt format for the 40 source speech data files used in the evaluation: (1) 20 unreleased telephone conversations from studies in which recruited speakers were connected through a robot operator to carry on casual conversations about a daily topic; and (2) 20 unscripted telephone conversations between native English speakers.

CALLHOME American English Speech

CALLHOME American English Speech was developed by the Linguistic Data Consortium (LDC) and consists of 120 unscripted 30-minute telephone conversations between native speakers of English. All calls originated in North America; 90 of the 120 calls were placed to various locations outside of North America, while the remaining 30 calls were made within North America. Most participants called family members or close friends.

LibriSpeech ASR Corpus

LibriSpeech consists of approximately 1,000 hours of 16kHz read English speech. The data is derived from read audiobooks from the LibriVox project. The LibriVox project is a volunteer effort responsible for the creation of approximately 8,000 public domain audio books, the majority of which are in English. Most of the recordings are based on texts from Project Gutenberg, also in the public domain.
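LibriSpeech groups its recordings by speaker and chapter, with one plain-text transcript file per chapter in which each line pairs an utterance ID with its transcript. A sketch of mapping those IDs to their expected audio paths follows; the sample lines and the directory path are illustrative of the layout, not guaranteed to match any particular release:

```python
import os

# Illustrative lines in the LibriSpeech .trans.txt shape:
# "<speaker>-<chapter>-<utterance> TRANSCRIPT IN UPPER CASE"
sample_trans = """1089-134686-0000 HE HOPED THERE WOULD BE STEW FOR DINNER
1089-134686-0001 STUFF IT INTO YOU HIS BELLY COUNSELLED HIM
"""

def parse_transcripts(trans_text, audio_dir="LibriSpeech/test-clean/1089/134686"):
    """Map each utterance ID to (expected .flac path, transcript)."""
    pairs = {}
    for line in trans_text.strip().splitlines():
        utt_id, transcript = line.split(" ", 1)
        pairs[utt_id] = (os.path.join(audio_dir, utt_id + ".flac"), transcript)
    return pairs

pairs = parse_transcripts(sample_trans)
print(pairs["1089-134686-0000"][1])
```

Encoding speaker and chapter into the utterance ID makes it trivial to build speaker-disjoint train/test splits, which is why the corpus is so widely used for ASR benchmarking.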

The CHiME-5 Dataset

This dataset deals with the problem of conversational speech recognition in everyday home environments. Speech material was elicited using a dinner party scenario: the dataset is made up of recordings of twenty separate dinner parties that took place in real homes. Each party lasted a minimum of two hours and was composed of three phases, each in a different location:

  • Kitchen, preparing the meal.
  • Dining room, eating the meal.
  • Living room, for post-dinner conversation.


The TED-LIUM Corpus

The TED-LIUM corpus was made from audio talks and their transcriptions available on the TED website. It consists of 2,351 audio talks, 452 hours of audio and 2,351 aligned automatic transcripts in STM format.
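STM is a plain-text segment format in which each line carries a file ID, channel, speaker, start and end times, a bracketed label field, and then the free-text transcript. A minimal sketch of a parser for that line shape follows; the sample lines are illustrative approximations, not verbatim corpus content:

```python
# Illustrative STM-style lines: file channel speaker start end <labels> transcript
sample_stm = """911Mothers_2010W 1 911Mothers_2010W 14.95 16.19 <o,f0,female> because of my older brother
911Mothers_2010W 1 911Mothers_2010W 16.19 25.02 <o,f0,female> i opened my eyes to the world
"""

def parse_stm(text):
    """Split each STM line into its six header fields plus the transcript."""
    segments = []
    for line in text.strip().splitlines():
        # maxsplit=6 keeps the whole transcript together as the last field
        fname, channel, speaker, start, end, label, transcript = line.split(None, 6)
        segments.append({
            "file": fname,
            "speaker": speaker,
            "start": float(start),
            "end": float(end),
            "text": transcript,
        })
    return segments

for seg in parse_stm(sample_stm):
    print(f"[{seg['start']:.2f}-{seg['end']:.2f}] {seg['text']}")
```

Because every segment carries its own timestamps, STM transcripts can be aligned back to the audio without any external index, which is what makes the format convenient for ASR training and scoring.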

Free Spoken Digit Dataset

The Free Spoken Digit Dataset is a simple audio/speech dataset consisting of recordings of spoken digits. The recordings are trimmed so that they are nearly silent at the beginnings and ends.
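Part of what makes the dataset convenient for quick experiments is its file-naming convention, which encodes the label directly in the name. A minimal sketch, assuming the {digit}_{speaker}_{take}.wav convention the project uses (e.g. 7_jackson_32.wav for the digit seven):

```python
def label_from_filename(filename):
    """Recover (digit, speaker, take) from an FSDD-style filename,
    assumed to follow the {digit}_{speaker}_{take}.wav convention."""
    stem = filename.rsplit(".", 1)[0]          # drop the .wav extension
    digit, speaker, take = stem.split("_")
    return int(digit), speaker, int(take)

print(label_from_filename("7_jackson_32.wav"))  # → (7, 'jackson', 32)
```

With labels recoverable from names alone, a complete labeled dataset can be built from a single directory listing, with no transcript files to parse.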