Ubiqus releases a new dataset to train AI

Artificial Intelligence needs a lot of relevant data to “learn”

In recent months, Artificial Intelligence (AI) has become a buzzword. AI is often presented as a kind of magic grey matter that learns by itself. This is true to a certain extent.

Indeed, AI is based on learning models. But to train these models successfully, scientists need both enough data (quantity) and relevant, fit-for-purpose data (quality).

The body of data that Artificial Intelligence needs in order to “learn” is called a corpus (plural: corpora).

Ubiqus R&D: a recognised player in Automatic Transcription

The Ubiqus Group is a pioneer in the field of transcription. As such, we have been involved for many years in creating and improving new transcription solutions.

As a major player in this market in France and worldwide, Ubiqus has partnered with the scientific community working on automatic transcription, known as ASR (Automatic Speech Recognition).

A new reference corpus for the scientific community

A few days ago, Ubiqus achieved a major breakthrough! The Ubiqus team published a new English-language dataset, simply named TED-LIUM3, in partnership with the LIUM (IT Laboratory of the University of Le Mans).
As it says on the tin, it is a corpus composed of transcripts of TED conferences. If you have never heard of them, they are public lectures in English on a variety of topics, built around “ideas worth spreading” [to learn more about TED, read this article from Wikipedia or go directly to the TED website].

The work of the Ubiqus Group’s team of researchers, combined with that of the university, has made it possible to significantly improve on TED-LIUM2. The latter had become a benchmark dataset for training ASR systems, such as those built with the Kaldi toolkit.

In layman’s terms, the data scientists increased the volume of data from 207 to 452 hours of transcribed and aligned TED conferences (alignment means matching each segment of the transcript to the corresponding stretch of audio).
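To make the idea of an aligned corpus more concrete, here is a minimal sketch in Python. The `Segment` record and the `total_hours` helper are hypothetical names chosen for illustration, not the actual TED-LIUM3 file format: each segment simply ties a piece of transcript to a start and end time in the audio, and adding up the segment durations gives the number of hours in the corpus.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One aligned segment: a piece of transcript tied to a span of audio.

    Hypothetical structure for illustration only.
    """
    start: float  # start time in the audio, in seconds
    end: float    # end time in the audio, in seconds
    text: str     # transcript of what is said between start and end


def total_hours(segments):
    """Sum the durations of all segments and convert seconds to hours."""
    return sum(seg.end - seg.start for seg in segments) / 3600.0


# Two made-up segments from an imaginary talk.
talk = [
    Segment(start=0.0, end=4.5, text="good morning everyone"),
    Segment(start=4.5, end=9.0, text="today i want to share an idea"),
]
print(total_hours(talk))
```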

In doing so, they also demonstrated that adding more qualified data to various automatic transcription models significantly improves the quality of the transcription by reducing the Word Error Rate (WER).
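For readers curious about how WER is measured, it is conventionally the number of word substitutions, deletions and insertions needed to turn the automatic transcript into the reference transcript, divided by the number of words in the reference. The sketch below is a simple Python illustration of that standard definition, assuming plain whitespace tokenisation; the function name `word_error_rate` is our own, and this is not the scoring tool used in the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (word missing from hypothesis)
                curr[j - 1] + 1,                  # insertion (extra word in hypothesis)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One word of the six-word reference is missing from the automatic transcript.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

The lower the WER, the closer the automatic transcript is to what was actually said; a WER of 0 means a perfect match.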


In conclusion, with these results (published a few days ago in this scientific article) and this new corpus, the research community will be able to keep advancing the definition and training of acoustic models for English and further improve the results of automatic transcription.

