
Mi-Go: tool which uses YouTube as data source for evaluating general-purpose speech recognition machine learning models

Abstract

This article introduces Mi-Go, a tool aimed at evaluating the performance and adaptability of general-purpose speech recognition machine learning models across diverse real-world scenarios. The tool leverages YouTube as a rich and continuously updated data source, accounting for multiple languages, accents, dialects, speaking styles, and audio quality levels. To demonstrate the effectiveness of the tool, an experiment was conducted in which Mi-Go was used to evaluate state-of-the-art automatic speech recognition machine learning models. The evaluation involved a total of 141 randomly selected YouTube videos. The results underscore the utility of YouTube as a valuable data source for the evaluation of speech recognition models, ensuring their robustness, accuracy, and adaptability to diverse languages and acoustic conditions. Additionally, by contrasting machine-generated transcriptions against human-made subtitles, the Mi-Go tool can help pinpoint potential misuse of YouTube subtitles, such as for search engine optimization.

1 Introduction

Speech recognition has become a critical component in numerous applications, ranging from virtual assistants and transcription services to voice-controlled devices and accessibility tools. The increasing reliance on speech recognition machine learning models necessitates robust and comprehensive evaluation methodologies to ensure their performance, reliability, and adaptability across diverse scenarios.

Existing speech recognition model evaluations often rely on curated datasets, such as LibriSpeech [25], CommonVoice [4], and TIMIT [32]. While these datasets provide a controlled environment for evaluation, they may not capture the full spectrum of real-world scenarios, potentially limiting the model’s generalizability. Additionally, these datasets may not be updated frequently, resulting in potential stagnation in performance evaluation.

In this article, we introduce Mi-Go (the name will be explained further), a tool designed to evaluate the prediction performance of general-purpose speech recognition machine learning models. Mi-Go harnesses the power of YouTube as a data source, providing access to a virtually unlimited repository of diverse audio-visual content. YouTube offers a rich and continuously updated collection of spoken language data, encompassing various languages, accents, dialects, speaking styles, and audio quality levels. This makes it an ideal source of data which can be used to evaluate the adaptability and performance of speech recognition models in real-world situations.

In recent years, there has been a growing interest in harnessing the vast amount of data available on platforms such as YouTube for machine learning tasks. Various approaches have been proposed to collect and process data from YouTube, including YouTube-8M [1], AudioSet [11], and GigaSpeech [6]. However, these methods primarily focus on video and audio classification tasks rather than the evaluation of speech recognition models.

The landscape of speech recognition technology has witnessed a paradigm shift, driven by rapid advancements in deep learning and artificial intelligence. Groundbreaking architectures, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and, more recently, transformer-based models, have revolutionized this domain, offering unprecedented accuracy in transcribing human speech. These models, trained on vast datasets, have demonstrated remarkable proficiency in navigating the complexities of language, including accents, dialects, and noise interference. The emergence of these models not only underscores the accelerated pace of development in this field but also leads one to believe that in the near future seamless human-computer interaction will become the norm. It should be noted that while these advancements present exciting prospects, they also raise compelling questions concerning data privacy, algorithmic bias, and the digital divide.

In our study, we address this need by proposing, and then empirically investigating, an evaluation tool that uses YouTube as a data source for assessing the prediction performance of speech recognition models, providing access to an extensive and diverse collection of audio samples for evaluation purposes. This approach ensures that the performance assessment remains up-to-date and relevant, capturing the nuances of real-world speech more accurately than curated datasets. To the best of our knowledge, there is little or even no research on using YouTube and video subtitles provided by YouTube users for speech recognition evaluation. Considering all the above, our goal is to answer the following research question:

  1. (RQ) Will evaluation of the selected speech recognition machine learning model using YouTube as a data source, as made possible by Mi-Go, produce similar results (measured using the same metric) as the evaluation conducted by the model creators?

Mi-Go automates the process of data extraction, annotation, and evaluation from YouTube, ensuring an up-to-date and representative sample for evaluation purposes. By leveraging algorithms for data filtering and annotation, Mi-Go facilitates a thorough and unbiased evaluation of the speech recognition models. Moreover, Mi-Go is designed to be easily adaptable, allowing for seamless integration with a variety of speech recognition solutions, making it a versatile and valuable tool in the speech recognition research community.

The primary motivation behind the development of the Mi-Go tool stems from the recognition of several limitations in existing approaches to evaluate speech recognition models. As speech recognition technology continues to play a critical role in various applications, including voice assistants, transcription services, and accessibility tools, ensuring the robustness and accuracy of these models is crucial.

Other speech recognition model evaluation methods often rely on static, curated datasets which, while useful for establishing a controlled environment, may not fully represent the diversity and complexity of real-world speech scenarios. This can lead to overfitting and limit the model’s generalizability, ultimately affecting its performance in real-world applications.

Additionally, as the field of speech recognition rapidly advances, existing evaluation methods may struggle to keep pace with new developments and challenges, potentially hindering the progress of these models. By utilizing YouTube as a data source, Mi-Go aims to overcome these limitations and offers a more comprehensive and dynamic evaluation environment.

Another motivation for the development of Mi-Go is the need for a flexible and adaptable tool capable of accommodating a variety of speech recognition models. This adaptability allows researchers and developers to compare and contrast the performance of various models, facilitating the continuous improvement and refinement of speech recognition systems.

By addressing these limitations and providing a dynamic, diverse, and adaptable evaluation tool, Mi-Go aspires to contribute significantly to the field of speech recognition research, driving innovation and fostering the development of highly accurate and robust models for various applications.

In summary, the Mi-Go tool is a contribution to the scientific and speech recognition community for the following reasons:

  • Rich and diverse test data source. Mi-Go leverages YouTube, a platform with vast and continuously updated content, to provide a rich source of diverse audio-visual content. This includes various languages, accents, dialects, speaking styles, and audio quality levels. Such diversity is ideal for evaluating the adaptability and performance of speech recognition models in real-world situations, ensuring robustness, accuracy, and adaptability to diverse languages and acoustic conditions.

  • Dynamic evaluation environment. By using YouTube as a data source, Mi-Go addresses limitations of previous approaches that often relied on static and potentially outdated datasets. It offers a more comprehensive and dynamic evaluation environment that reflects current real-world scenarios. This adaptability allows for the comparison of various models and facilitates the continuous improvement and refinement of speech recognition systems.

  • Practical and theoretical contributions. The experimental results obtained through Mi-Go highlight the utility of YouTube as a valuable data source for the evaluation of speech recognition models. This not only underscores the platform’s potential in enhancing model robustness and adaptability but also contributes to the academic discourse by providing a novel methodology for speech recognition research. Additionally, Mi-Go’s approach to contrasting machine-generated transcriptions against human-made subtitles offers insights into potential misuse of subtitles, such as for search engine optimization purposes, thereby adding a layer of practical utility in detecting transcription anomalies.

2 YouTube as a data source for speech recognition model evaluation

With over 2 billion monthly active users and a diverse array of content uploaded every day, YouTube offers a rich resource for researchers and developers working on speech recognition technology. By tapping into this wealth of multilingual and multi-genre content, it is possible to evaluate and refine speech recognition models across various languages, dialects, and acoustic environments.

A vast digital archive. YouTube stands as a colossal repository of digital content, presenting an unparalleled resource for research across various disciplines. As the world’s largest video sharing platform, it hosts billions of videos, a number that continues to grow rapidly, with about 500 hours of new content uploaded every minute. The exact number of hosted videos is not known, but it is estimated at no less than 2.5 billion [3]. The number of YouTube “Shorts” videos alone, identified through the usage of the hashtag #shorts, reached approximately 828 million in February 2024 (Note 1).

Diversity of content. YouTube’s vast library of user-generated content covers an extensive range of topics, languages, and styles. This diversity enables the evaluation of speech recognition models in real-world scenarios, such as noisy environments, various accents, and even low-quality audio recordings. By evaluating models on such a diverse dataset, researchers can identify potential weaknesses and areas for improvement, ultimately resulting in more robust and accurate speech recognition systems.

Multilingual corpus. One of the key advantages of using YouTube for speech recognition model evaluation is the platform’s multilingual nature. Videos on the site are available in numerous languages, allowing for the assessment of models’ performance across different linguistic settings. This multilingual corpus is invaluable for developing models that can handle a variety of languages, accents, and dialects, thereby expanding their utility and applicability.

Availability of human-generated transcripts. Many YouTube videos come with human-generated subtitles, either provided by content creators or contributed by users through the platform’s community contributions feature. These transcripts serve as valuable ground-truth data for evaluating speech recognition models, as they offer a reliable source of comparison for the models’ output. By comparing model-generated transcriptions with human-generated ones, researchers can assess the accuracy and performance of their models, identifying areas where improvements are needed.

Potential for continuous model improvement. The ever-growing volume of content on YouTube presents an opportunity for continuous improvement and adaptation of speech recognition models. As new videos are uploaded, models can be re-evaluated and fine-tuned to ensure they remain up-to-date and effective in an ever-changing linguistic landscape. This continuous feedback loop helps researchers identify trends, challenges, and emerging language patterns, which can be incorporated into model updates.

YouTube is an invaluable platform for speech recognition model evaluation due to its diverse, multilingual content and the availability of human-generated transcripts. By leveraging this vast resource, researchers and developers can evaluate and refine their models, ensuring they are robust, accurate, and adaptable to a variety of languages and acoustic conditions.

3 Related work

Studies leveraging YouTube in the area of automatic speech recognition have made significant strides across various facets of the field. These investigations utilize YouTube’s extensive library of videos to create datasets, improve speech recognition systems, and explore new approaches to automatic speech recognition, showcasing the platform’s value in advancing speech recognition technology research. Key insights from these works include:

  • Datasets for automatic speech recognition model creation. Researchers have developed methodologies for creating databases for audio/visual speech recognition using YouTube videos, such as the comprehensive Spanish dataset by Córdova-Esparza et al. [7]. In their work, the researchers presented a novel approach for creating an audio/visual speech recognition database, particularly addressing the scarcity of datasets in languages other than English, with a focus on Spanish. By selecting hundreds of YouTube videos, the researchers were able to extract facial features and align voice with text with millisecond accuracy, creating a dataset of over 100,000 samples. That methodology not only facilitated the development of automatic speech recognition systems in underrepresented languages but also provided a blueprint for creating datasets in any language by selecting appropriate YouTube content. Takamichi et al. [29] contributed to the diversification of automatic speech recognition research resources through the JTubeSpeech corpus, which consists of Japanese speech collected from YouTube. This corpus was designed for both speech recognition and speaker verification tasks, addressing the need for comprehensive datasets in Japanese for training and evaluating automatic speech recognition systems. The corpus’s creation from YouTube videos ensured a variety of speech contexts and speaker demographics, enhancing the robustness of automatic speech recognition models trained on it. Lakomkin et al. [20] developed the KT-speech-crawler, an automated tool for constructing speech recognition datasets from YouTube videos. This tool leveraged automatic captioning provided by YouTube to generate datasets, significantly reducing the manual effort required in dataset creation and enabling researchers to easily compile large-scale datasets tailored to specific speech recognition research needs. The latest work in the field, the creation of Yodas, a YouTube-derived dataset, by Li et al. [22], showcases the ongoing efforts to harness YouTube content as a diverse and comprehensive training data resource for developing new, robust speech recognition models. By compiling a diverse set of audio and speech samples from YouTube, Yodas aims to provide a versatile dataset that supports a wide range of automatic speech recognition tasks, including dialect and accent recognition, speech-to-text conversion, and speaker verification.

  • Improvement of automatic speech recognition systems. Liao et al. [23], from Google, explored the use of large-scale deep neural network acoustic modeling for YouTube video transcription. By leveraging the massive amount of unlabeled audiovisual content on YouTube and using video transcripts uploaded by YouTube users, the researchers were able to enhance the modeling process, demonstrating the potential of semi-supervised learning approaches in improving the performance of automatic speech recognition systems, especially in noisy and challenging acoustic environments. Their findings were subsequently used in actual improvements to YouTube’s automatic speech transcription.

  • Audio-visual speech recognition. In their work, Serdyuk et al. [28] delved into the enhancement of automatic speech recognition by incorporating video content from YouTube, a novel approach that significantly improved speech recognition accuracy. That study leveraged a large corpus of YouTube videos to train models, focusing on how the visual modality, particularly the movement of the speaker’s mouth, could augment audio features for speech recognition tasks. By replacing traditional 3D convolutional neural networks with a video transformer to extract visual features, Serdyuk and his team demonstrated a substantial improvement in word error rates on both a labeled subset of YouTube videos and the LRS3-TED public corpus (described in [2]). Their methodology highlighted the potential of utilizing video content alongside audio data to advance the capabilities of automatic speech recognition systems. This research not only showcased the importance of YouTube as a rich data source for speech recognition technologies but also opened new pathways for enhancing speech recognition accuracy by integrating audio-visual data, paving the way for more sophisticated and efficient automatic speech recognition systems.

  • Bias and inclusivity in automatic speech recognition. Koenecke et al. [18] uncovered significant racial disparities in the performance of commercial automatic speech recognition systems, including those developed by major tech companies. By analyzing speech from white and African American speakers, the study revealed a higher word error rate for African American speakers, highlighting a critical area for improvement in making automatic speech recognition technologies more inclusive and equitable. Tatman and Kasten [30] investigated the effects of talker dialect, gender, and race on the accuracy of Bing Speech and YouTube automatic captions. Their findings emphasized the impact of sociolinguistic factors on automatic speech recognition accuracy, urging the development of more sophisticated models that could better accommodate the diversity of human speech.

  • Utilizing YouTube as automatic speech recognition tool. Kim et al. [17] embarked on an insightful exploration into the capabilities of automatic speech recognition tools by utilizing YouTube’s automatic transcription service as a benchmark for automatic speech recognition accuracy. In their study, they meticulously compared manual transcriptions with those generated automatically by YouTube, alongside other leading speech recognition platforms such as Google Cloud, IBM Watson, Microsoft Azure, and Trint. Their analysis provided a comprehensive evaluation of the relative performance of these services, with a particular focus on YouTube’s efficacy in providing accurate transcriptions. This approach not only highlighted YouTube’s potential as an accessible and effective tool for automatic speech recognition but also contributed to the broader discourse on the reliability and accuracy of free, platform-based speech recognition services. Through their comparative study, Kim et al. shed light on the strengths and limitations of YouTube’s transcription capabilities, offering valuable insights for researchers, developers, and users seeking to leverage automatic speech recognition technology in various contexts.

These studies illustrate the extensive use of YouTube as a rich data source for automatic speech recognition research, ranging from training dataset creation to addressing biases and inclusivity in speech technologies. However, to the best of our knowledge, there is no work describing the direct use of YouTube to evaluate the functional performance of the existing machine learning models used for automatic speech recognition.

4 Mi-Go tool

Mi-Go was written in the Python programming language. Its source code is available for download under the Apache 2.0 license at the following address: https://github.com/Kowalski1024/Mi-Go

In the following, we describe the tool by walking through its subsequent operations, from launching it to saving the evaluation results of the selected speech recognition model.

4.1 Test Plan preparation

To start working with the tool, we need a file in JSON format, called a Test Plan. This is illustrated as number 1 in Fig. 1. In special circumstances, the Test Plan file can be written manually, but it is more efficient to generate it using an additional script named the Test Plan Generator. This script queries YouTube’s API to compile a random list of videos, based on command-line parameters specifying the category of the videos, language, duration, and desired number of list items (details can be found in Appendix 1). Only videos for which YouTube clearly indicates that human-made subtitles are available are considered. To query the API for transcripts, the Test Plan Generator uses the external Python library youtube-transcript-api (Note 2). After querying the API, the Test Plan file contains all the necessary metadata about the videos to be used in the subsequent evaluation; it also stores information about the selected parameters and the token for the YouTube Data API, which can be reused in further test iterations if needed.
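For illustration, the sketch below shows how such a selection of videos with human-made English subtitles could be implemented, assuming the classic youtube-transcript-api interface and the official google-api-python-client package for the YouTube Data API; the test plan layout and helper names are hypothetical simplifications, not Mi-Go’s actual format.

```python
import json
from googleapiclient.discovery import build            # pip install google-api-python-client
from youtube_transcript_api import YouTubeTranscriptApi  # pip install youtube-transcript-api

API_KEY = "YOUR_YOUTUBE_DATA_API_KEY"  # placeholder


def search_video_ids(category_id="2", language="en", max_results=10):
    """Query the YouTube Data API for candidate videos in a given category."""
    youtube = build("youtube", "v3", developerKey=API_KEY)
    response = youtube.search().list(
        part="id",
        type="video",
        videoCategoryId=category_id,
        relevanceLanguage=language,
        videoDuration="medium",
        videoCaption="closedCaption",  # only videos that have captions at all
        maxResults=max_results,
    ).execute()
    return [item["id"]["videoId"] for item in response.get("items", [])]


def has_human_subtitles(video_id, language="en"):
    """Keep only videos with a manually created (non-auto-generated) transcript."""
    try:
        transcripts = YouTubeTranscriptApi.list_transcripts(video_id)
        transcripts.find_manually_created_transcript([language])
        return True
    except Exception:
        return False


if __name__ == "__main__":
    candidates = search_video_ids()
    plan = {"videos": [vid for vid in candidates if has_human_subtitles(vid)]}
    with open("testplan.json", "w") as f:
        json.dump(plan, f, indent=2)  # hypothetical, minimal Test Plan layout
```

The actual generator additionally records video metadata, the chosen parameters, and the API token in the Test Plan, as described above.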

Fig. 1 Mi-Go and speech recognition model evaluation phases (described in the text)

4.2 Data extraction and transcription

In the next step, marked with number 2 in Fig. 1, Mi-Go reads the Test Plan and, based on that plan, downloads from YouTube the audio track of each video listed in the plan, along with the subtitles for that video. Thus, for each video, we have a pair consisting of an audio file (marked as 2a) and human-generated subtitles (marked as 2b).

In the next step (number 3 in Fig. 1), a speech recognition model is employed to convert the downloaded audio into a textual transcript. This is done by the TranscriptTest component, which executes the speech recognition machine learning model against the audio data collected from YouTube. The component can be adapted to a specific speech recognition model by extending it with model-specific code. This allows the use of different models from the popular “Hugging Face” machine learning model repository (Note 3), as well as models dedicated to toolkits such as ESPnet or NeMo.
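As an illustration of such a model-specific extension, the sketch below wraps the openai-whisper package behind a minimal transcription interface; the class and method names are hypothetical and are not Mi-Go’s actual TranscriptTest API.

```python
# A minimal, hypothetical adapter sketch: the real TranscriptTest interface may differ.
import whisper  # pip install openai-whisper


class WhisperTranscriptTest:
    """Runs an OpenAI Whisper model against a single audio file."""

    def __init__(self, model_name: str = "base.en"):
        self.model = whisper.load_model(model_name)

    def transcribe(self, audio_path: str) -> str:
        # whisper decodes and resamples common audio formats via ffmpeg
        result = self.model.transcribe(audio_path)
        return result["text"]


if __name__ == "__main__":
    test = WhisperTranscriptTest("tiny.en")
    print(test.transcribe("downloaded_audio.mp3"))  # placeholder path from step 2a
```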

To eliminate inconsequential textual differences, both the subtitles downloaded from YouTube (number 2b in Fig. 1) and those generated by the speech recognition model (4) undergo a normalization process (5a and 5b) using OpenAI’s normalization function (Note 4).
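For example, the English normalizer shipped with the openai-whisper package (the function referenced in Note 4) can be applied to both texts before comparison; the snippet below is a minimal sketch of that step, with illustrative strings only.

```python
from whisper.normalizers import EnglishTextNormalizer  # part of the openai-whisper package

normalizer = EnglishTextNormalizer()

reference = "Mr. Smith couldn't attend - he was 2 hours late!"
hypothesis = "mister smith could not attend he was two hours late"

# The normalizer lowercases, strips punctuation, and standardizes common variants,
# so both strings are reduced to a comparable canonical form before WER is computed.
print(normalizer(reference))
print(normalizer(hypothesis))
```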

4.3 Evaluation and metrics

Speech recognition model evaluation involves comparing the human-made subtitles downloaded from YouTube with those generated by the model (number 6 in Fig. 1). For this evaluation, the Mi-Go tool uses the open-source JiWER library (Note 5) to calculate the Word Error Rate (WER) measure [27]. WER is a common metric used to assess the performance of speech recognition systems, automatic translation systems, and other tasks involving transcription or translation. It is calculated by determining the minimum number of operations needed to transform the system output into the correct output. These operations are (see Eq. 1): word insertions I, word deletions D, and word substitutions S. To compute the WER, the total number of these operations is divided by the total number of words in the correct output N (in our case, the total number of words in the subtitles attached to a particular YouTube video), yielding a ratio that represents the rate of errors per word. The lower the WER, the better the performance of the system, as fewer errors were made.

$$\begin{aligned} \text{ WER } = \frac{S + D + I}{N} \cdot 100\% \end{aligned}$$
(1)

The concept of WER has been part of the field of automatic speech recognition and computational linguistics for many years. It is based on the Levenshtein distance or edit distance, a string metric for measuring the difference between two sequences, introduced by Vladimir Levenshtein in 1965 [21]. The exact individual or group that first applied this concept specifically as Word Error Rate in speech recognition or translation systems is not clearly documented. It likely emerged from the academic and industry communities working on speech and language processing technologies. WER has since become a standard measure in these fields. In some cases, WER is expressed as a percentage (by multiplying the original formula by 100%), especially when easy understanding of the measure is a main concern.
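A minimal sketch of this WER computation with the JiWER library is shown below; the reference and hypothesis strings are illustrative only and would, in Mi-Go’s workflow, be the normalized texts from the previous step.

```python
import jiwer  # pip install jiwer

# Normalized human-made subtitles (reference) and normalized model output (hypothesis)
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# jiwer.wer returns the error rate as a fraction; multiplying by 100 gives
# the percentage form of Eq. (1)
wer_percent = jiwer.wer(reference, hypothesis) * 100
print(f"WER = {wer_percent:.1f}%")
```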

The comparison results are stored both in an SQLite database (7b in Fig. 1) and directly in the previously used Test Plan file (7a). Such a Test Plan file, with its evaluation results recorded, can be reused for subsequent evaluation iterations, for instance to augment results not previously gathered or to retest the same videos specified within it. This dual storage approach (database and Test Plan file) facilitates simple access, filtering, and analysis of the evaluation results.
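For illustration, a simplified sketch of such dual storage using Python’s standard sqlite3 and json modules is shown below; the table schema and Test Plan field names are hypothetical and do not reflect Mi-Go’s actual database layout.

```python
import json
import sqlite3


def store_result(db_path, plan_path, video_id, model_name, wer):
    """Record one evaluation result in SQLite and in the Test Plan file."""
    # 1) SQLite database (step 7b in Fig. 1)
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS results (video_id TEXT, model TEXT, wer REAL)"
    )
    con.execute(
        "INSERT INTO results (video_id, model, wer) VALUES (?, ?, ?)",
        (video_id, model_name, wer),
    )
    con.commit()
    con.close()

    # 2) Test Plan file (step 7a), so the same plan can be re-run or augmented later
    with open(plan_path) as f:
        plan = json.load(f)
    plan.setdefault("results", {}).setdefault(video_id, {})[model_name] = wer
    with open(plan_path, "w") as f:
        json.dump(plan, f, indent=2)
```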

5 Experimental setup

Here, we describe an experimental setup that leverages the Mi-Go tool to use YouTube videos, across all categories, as data for evaluating speech recognition models by comparing their output with human-made transcripts. The purpose of the experiment is to confirm whether this setup (Mi-Go with YouTube as the evaluation data source) allows us to evaluate speech recognition models and obtain evaluation results similar to those obtained by the model creators.

5.1 Machine learning models used in the experiment

5.1.1 OpenAI’s Whisper

OpenAI, a company most notably recognized for its contribution to the field of artificial intelligence through the development of advanced large language models like GPT-3 and GPT-4, has also developed Whisper, a family of state-of-the-art, general-purpose speech recognition models that demonstrate exceptional performance in various applications [27].

Due to the proven outstanding performance of that model family, as well as the fact that it has been made available under the open-source MIT License, we decided to focus our experiment mainly on the evaluation of the Whisper models. At this point, we should explain that the name “Mi-Go” comes from a novella by H.P. Lovecraft called “The Whisperer in Darkness”; thus, in our opinion, it makes a good name for a tool initially created to evaluate the Whisper models.

The model is based on a Transformer sequence-to-sequence architecture and is trained on a range of speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. These tasks are collectively represented as a sequence of tokens to be predicted by the decoder, enabling a single model to supplant multiple stages of a conventional speech processing pipeline. The multitask training approach employs a series of unique tokens that act as task specifiers or classification targets [27].

The Whisper model is available in five different sizes. Four of them (tiny, base, small, medium) have additional English-only versions which, according to the creators, perform better when used in English-only applications [16]. Thus, in our research, we decided to use the English-only model versions. The “large” model was improved twice; therefore, in our experiment, we used two versions of the “large” model: the initial version, marked as “Whisper large-v1,” and the latest version, marked as “Whisper large-v3.” Each model offers a different balance between speed and accuracy. The names of the models used, their approximate memory requirements, and their relative speeds are provided in Table 1.

Table 1 Comparison of Whisper models [16]

5.1.2 NVIDIA’s Conformer-Transducer X-Large

To demonstrate that Mi-Go can be used for the evaluation of different speech recognition models, apart from OpenAI’s Whisper, we also included in our experiment models provided by other companies, such as one developed by NVIDIA, built upon the Conformer-Transducer architecture, which blends the strengths of transformer and convolutional neural network architectures [13]. The “X-Large” variant of this model signifies its substantial size and capacity, enabling it to process and understand complex audio inputs with higher accuracy compared to its predecessors. It is distributed under the Creative Commons BY 4.0 license [24].

When comparing the Conformer-Transducer X-Large model to OpenAI’s Whisper model, there are several key points of differentiation. The Whisper model, as we stated before, is based on a different architectural approach, primarily leveraging transformer neural networks. While both models aim to provide high accuracy in speech-to-text conversion, the NVIDIA model’s use of the Conformer-Transducer architecture may offer advantages in handling real-time or streaming audio applications. Additionally, the specific design choices in the NVIDIA model might result in better performance in certain scenarios, such as dealing with background noise or low-quality audio inputs [8].

The Conformer-Transducer X-Large model is primarily used by NVIDIA in their open-source NeMo toolkit, designed to simplify the process of building, training, and fine-tuning complex neural network models, particularly for speech and natural language processing tasks [19]. To indicate this fact, as well as to use a shorter name, in the following text we will refer to the model as “NeMo Transducer Xlarge.”
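For reference, the published model card for this checkpoint [24] suggests loading it via the NeMo toolkit roughly as follows; the sketch below assumes a working NeMo 1.x installation, uses a placeholder audio file name, and is not part of Mi-Go itself.

```python
import nemo.collections.asr as nemo_asr  # pip install "nemo_toolkit[asr]"

# Load the pretrained Conformer-Transducer X-Large checkpoint
asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(
    model_name="nvidia/stt_en_conformer_transducer_xlarge"
)

# Transcribe one or more 16 kHz mono WAV files (placeholder path)
transcriptions = asr_model.transcribe(["downloaded_audio.wav"])
print(transcriptions)
```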

5.1.3 ESPnet2 model

Similarly to NeMo, ESPnet2 (End-to-End Speech Processing Toolkit, version 2) is an open-source (using Apache 2.0 license) software toolkit designed for speech processing tasks, including automatic speech recognition, text-to-speech, and language modeling. Key features of ESPnet2 include its support for state-of-the-art machine learning models, its flexibility in handling different types of neural network architectures, and its comprehensive set of tools for training, evaluating, and fine-tuning models. ESPnet2 is widely used in the academic and research community for experimenting with novel ideas in speech processing and for developing systems that are more efficient and accurate in real-world applications [15].

Among the different speech recognition models available for the ESPnet2 toolkit, we chose one of the models trained by Shinji Watanabe, referred to in this work simply as “ESPnet2 Conformer” (Note 6), to use as a reference point for the Whisper models evaluation in our experiment. The selection of this particular model was motivated by the fact that it has been used successfully in official ESPnet2 demonstration material [31].

5.1.4 Facebook’s wav2vec2-base-960h

Facebook’s Wav2Vec 2.0 is an advanced neural network-based framework for speech recognition developed by Facebook AI researchers. It employs a self-supervised learning approach in which the model is initially trained on 53,000 hours of unlabeled audio [5]. This pre-training allows the model to learn representations of speech from the raw audio itself. Once pre-trained, Wav2Vec 2.0-derived models can be fine-tuned with a smaller amount of labeled data to achieve high performance in transcribing speech. The model selected for our experiment, “wav2vec2-base-960h,” was fine-tuned on 960 hours of the LibriSpeech dataset [25] of 16 kHz sampled speech audio.
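As an illustration of how such a Hugging Face model can be run (independently of Mi-Go’s own integration code), the following minimal sketch uses the transformers library; the audio file name is a placeholder and the recording is assumed to already be 16 kHz mono.

```python
import soundfile as sf   # pip install soundfile
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Placeholder path; the model expects 16 kHz, single-channel audio
speech, sample_rate = sf.read("downloaded_audio.wav")
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```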

5.2 Data collection and preparation

To begin the experiment, we instruct the Test Plan Generator, a component of the Mi-Go tool, via the command-line interface, to randomly fetch 7–10 videos per category listed in Table 2. Importantly, we chose this number of videos based solely on the available computing resources; the number of videos used for evaluation is not restricted and can be freely set by other Mi-Go users.

Table 2 YouTube videos categories considered in the experiment

These videos are randomly selected, but based on factors such as popularity, relevance, and the presence of human-generated subtitles, ensuring a diverse and high-quality dataset. The YouTube Data API is used to acquire the videos, while the youtube-transcript-api library retrieves their corresponding transcripts. Once fetched, the same set of videos is used to evaluate the selected automatic speech recognition models (presented in Section 5.1). The full list of the 141 videos used in the experiment is provided in Appendix 4.

6 Results

To answer the research question, we used the proposed Mi-Go tool to evaluate the selected automatic speech recognition models (presented in Section 5.1) on 141 YouTube videos representing all categories listed in Table 2, and collected Word Error Rate (WER) metrics as a result.

Statistics for the collected Word Error Rate values for all evaluated models are presented in Table 3 and illustrated in Fig. 2. Detailed statistics of the WER values for each model, broken down by category, are presented in Appendix 3. Results for different datasets, compared to our YouTube-based results, are gathered in Appendix 2.

Table 3 Word Error Rate [%] value statistics for all evaluated model versions
Fig. 2 Box plot of experiment results. Note the logarithmic scale

The Whisper model characteristics published by its authors [27] concern only the “large-v1” model; thus, in Table 3, the WER statistics for that model are presented in bold font.

As we can see, the median of the “large-v1” model evaluation results is WER = 7.4%. The worst median reported for the Whisper “large-v1” model by its creators was 19.6% (see Table 4 in Appendix 2). That result was achieved using the CORAAL speech recording dataset popularized by Gunter et al. [14]. Other datasets used by the creators to validate the models were recordings of earnings calls by Del Rio et al. [9], sets of recordings of online blogs and podcasts, and a dataset containing recordings of The Late Show (sic!). The Whisper “large-v1” model evaluation results from [27], compared to our results, are presented in Table 4 in Appendix 2. From this comparison, we can conclude that the Whisper model evaluation described in this work produces results similar to those of the tests conducted by the Whisper model creators using different data. Similarly, our results for the ESPnet2 Conformer and wav2vec models are similar to those of other authors, achieved using different datasets (Tables 5 and 7 in Appendix 2). The low WER median of the YouTube-based results for the Conformer-Transducer model, compared to the results of other authors (Table 6 in Appendix 2), can be explained by the occurrence of the highest WER value for this model (18,250%), which resulted from the model refusing to transcribe the music video “All I Want For Christmas Is You” by Mariah Carey (the other models handled it without issue), possibly because of a model failure (Note 7).

The newer version of the Whisper model, “large-v3,” yielded a worse WER median than the “large-v1” version. At the same time, however, “large-v3” yielded a much lower maximum WER value and standard deviation than “large-v1.” We can therefore interpret this result as an indication of higher stability of the “large-v3” outcomes compared to the older Whisper model version.

One can find large WER values among selected results, significantly different from the median. However, after reviewing the YouTube videos used in the tests that ended with high WER values, we can conclude that the reason for this is not a malfunction of the Mi-Go tool or the speech recognition model. Instead, the high WER values are due to actual discrepancies between the human-made subtitles attached to the video and the transcripts generated by the model. We found that such discrepancies occur for several reasons:

  1. Transcription errors. Humans, despite their proficiency, are not infallible and may make mistakes when transcribing speech to text. This could involve mishearing words or phrases, particularly in a noisy environment, during rapid speech, or when dealing with dialectal variations or accents. On the other hand, automatic speech recognition models can “hallucinate” under certain conditions, causing high WER values. For example, in our experiment, for one video containing little speech (Note 8), the Whisper “large-v1” model returned the following transcription:

    I’m not a dog. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. (...)

  2. Interpretation differences. Subtitling is not always a direct one-to-one transcription process. The transcriber’s understanding and interpretation of the speech can influence the outcome. Homonyms, idiomatic expressions, cultural references, or ambiguous statements can all be interpreted differently depending on the transcriber’s knowledge and perspective.

  3. Contextual adaptations. Subtitle makers often make deliberate changes to the text for various reasons. They may simplify or clarify speech to make it more accessible to the audience, especially if the speech is complex or jargon-filled. They may also modify the text to match reading speed constraints, given that text must be readable within the time it is displayed. Cultural adaptations may also be made to make the content more comprehensible to a specific audience (a form of video localization).

  4. Descriptive transcriptions. Some transcriptions go beyond the spoken content and provide descriptions of the visual elements in the video. These are often intended for visually impaired or blind viewers, to provide them with a more comprehensive understanding of the video content. Such a case occurred with the video that yielded the second highest WER value in our experiment (WER = 12,650%). While that video consists only of animal sounds, the actual subtitles are as follows (original spelling; Note 9):

    Cats Cats are very cute animals Animals that are close and affectionate with people Cat breed is a species with relatively high fertility, giving birth to 2-3 litters of kittens a year New born kittens only weighs about 100g and fits easily in the palm of your hand Horses are smart, wise animals Mother horses as young as 3 years old can start breeding (...)

  5. Search engine optimization (SEO). Some subtitles may be created or modified with the goal of improving the video’s visibility in search engine results. The inclusion of relevant keywords and phrases can make the video more likely to appear in search results related to those terms, hence enhancing the video’s discoverability. Here is an example of such subtitles from one of the fetched videos (Note 10):

    The Animals, Funniest Animals Video, Funny Video, Funny Animals, Cats, Dogs, Funny Cats, Funny Dogs, Pets, Funny Pets, Funny, Cute, Cute Animals, Cute Pets, Funny Cat Video, Funny Dog Video, Funny Animals Life, Wow, Best Animals, Best Animals Video, Compilation, Funny Video Compilation, Kittens, Puppies, Try not to laugh, Best Animals 2023, Best of 2022, Cute Puppy, Funny Kitten, Animals International, Funny Animal Video.

By comparing the model-made transcription to the existing human-made subtitles, discrepancies can be identified. Factors such as background noise, speaker accents, or low-quality audio can impact the model’s performance. Hence, although speech recognition models can help identify potential inaccuracies in subtitles, a degree of human oversight and validation is typically necessary to confirm and rectify these inaccuracies. From a different perspective, an automated setup that utilizes Mi-Go and a selected speech recognition model can significantly help in detecting misuse of video subtitles.

7 Conclusions and future work

In this paper, we have introduced Mi-Go, a lightweight and flexible tool for evaluating general-purpose speech recognition models using YouTube’s vast and diverse content. Traditional evaluation methods, which employ curated datasets, may not capture the broad array of real-world scenarios, potentially limiting a model’s generalizability. Mi-Go, by leveraging YouTube’s dynamic content, offers an enriched platform for evaluating such models. An experiment was conducted using 141 randomly fetched YouTube videos, demonstrating the usefulness of the Mi-Go tool in evaluating model prediction performance and in identifying discrepancies between model-generated transcriptions and human-made subtitles. The results underscore the necessity of human oversight in rectifying inaccuracies and the potential of the Mi-Go tool for enhancing speech recognition models’ robustness and adaptability.

While the Mi-Go tool demonstrates promising results in evaluating speech recognition models, several avenues for future work can further enhance its capabilities:

  1. Expanding the tool to accommodate other data sources (like non-English YouTube videos or video hosting services other than YouTube), providing an even more diverse and representative set of audio samples for evaluation

  2. Incorporating advanced techniques for data preprocessing and augmentation, which can help in simulating various real-world challenges, such as background noise and audio distortions

  3. Developing a graphical user interface and API, making it easier for researchers and developers to integrate and utilize the Mi-Go tool in their projects

  4. Extending the tool to support other tasks, such as speaker identification evaluation and language identification evaluation, in addition to automatic speech recognition evaluation

An important area for further work is the tool’s current inability to handle audio characteristics such as noise, the number of speakers, accents, and the distance of the speaker. This limitation stems from the tool’s foundational approach, which uses a straightforward comparison between human-made YouTube subtitles and those generated by a speech recognition model. This approach inherently focuses on textual alignment without delving into the nuances of audio quality or speaker attributes.

To address the handling of the mentioned audio characteristics, an advanced feature could be integrated into the Mi-Go tool, employing audio analysis techniques to evaluate and adjust for different audio characteristics before the transcription process. This enhancement could involve the implementation of pre-processing algorithms capable of detecting and compensating for noise levels, identifying speaker count and accents, and adjusting for recording distance. Such improvements would not aim to refine the accuracy of the speech recognition, as that is not the tool’s purpose, but to enrich Mi-Go’s speech recognition model evaluation results by adding possible root causes (such as high levels of noise or far-field speech) of potentially poor model performance.

Currently, the Mi-Go tool is undergoing a rigorous and comprehensive testing process, following high standards of software quality assurance [10]. This testing is crucial not only to ensure the tool’s reliability and accuracy in evaluating speech-to-text models but also to guarantee an optimal user experience, free from technical glitches and usability hurdles. By subjecting Mi-Go to such thorough scrutiny, we aim to provide users with a seamless and efficient tool for the evaluation of speech-to-text systems.

We hope that the Mi-Go tool will find wide application both in the evaluation of speech recognition machine learning models and in the detection of anomalies in existing video transcriptions.

Availability of data and materials

The data we use is available on the YouTube platform under a Fair Use policy (more information on Fair Use on YouTube can be found at https://support.google.com/youtube/answer/9783148?hl=en, access: 2023.09.09). Specific video URLs are listed in Appendix 4.

The source code of the Mi-Go tool is available under Apache 2.0 open source licence at https://github.com/Kowalski1024/Mi-Go.

Notes

  1. The number of #shorts-marked videos can be checked in the top left-hand corner of the page: https://www.youtube.com/hashtag/shorts

  2. Available from https://pypi.org/project/youtube-transcript-api/, access: 2024.03.13

  3. Refer to https://huggingface.co/docs/hub/repositories, access: 2024.03.14

  4. Refer to https://github.com/openai/whisper/blob/main/whisper/normalizers/english.py, access: 2024.03.14

  5. Available from https://github.com/jitsi/jiwer, access: 2024.03.14

  6. Mentioned model under its real name is available on https://zenodo.org/records/4585558, access: 2023.12.17

  7. Another explanation is that the model simply does not feel the so-called “Christmas spirit”

  8. https://www.youtube.com/watch?v=lCegfmeugdQ, access: 2024.03.14

  9. https://www.youtube.com/watch?v=4Co4mDeCIJ4, access: 2024.03.14

  10. https://www.youtube.com/watch?v=Jk83I-z6C98, access: 2024.03.14

References

  1. S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, S. Vijayanarasimhan, Youtube-8m: A large-scale video classification benchmark. (2016). arXiv preprint arXiv:1609.08675

  2. T. Afouras, J.S. Chung, A. Zisserman, Lrs3-ted: A large-scale dataset for visual speech recognition. (2018). arXiv preprint arXiv:1809.00496

  3. S. Allen. How many videos are on YouTube? 33+ interesting stats. (2023). https://www.nichepursuits.com/how-many-videos-are-on-youtube/. Accessed 17 Dec 2023

  4. R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F.M. Tyers, G. Weber, Common voice: A massively-multilingual speech corpus. Proceedings of the Twelfth Language Resources and Evaluation Conference. (European Language Resources Association, Marseille, 2020), p. 4218–4222

  5. A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 33, 12449–12460 (2020)

  6. G. Chen, S. Chai, G. Wang, J. Du, W.Q. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhang, et al., Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio. Proceedings of the Interspeech 2021. (International Speech Communication Association (ISCA), Brno, 2021), p. 3670–3674

  7. D.M. Córdova-Esparza, J. Terven, A. Romero, A.M. Herrera-Navarro. Audio-Visual Database for Spanish-Based Speech Recognition Systems, in Advances in Soft Computing: 18th Mexican International Conference on Artificial Intelligence, Xalapa, 2019,452–460

  8. M. Cui, J. Kang, J. Deng, X. Yin, Y. Xie, X. Chen, X. Liu, Towards effective and compact contextual representation for conformer transducer speech recognition systems. Proceedings of the Interspeech 2023. (International Speech Communication Association (ISCA), Dublin, 2023), p. 2223–2227

  9. M. Del Rio, N. Delworth, R. Westerman, M. Huang, N. Bhandari, J. Palakapilly, Q. McNamara, J. Dong, P. Zelasko, M. Jetté, Earnings-21: A practical benchmark for asr in the wild. Proceedings of the Interspeech 2021. (International Speech Communication Association (ISCA), Brno, 2021), p. 3465–3469

  10. M. Drąg, J. Hryszko, Testing of Mi-Go application - Technical report (2023). https://frege.ii.uj.edu.pl/dragmigo2023.pdf. Accessed 27 July 2023

  11. J.F. Gemmeke, D.P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R.C. Moore, M. Plakal, M. Ritter. Audio set: An ontology and human-labeled dataset for audio events, in 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), New Orleans, 2017, p. 776–780

  12. X. Gong, Y. Wu, J. Li, S. Liu, R. Zhao, X. Chen, Y. Qian, Longfnt: Long-form speech recognition with factorized neural transducer, in ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Ialissos, 2023, p. 1–5

  13. A. Gulati, J. Qin, C.C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al., Conformer: Convolution-augmented transformer for speech recognition. (2020). arXiv preprint arXiv:2005.08100

  14. K. Gunter, C. Vaughn, T. Kendall, Contextualizing/s/retraction: Sibilant variation and change in Washington DC African American Language. Lang. Var. Chang. 33(3), 331–357 (2021)

  15. T. Hayashi, R. Yamamoto, T. Yoshimura, P. Wu, J. Shi, T. Saeki, Y. Ju, Y. Yasuda, S. Takamichi, S. Watanabe, Espnet2-tts: Extending the edge of tts research. (2021). arXiv preprint arXiv:2110.07840

  16. J.W. Kim. Whisper GitHub Project Readme. (2023). https://github.com/openai/whisper#readme. Accessed 27 July 2023

  17. J.Y. Kim, C. Liu, R.A. Calvo, K. McCabe, S.C. Taylor, B.W. Schuller, K. Wu, A comparison of online automatic speech recognition systems and the nonverbal responses to unintelligible speech. (2019). arXiv preprint arXiv:1904.12403

  18. A. Koenecke, A. Nam, E. Lake, J. Nudell, M. Quartey, Z. Mengesha, C. Toups, J.R. Rickford, D. Jurafsky, S. Goel, Racial disparities in automated speech recognition. Proc. Natl. Acad. Sci. 117(14), 7684–7689 (2020)

  19. O. Kuchaiev, J. Li, H. Nguyen, O. Hrinchuk, R. Leary, B. Ginsburg, S. Kriman, S. Beliaev, V. Lavrukhin, J. Cook, et al., Nemo: A toolkit for building ai applications using neural modules. (2019). arXiv preprint arXiv:1909.09577

  20. E. Lakomkin, S. Magg, C. Weber, S. Wermter, Kt-speech-crawler: Automatic dataset construction for speech recognition from YouTube videos. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, 2018, p. 90–95

  21. V. Levenshtein, Binary codes capable of correcting spurious insertions and deletions of ones. Russ. Probl. Peredachi Informatsii 1, 12–25 (1965)

  22. X. Li, S. Takamichi, T. Saeki, W. Chen, S. Shiota, S. Watanabe, Yodas: YouTube-oriented dataset for audio and speech, in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, 2023, p. 1–8

  23. H. Liao, E. McDermott, A. Senior. Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription, in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, 2013, p. 368–373

  24. NVIDIA. Conformer-Transducer X-Large description (2023). https://huggingface.co/nvidia/stt_en_conformer_transducer_xlarge. Accessed 17 Dec 2023

  25. V. Panayotov, G. Chen, D. Povey, S. Khudanpur. Librispeech: An ASR corpus based on public domain audio books, in 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), Brisbane, 2015, p. 5206–5210

  26. Y. Peng, K. Kim, F. Wu, B. Yan, S. Arora, W. Chen, J. Tang, S. Shon, P. Sridhar, S. Watanabe, A comparative study on e-branchformer vs conformer in speech recognition, translation, and understanding tasks. Proceedings of the Interspeech 2023. (International Speech Communication Association (ISCA), Dublin, 2023), p. 2208–2212

  27. A. Radford, J.W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, Robust speech recognition via large-scale weak supervision. (2022). arXiv preprint arXiv:2212.04356

  28. D. Serdyuk, O. Braga, O. Siohan, Transformer-based video front-ends for audio-visual speech recognition for single and multi-person video. (2022). arXiv preprint arXiv:2201.10439

  29. S. Takamichi, L. Kürzinger, T. Saeki, S. Shiota, S. Watanabe, JTubeSpeech: corpus of Japanese speech collected from YouTube for speech recognition and speaker verification. (2021). arXiv preprint arXiv:2112.09323

  30. Tatman, R., Kasten, C, Effects of Talker Dialect, Gender & Race on Accuracy of Bing Speech and YouTube Automatic Captions. Proceedings of the Interspeech 2017. (International Speech Communication Association (ISCA), Stockholm, 2017), p. 934–938

  31. S. Watanabe, ESPnet2-ASR realtime demonstration (2023). https://espnet.github.io/espnet/notebook/espnet2_asr_realtime_demo.html. Accessed 17 Dec 2023

  32. V. Zue, S. Seneff, J. Glass, Speech database development at MIT: TIMIT and beyond. Speech Commun. 9(4), 351–356 (1990)

Acknowledgements

We extend our heartfelt gratitude to YouTube for the Fair Use policy allowing to use their platform and videos for research purposes. This study would not have been possible without the rich and diverse content available on YouTube, which has been instrumental in evaluating and demonstrating the adaptability and performance of speech recognition models in various real-world scenarios.

We would also like to express our appreciation to the creators of the Whisper speech recognition model for their outstanding contribution to the field of automatic speech recognition. Their innovative work has provided an excellent benchmark for assessing the effectiveness of our Mi-Go tool and has made a significant impact in advancing the capabilities of speech recognition technologies.

The resources provided by both YouTube and Whisper have been invaluable, enabling us to conduct this research with great scope and depth. Thank you for advancing the frontiers of audio, speech, and music processing.

Funding

Not applicable.

Author information

Authors and Affiliations

Authors

Contributions

All authors, Tomasz Wojnar, Jarosław Hryszko and Adam Roman contributed to this research work. The specific roles and contributions are elaborated as follows: Conceptualization and design: all authors participated in formulating the research questions, designing the experiments, and setting the methodology. Data selection and analysis: Tomasz Wojnar, Jarosław Hryszko, and Adam Roman equally shared the responsibilities of data selection and analysis. Each author independently verified the analyses carried out by the others to ensure accuracy and reliability. Tool development: the development of the “Mi-Go” tool was primarily carried out by Tomasz Wojnar, who took the lead in the creation of various modules and functionalities. Writing and revision: the majority of the manuscript drafting and substantial editing was led by Jarosław Hryszko. While each section of the paper was collectively discussed and revised by all authors, Jarosław Hryszko took on the primary role of composing and refining the text. Review and validation: every author took part in the validation of the experimental results. They also reviewed and approved the final version of the manuscript prior to submission. Project management: all authors were involved in the administration and logistics of the research project. All authors have read and approved the final version of this manuscript. By explicitly detailing the contributions of each author, we aim to provide a transparent account of the roles played in this research.

Authors’ information

Tomasz Wojnar is currently a computer science student with a keen interest in the real-life applications of machine learning. His academic focus lies in understanding how machine learning models can be optimized and deployed to solve everyday challenges. As a young researcher, Tomasz brings a fresh perspective to the team, particularly in the realm of speech recognition and its practical applications.

Jarosław Hryszko holds a Ph.D. in Computer Science, specializing in the use of machine learning for software quality assurance. With a strong background in both machine learning and software development practices, Dr. Hryszko provides a nuanced understanding of how quality assurance can be enhanced through machine learning technologies. His experience in the field adds considerable depth to the team’s expertise.

Adam Roman serves as an assistant professor and is the head of the Software Engineering Division of Faculty of Mathematics and Computer Science, Jagiellonian University, Poland. His primary research interests are centered on software testing, including AI testing. Professor Roman has contributed significantly to both academia and industry through his comprehensive studies on various aspects of software engineering and testing methodologies. His leadership and extensive experience provide the team with strategic direction, academic rigor and absurd sense of humor.

Each author brings a unique set of skills and expertise to this research project, collectively forming a multidisciplinary team capable of tackling complex problems in the field of audio, speech, and music processing.

Corresponding author

Correspondence to Jarosław Hryszko.

Ethics declarations

Competing interests

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Testplan generator command-line parameters

1.1 Usage

python testplan_generator.py <NumberOfVideos> [options]

1.2 Required arguments

NumberOfVideos: The number of randomly fetched videos to be used in the model’s evaluation. This argument is required.

1.3 Optional arguments

 

  • -o, --outputDirectory <directory>: Destination directory for the testplan files. Defaults to ./testplans/.

  • -l, --relevanceLanguage <ISO 639-1 language code>: Preferred language for the video’s content. Defaults to en.

  • -c, --videoCategoryId <video category ID>: Use videos from a specific YouTube category, characterized by the YouTube API’s category ID.

  • -t, --topicId <topic ID>: Use videos about a specific topic, characterized by the YouTube API’s topic ID.

  • -r, --regionCode <region code>: Use videos targeted to a specific region. Defaults to US.

  • -d, --videoDuration <duration>: Video duration filter. Possible values are any, long, medium, and short. Defaults to medium.

  • -lc, --videoLicense <license>: Video license filter. Possible values are any, creativeCommon, and youtube. Defaults to creativeCommon.

  • -q, --queryTerm <term>: Query term for filtering the videos.

1.4 Examples

Generate testplan for using 100 random videos:

python testplan_generator.py 100

Generate testplan for using 50 videos, output to the specified directory, and filter by English language:

python testplan_generator.py 50 -o /path/to/directory -l en

Note: Replace /path/to/directory with the actual directory path where you want the testplan to be saved.
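
To illustrate how these options relate to the underlying search request, the sketch below shows one way the flags could map onto a YouTube Data API v3 search.list call. This is only an illustrative sketch, not Mi-Go’s actual implementation: the function name build_search_request and the API_KEY placeholder are ours, and we assume the google-api-python-client package is available.

    from googleapiclient.discovery import build  # assumes google-api-python-client is installed

    API_KEY = "YOUR_YOUTUBE_DATA_API_KEY"  # placeholder; supply a real key

    def build_search_request(youtube, number_of_videos,
                             relevance_language="en", region_code="US",
                             video_duration="medium", video_license="creativeCommon",
                             video_category_id=None, topic_id=None, query_term=None):
        # Map the command-line options onto search.list parameters.
        params = {
            "part": "id,snippet",
            "type": "video",                          # category/duration/license filters require type=video
            "maxResults": min(number_of_videos, 50),  # the API returns at most 50 results per page
            "relevanceLanguage": relevance_language,  # -l
            "regionCode": region_code,                # -r
            "videoDuration": video_duration,          # -d
            "videoLicense": video_license,            # -lc
        }
        if video_category_id:
            params["videoCategoryId"] = video_category_id  # -c
        if topic_id:
            params["topicId"] = topic_id                   # -t
        if query_term:
            params["q"] = query_term                       # -q
        return youtube.search().list(**params)

    youtube = build("youtube", "v3", developerKey=API_KEY)
    # Example: 10 videos from the Science & Technology category (ID 28).
    response = build_search_request(youtube, 10, video_category_id="28").execute()
    video_ids = [item["id"]["videoId"] for item in response["items"]]

Requesting more than 50 videos would require paginating with the returned nextPageToken, and in practice one would also restrict the search to videos that carry human-made subtitles (for example via the videoCaption="closedCaption" filter) before turning the results into a testplan.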

Appendix 2: Comparison of WER values for different datasets to our results, broken down by model types

Table 4 Comparison of WER values for Whisper large-v1 model presented in [27] and our results (highlighted)
Table 5 Comparison of WER values for wav2vec 2.0 Large model presented in [27] and our results (highlighted)
Table 6 Comparison of WER values for Conformer-Transducer model presented in [12] and our results (highlighted)
Table 7 Comparison of WER values for ESPnet2 Conformer model presented in [26] and our results (highlighted)

Appendix 3: Detailed results for particular models, broken down by YouTube categories

Table 8 WER [%] statistics for Whisper large-v3 model
Table 9 WER [%] statistics for Whisper large-v1 model
Table 10 WER [%] statistics for Whisper medium.en model
Table 11 WER [%] statistics for Whisper small.en model
Table 12 WER [%] statistics for Whisper base.en model
Table 13 WER [%] statistics for Whisper tiny.en model
Table 14 WER [%] statistics for NeMo Transducer Xlarge model
Table 15 WER [%] statistics for ESPnet2 Conformer model
Table 16 WER [%] statistics for Wav2Vec2 model
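
The per-category statistics in Tables 8–16 are aggregations of per-video word error rates. As an illustrative sketch of how such numbers can be reproduced, the snippet below computes WER between a human-made subtitle track and a model transcript using the open-source jiwer package; jiwer, the helper video_wer, the normalization steps, and the example pairs are our assumptions and may differ from the exact procedure used in the experiment.

    import statistics
    import jiwer  # open-source WER implementation; an assumption, not necessarily the one used here

    # A light normalization pipeline; the experiment may normalize text differently.
    normalize = jiwer.Compose([
        jiwer.ToLowerCase(),
        jiwer.RemovePunctuation(),
        jiwer.RemoveMultipleSpaces(),
        jiwer.Strip(),
    ])

    def video_wer(human_subtitles: str, model_transcript: str) -> float:
        """Word error rate for a single video, in percent."""
        return 100.0 * jiwer.wer(normalize(human_subtitles), normalize(model_transcript))

    # Hypothetical reference/hypothesis pairs; in the experiment these would be the
    # human-made YouTube subtitles and the model's transcription of the same videos.
    pairs = [
        ("the quick brown fox jumps over the lazy dog",
         "the quick brown fox jumped over a lazy dog"),
        ("speech recognition models are evaluated with word error rate",
         "speech recognition models are evaluated with word error rates"),
    ]

    wers = [video_wer(ref, hyp) for ref, hyp in pairs]
    print(f"mean WER = {statistics.mean(wers):.1f}%, median WER = {statistics.median(wers):.1f}%")

Aggregating such per-video values per YouTube category (mean, median, and so on) yields statistics of the kind reported in the tables above.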

Appendix 4: List of YouTube videos randomly selected by Mi-Go tool for speech recognition model evaluation experiment

1.1 Category: Autos & Vehicles

 

  1. https://www.youtube.com/watch?v=EM4odIQZVgw
  2. https://www.youtube.com/watch?v=oUpDEsEle68
  3. https://www.youtube.com/watch?v=ANZDDO9TKc4
  4. https://www.youtube.com/watch?v=II7SZUBr8ig
  5. https://www.youtube.com/watch?v=4d88gPxvmFI
  6. https://www.youtube.com/watch?v=mYmNM8-XRP0
  7. https://www.youtube.com/watch?v=diY4pmAnb1g
  8. https://www.youtube.com/watch?v=ESc1GpDxieM
  9. https://www.youtube.com/watch?v=_W2MLhH6O8o
  10. https://www.youtube.com/watch?v=azwrKNmDkLE

1.2 Category: Comedy

 

  1. https://www.youtube.com/watch?v=mfjnDLbCroQ
  2. https://www.youtube.com/watch?v=dGPEBuTSmQg
  3. https://www.youtube.com/watch?v=4TIIdrOfbls
  4. https://www.youtube.com/watch?v=Bq7O57JOFAM
  5. https://www.youtube.com/watch?v=WROByxR_ZLg
  6. https://www.youtube.com/watch?v=iGqWc5EHeDc
  7. https://www.youtube.com/watch?v=eP91xAGs0WE
  8. https://www.youtube.com/watch?v=7wR3dnLWF6c
  9. https://www.youtube.com/watch?v=yinYc-bwAw0

1.3 Category: Education

 

  1. https://www.youtube.com/watch?v=wX78iKhInsc
  2. https://www.youtube.com/watch?v=rhgwIhB58PA
  3. https://www.youtube.com/watch?v=S294zRodS_4
  4. https://www.youtube.com/watch?v=GEmuEWjHr5c
  5. https://www.youtube.com/watch?v=fXsOlAYvgh0
  6. https://www.youtube.com/watch?v=cPnbdAFrSLM
  7. https://www.youtube.com/watch?v=TKQqKZ8EMes
  8. https://www.youtube.com/watch?v=y3fm6wNzK70
  9. https://www.youtube.com/watch?v=r5sw-6lJmTA

1.4 Category: Entertainment

 

  1. https://www.youtube.com/watch?v=Y7JQfHGrjqc
  2. https://www.youtube.com/watch?v=QeDumNeq5-w
  3. https://www.youtube.com/watch?v=2wc5VpRc450
  4. https://www.youtube.com/watch?v=LCygFRx2_DE
  5. https://www.youtube.com/watch?v=YfXdrUfKOn8
  6. https://www.youtube.com/watch?v=R5aiIWf5YGk
  7. https://www.youtube.com/watch?v=mNIXRXikYDc
  8. https://www.youtube.com/watch?v=CL0nHs73YO0
  9. https://www.youtube.com/watch?v=IlavFAjBdWo

1.5 Category: Film & Animation

 

  1. https://www.youtube.com/watch?v=kNw8V_Fkw28
  2. https://www.youtube.com/watch?v=2RALmFInHGg
  3. https://www.youtube.com/watch?v=7GjJef2QkQU
  4. https://www.youtube.com/watch?v=AZS5cgybKcI
  5. https://www.youtube.com/watch?v=BCCwCSdXRSE
  6. https://www.youtube.com/watch?v=ztpcMUH44jk
  7. https://www.youtube.com/watch?v=gZyjJtBIlow
  8. https://www.youtube.com/watch?v=MCKPIVszXUc

1.6 Category: Gaming

 

  1. https://www.youtube.com/watch?v=kbNjpCeYuvE
  2. https://www.youtube.com/watch?v=3JZeI9SF0lI
  3. https://www.youtube.com/watch?v=4G_7obY14X0
  4. https://www.youtube.com/watch?v=9GaROGghe3E
  5. https://www.youtube.com/watch?v=30YEc779Imc
  6. https://www.youtube.com/watch?v=lG2dXobAXLI
  7. https://www.youtube.com/watch?v=IUSWXcuzVno
  8. https://www.youtube.com/watch?v=gdrKuYwsq8s
  9. https://www.youtube.com/watch?v=gvjVP56r0BA
  10. https://www.youtube.com/watch?v=zViFnhVHPUI

1.7 Category: Howto & Style

 

  1. https://www.youtube.com/watch?v=kUE2fPLOUxo
  2. https://www.youtube.com/watch?v=SLfH9yOGs3o
  3. https://www.youtube.com/watch?v=DHzJMa_pqPY
  4. https://www.youtube.com/watch?v=-eqcnPq2xdE
  5. https://www.youtube.com/watch?v=vOo88OyATpI
  6. https://www.youtube.com/watch?v=rZhnLoHg0Sg
  7. https://www.youtube.com/watch?v=meSiRSFSQNY
  8. https://www.youtube.com/watch?v=WEGmOnpOvRM
  9. https://www.youtube.com/watch?v=b5G-rWS8Xmk
  10. https://www.youtube.com/watch?v=NmzyzsmQIxA

1.8 Category: Music

 

  1. https://www.youtube.com/watch?v=XXYlFuWEuKI
  2. https://www.youtube.com/watch?v=b1kbLwvqugk
  3. https://www.youtube.com/watch?v=QcIy9NiNbmo
  4. https://www.youtube.com/watch?v=gl1aHhXnN1k
  5. https://www.youtube.com/watch?v=aJOTlE1K90k
  6. https://www.youtube.com/watch?v=LHCob76kigA
  7. https://www.youtube.com/watch?v=fNFzfwLM72c
  8. https://www.youtube.com/watch?v=uWRlisQu4fo
  9. https://www.youtube.com/watch?v=aAkMkVFwAoo
  10. https://www.youtube.com/watch?v=YVkUvmDQ3HY

1.9 Category: News & Politics

 

  1. https://www.youtube.com/watch?v=PYooyPcRNVc
  2. https://www.youtube.com/watch?v=23cJM6UEdTQ
  3. https://www.youtube.com/watch?v=OWIrhn6KyNA
  4. https://www.youtube.com/watch?v=ovbGQ1B4rhY
  5. https://www.youtube.com/watch?v=L8uiUc5ivGs
  6. https://www.youtube.com/watch?v=S9e0gPyAJbo
  7. https://www.youtube.com/watch?v=9297wk_HG8M
  8. https://www.youtube.com/watch?v=7dBkVC40tdU
  9. https://www.youtube.com/watch?v=YQDdBR2ByqI
  10. https://www.youtube.com/watch?v=5jEv98bHD6M

1.10 Category: Nonprofits & Activism

 

  1. https://www.youtube.com/watch?v=qXHuQfZTH20
  2. https://www.youtube.com/watch?v=mrPjz30rAVQ
  3. https://www.youtube.com/watch?v=bfAzi6D5FpM
  4. https://www.youtube.com/watch?v=UzdF2zpex8o
  5. https://www.youtube.com/watch?v=3m6OGbLTQgY
  6. https://www.youtube.com/watch?v=KEoxUw-gwec
  7. https://www.youtube.com/watch?v=CxCsk-rvfTQ
  8. https://www.youtube.com/watch?v=iX9fizsJfuU
  9. https://www.youtube.com/watch?v=Yt38f7A_Rwo
  10. https://www.youtube.com/watch?v=Z-6IfEoETyU

1.11 Category: People & Blogs

 

  1. https://www.youtube.com/watch?v=7H3D-6nj_dY
  2. https://www.youtube.com/watch?v=3pJdft6QIUA
  3. https://www.youtube.com/watch?v=7AeMhVN-TFA
  4. https://www.youtube.com/watch?v=WgPZt7WGZJk
  5. https://www.youtube.com/watch?v=3zTR4ayDG38
  6. https://www.youtube.com/watch?v=XHw0bDa16xA
  7. https://www.youtube.com/watch?v=-wFsYY71wyk
  8. https://www.youtube.com/watch?v=lj5GXZaE7qs
  9. https://www.youtube.com/watch?v=u0uXzzW6bJ0

1.12 Category: Pets & Animals

 

  1. https://www.youtube.com/watch?v=4Co4mDeCIJ4
  2. https://www.youtube.com/watch?v=wRpvm3B5Ocg
  3. https://www.youtube.com/watch?v=OI4Y-efFkzU
  4. https://www.youtube.com/watch?v=Jk83I-z6C98
  5. https://www.youtube.com/watch?v=lCegfmeugdQ
  6. https://www.youtube.com/watch?v=Dl9Sa4H5TM0
  7. https://www.youtube.com/watch?v=j0SF0A6aDOU
  8. https://www.youtube.com/watch?v=mKoF48g89s4
  9. https://www.youtube.com/watch?v=xl-GCjSsgho
  10. https://www.youtube.com/watch?v=wlzMqe2ZqXo

1.13 Category: Science & Technology

 

  1. https://www.youtube.com/watch?v=SEI0LtUmpn4
  2. https://www.youtube.com/watch?v=Tf3QDABo4MA
  3. https://www.youtube.com/watch?v=OyQ3B1U8_XY
  4. https://www.youtube.com/watch?v=5pVjCJDAyhk
  5. https://www.youtube.com/watch?v=z-2N3WoikqA
  6. https://www.youtube.com/watch?v=_3TkeK2uK94
  7. https://www.youtube.com/watch?v=t7RaVnEGkc0
  8. https://www.youtube.com/watch?v=5s5uVZSdH7s
  9. https://www.youtube.com/watch?v=rPJcY_UwlXc
  10. https://www.youtube.com/watch?v=uxzbrkSxqqo

1.14 Category: Sports

 

  1. https://www.youtube.com/watch?v=dwV04XuiWq4
  2. https://www.youtube.com/watch?v=bIDKhZ_4jLQ
  3. https://www.youtube.com/watch?v=heIKaaamvdc
  4. https://www.youtube.com/watch?v=luR70V5gdS0
  5. https://www.youtube.com/watch?v=No8-mBek3rs
  6. https://www.youtube.com/watch?v=-RmUADCWI4A
  7. https://www.youtube.com/watch?v=hOtv5V9II8o

1.15 Category: Travel & Events

 

  1. https://www.youtube.com/watch?v=7vqfjBZ9864
  2. https://www.youtube.com/watch?v=MQROYY0dY9A
  3. https://www.youtube.com/watch?v=DNNMS7l6A-g
  4. https://www.youtube.com/watch?v=dNU1lJiDaSY
  5. https://www.youtube.com/watch?v=9p_GPYW0nO0
  6. https://www.youtube.com/watch?v=RB1MN0QoXH0
  7. https://www.youtube.com/watch?v=yQCBAaJg1LE
  8. https://www.youtube.com/watch?v=9wbNabuP6aM
  9. https://www.youtube.com/watch?v=Wt4XODPm4hA
  10. https://www.youtube.com/watch?v=kFMHx6XwBk0

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Wojnar, T., Hryszko, J. & Roman, A. Mi-Go: tool which uses YouTube as data source for evaluating general-purpose speech recognition machine learning models. J AUDIO SPEECH MUSIC PROC. 2024, 24 (2024). https://doi.org/10.1186/s13636-024-00343-9
