EVALUATION AND COMPARISON OF TEXT-TO-AUDIO GENERATION MODELS FOR MEDIA APPLICATIONS
DOI:
https://doi.org/10.31891/2307-5732-2025-351-3

Keywords:
diffusion models, audio generation, reverse diffusion, text-to-audio generation, generative AI evaluation

Abstract
In this research paper we evaluate and compare the performance of several state-of-the-art text-to-audio generation models in producing audio effects for media applications. To this end, we created a new evaluation framework, including a curated dataset of text-audio pairs suitable for media products and a comprehensive set of metrics, namely: Kullback–Leibler divergence between classification labels of reference and generated audio, Contrastive Language–Audio Pretraining (CLAP) embedding similarity, text-caption cosine similarity, and Fréchet Audio Distance (FAD) between expected and generated audio. Our results demonstrate that Stable Audio Open achieved the highest performance across most metrics, indicating superior audio quality and semantic alignment. This study not only quantifies the performance of these models but also provides a detailed analysis of their strengths and weaknesses in a real-world media production context. The findings reveal the intricate relationship between model architecture, training strategies, and the resulting audio quality. We also found that increasing the number of inference steps generally improved semantic alignment, but with diminishing returns beyond 100 steps. Our results further include an investigation into the trade-offs between model size, training strategy, and performance. Scientifically, this study provides a solid new benchmark for evaluating text-to-audio generation models and contributes to a deeper understanding of diffusion-based audio synthesis. Practically, our findings offer clear guidance for media creators and developers in selecting appropriate models for specific applications, facilitating the integration of advanced audio generation into media production. Furthermore, the curated dataset and defined metrics serve as valuable resources for future research in this field.
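Two of the metrics named in the abstract, embedding cosine similarity and Kullback–Leibler divergence between label distributions, can be illustrated with a minimal sketch. This is not the paper's implementation; the function names and the use of plain NumPy vectors (in place of actual CLAP embeddings and classifier outputs) are assumptions for illustration only.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors, e.g. a CLAP
    # text embedding and a CLAP audio embedding (illustrative inputs).
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def kl_divergence(p, q, eps=1e-10):
    # KL divergence D(p || q) between two classification label
    # distributions (reference vs. generated audio); eps avoids log(0).
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```

In this setup, a higher cosine similarity indicates closer semantic alignment, while a KL divergence near zero indicates that the generated audio elicits the same label distribution as the reference.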
License
Copyright (c) 2025 Oleksandr Mediakov, Yurii Babiak, Taras Basiuk (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.