EVALUATION AND COMPARISON OF TEXT-TO-AUDIO GENERATION MODELS FOR MEDIA APPLICATIONS

OLEKSANDR MEDIAKOV; YURII BABIAK; TARAS BASYUK

doi:10.31891/2307-5732-2025-351-3

Authors

OLEKSANDR MEDIAKOV, Lviv Polytechnic National University, https://orcid.org/0000-0002-2580-3155
YURII BABIAK, Lviv Polytechnic National University, https://orcid.org/0009-0009-2771-3389
TARAS BASYUK, Lviv Polytechnic National University, https://orcid.org/0000-0003-0813-0785

DOI:

https://doi.org/10.31891/2307-5732-2025-351-3

Keywords:

diffusion models, audio generation, reverse diffusion, text-to-audio generation, generative AI evaluation

Abstract

In this paper, we evaluate and compare the performance of several state-of-the-art text-to-audio generation models in producing audio effects for media applications. To this end, we created a new evaluation framework comprising a curated dataset of text-audio pairs suitable for use in media products and a comprehensive set of metrics: Kullback–Leibler divergence between the classification labels of true and generated audio, Contrastive Language–Audio Pretraining (CLAP) embedding similarity, text-caption cosine similarity, and Fréchet Audio Distance (FAD) between expected and generated audio. Our results show that Stable Audio Open achieved the highest performance across most metrics, indicating superior audio quality and semantic alignment. The study not only quantifies the performance of these models but also provides a detailed analysis of their strengths and weaknesses in a real-world media production context. The findings reveal an intricate relationship between model architecture, training strategy, and the resulting audio quality. We also found that increasing the number of inference steps generally improved semantic alignment, with diminishing returns beyond 100 steps, and we investigate the trade-off between model size, training strategy, and performance. Scientifically, this study provides a new benchmark for evaluating text-to-audio generation models and contributes to a deeper understanding of diffusion-based audio synthesis. Practically, our findings offer guidance for media creators and developers in selecting appropriate models for specific applications, facilitating the integration of advanced audio generation into media production. Furthermore, the curated dataset and the defined metrics serve as resources for future research in this field.
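
As an illustration of two of the metrics named above, the following is a minimal sketch of the Kullback–Leibler divergence between two label distributions and of the cosine similarity between two embedding vectors. It is not the authors' implementation. The function names, the direction of the divergence (reference audio as the first argument, generated audio as the second), and the `eps` clamping are assumptions made for the example, and the classifier and embedding model that would produce the inputs are left out.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete label distributions of equal length.

    Assumption: p comes from the reference audio, q from the generated audio.
    Values are clamped to eps and renormalised so the logarithm is defined.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    p = [max(x, eps) for x in p]
    q = [max(x, eps) for x in q]
    sum_p, sum_q = sum(p), sum(q)
    return sum(
        (a / sum_p) * math.log((a / sum_p) / (b / sum_q))
        for a, b in zip(p, q)
    )


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Under these assumptions, a lower divergence means the generated clip is classified more like the reference clip, and a cosine similarity closer to 1 means the two embeddings point in nearly the same direction.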
