RECOGNIZING A MELODY BY ITS FRAGMENT USING NEURAL NETWORKS
DOI:
https://doi.org/10.31891/2307-5732-2026-361-20
Keywords:
melody recognition, deep learning, mel-spectrogram, audio classification, Telegram bot
Abstract
This work develops an intelligent system that automatically recognizes a musical composition from a short audio fragment using deep learning. It addresses the problem of identifying melodies when textual information, tags, or metadata are unavailable. The task is particularly relevant in modern digital environments, where users frequently encounter unknown music on streaming platforms, in social media, or in real-life audio recordings. The proposed approach uses convolutional neural networks (CNNs) as the core mechanism for extracting and classifying high-level audio representations.
In the course of the study, various factors influencing the performance and reliability of the recognition system were systematically examined. These included the choice of audio format (WAV versus MP3), the optimal length of analyzed fragments, the selection of spectral features (mel-spectrogram, chroma, constant-Q transform (CQT), and chroma energy normalized statistics (CENS)), as well as the effect of data augmentation techniques such as adding white noise or pitch shifting. Experimental evaluation demonstrated that the best balance between recognition accuracy and computational efficiency was achieved using one-second segments encoded in MP3 format and represented by mel-spectrograms. This configuration provided high robustness to common distortions while maintaining moderate resource consumption during training and inference.
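The segmentation and white-noise augmentation steps mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the non-overlapping split, the 22,050 Hz sample rate, and the 20 dB signal-to-noise ratio are all assumptions chosen for illustration. Decoding the MP3 file and computing the mel-spectrogram of each segment are left out; an audio library such as librosa would typically handle those steps.

```python
import numpy as np


def split_into_segments(signal, sample_rate, seconds=1.0):
    """Cut a mono signal into non-overlapping fixed-length segments.

    Samples left over after the last full segment are dropped
    (an assumption; the abstract does not say how remainders are handled).
    """
    seg_len = int(round(sample_rate * seconds))
    n_segments = len(signal) // seg_len
    return signal[: n_segments * seg_len].reshape(n_segments, seg_len)


def add_white_noise(signal, snr_db, rng=None):
    """Add Gaussian white noise scaled to a target signal-to-noise ratio in dB."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


if __name__ == "__main__":
    sample_rate = 22050  # hypothetical sample rate
    t = np.arange(int(2.5 * sample_rate)) / sample_rate
    tone = np.sin(2 * np.pi * 440.0 * t)  # 2.5 s test tone standing in for music

    segments = split_into_segments(tone, sample_rate)
    noisy = add_white_noise(segments[0], snr_db=20.0, rng=np.random.default_rng(0))
    print(segments.shape, noisy.shape)
```

Each row of `segments` is one training or inference example. Applying `add_white_noise` only to training copies keeps the evaluation data clean while exposing the model to distorted input.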
The resulting deep learning model was successfully integrated into a Telegram bot that enables end users to send audio or voice messages for identification. Upon receiving an audio fragment, the system analyzes it and returns both the most probable match and five alternative predictions, offering flexibility in cases of ambiguous input. During testing, particular attention was paid to the influence of recording methods and data transmission quality. It was observed that recordings obtained through Telegram’s built-in voice messaging feature tend to produce lower recognition accuracy, primarily due to signal compression and the introduction of background noise.
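The response format described above, one best match plus five alternatives, can be sketched as a simple ranking over class probabilities. The sketch assumes that the classifier produces one probability vector per one-second segment and that these vectors are averaged across the fragment. The abstract does not state how segment-level outputs are combined, so the averaging step, the function name `rank_predictions`, and the track labels are hypothetical.

```python
import numpy as np


def rank_predictions(segment_probs, labels, n_alternatives=5):
    """Return the best match and the next n_alternatives candidates.

    segment_probs: array of shape (n_segments, n_classes), one probability
    vector per analyzed segment. Averaging over segments is an assumption.
    """
    mean_probs = np.asarray(segment_probs, dtype=float).mean(axis=0)
    # Indices sorted by descending probability, truncated to best + alternatives.
    order = np.argsort(mean_probs)[::-1][: n_alternatives + 1]
    ranked = [(labels[i], float(mean_probs[i])) for i in order]
    return ranked[0], ranked[1:]


if __name__ == "__main__":
    labels = [f"track_{i}" for i in range(8)]  # hypothetical reference database
    segment_probs = [
        [0.05, 0.40, 0.10, 0.05, 0.20, 0.05, 0.10, 0.05],
        [0.05, 0.50, 0.05, 0.05, 0.10, 0.05, 0.15, 0.05],
    ]
    best, alternatives = rank_predictions(segment_probs, labels)
    print("Best match:", best)
    for name, prob in alternatives:
        print("Alternative:", name, round(prob, 3))
```

Returning the alternatives together with their scores lets the bot show how confident the model is, which helps when a compressed or noisy voice message produces several close candidates.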
The results indicate that the system can be improved further by using recurrent or hybrid architectures (for example, LSTM or GRU networks), expanding the reference audio database, and training on synthetically distorted data to improve noise tolerance.
License
Copyright (c) 2026 МАКАР ДОРОЩУК, САВА ШЕВЧУК, ЛЕСЯ ДОБУЛЯК (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.