ANALYSIS OF SEMANTIC CLUSTERS IN UKRAINIAN-LANGUAGE TELEGRAM POSTS: NLP METHODS AND VISUALIZATION

Authors

DOI:

https://doi.org/10.31891/2307-5732-2025-355-50

Keywords:

semantic clustering, social media posts, Linq-Embed-Mistral, HDBSCAN, KMeans, trend analysis

Abstract

This paper presents an approach to analyzing Ukrainian-language content from social networks in order to identify key thematic clusters and trends in public discourse. It focuses on short user-generated posts, which are typically fragmented, informal, and highly variable in meaning. To address these challenges, the study employs current natural language processing and machine learning methods: the Linq-Embed-Mistral embedding model, trained via contrastive learning, and a hybrid clustering pipeline that combines HDBSCAN and KMeans.

The proposed framework includes a full preprocessing pipeline: collection of social media posts, noise reduction, tokenization, lemmatization, and normalization of the text. Each message is then converted into a vector embedding using the Linq-Embed-Mistral model. These vectors are clustered in two stages: HDBSCAN detects dense semantic regions in the vector space, and KMeans refines the cluster structure within each of these regions.

The resulting clusters are visualized using time series plots, heatmaps, boxplots, pie charts, and co-occurrence graphs of key terms. The study examines temporal posting patterns, message length distributions, and lexical connections within topics. Results confirm the effectiveness of the proposed methodology: identified clusters demonstrate high semantic coherence, as supported by both visual inspection and quantitative validation (average silhouette score > 0.7). The analysis reveals that dominant themes include political events, social initiatives, cultural announcements, and contests or public calls.
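The quantitative validation mentioned above relies on the average silhouette score. As a hedged illustration, the sketch below computes it in plain Python. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the silhouette is (b - a) / max(a, b). The use of cosine distance, the exclusion of noise points (label -1), and the names `cosine_distance` and `mean_silhouette` are assumptions; the abstract does not specify these details.

```python
import math
from collections import defaultdict


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def mean_silhouette(vectors, labels, noise_label=-1):
    """Average silhouette over all non-noise points (hypothetical helper)."""
    clusters = defaultdict(list)
    for vec, lab in zip(vectors, labels):
        if lab != noise_label:  # assumption: noise points are not scored
            clusters[lab].append(vec)
    if len(clusters) < 2:
        raise ValueError("need at least two non-noise clusters")

    scores = []
    for lab, members in clusters.items():
        for i, vec in enumerate(members):
            if len(members) == 1:
                scores.append(0.0)  # usual convention for singleton clusters
                continue
            # a: mean distance to the other members of the same cluster
            a = sum(
                cosine_distance(vec, other)
                for j, other in enumerate(members) if j != i
            ) / (len(members) - 1)
            # b: smallest mean distance to the members of any other cluster
            b = min(
                sum(cosine_distance(vec, other) for other in group) / len(group)
                for other_lab, group in clusters.items() if other_lab != lab
            )
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: two tight, well-separated groups plus one noise point.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [5.0, 5.0]]
toy_labels = [0, 0, 1, 1, -1]
print(round(mean_silhouette(toy_vectors, toy_labels), 3))
```

Values close to 1 indicate compact, well-separated clusters, which is how a threshold such as 0.7 is usually read.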

The approach performs well on short, noisy text and can be extended further. Future work may include automatic topic summarization, sentiment analysis, and longitudinal tracking of discourse dynamics. The proposed solution offers a scalable tool for researchers in NLP, computational social science, digital media analysis, and information analytics.

Published

2025-08-28

How to Cite

LYNNYK, R., & VYSOTSKA, V. (2025). ANALYSIS OF SEMANTIC CLUSTERS IN UKRAINIAN-LANGUAGE TELEGRAM POSTS: NLP METHODS AND VISUALIZATION. Herald of Khmelnytskyi National University. Technical Sciences, 355(4), 349-356. https://doi.org/10.31891/2307-5732-2025-355-50