A METHOD FOR FORMING AUTHORIAL REPRESENTATIONS OF TEXTS USING CONTRASTIVE LEARNING
DOI:
https://doi.org/10.31891/2307-5732-2026-365-94
Keywords:
authorship attribution, contrastive learning, transformer models, text embeddings, metric learning, latent space, stylometry
Abstract
The problem of authorship attribution remains one of the fundamental challenges in computational linguistics, digital forensics, and intelligent information systems, particularly in the context of rapidly growing volumes of unstructured textual data. Although modern transformer-based architectures provide high-quality contextual embeddings, their latent representations are not explicitly optimized for discriminating between authorial styles. As a result, texts produced by different authors may form overlapping clusters in the embedding space, which negatively affects classification robustness and interpretability.
The paper presents a method for forming authorial representations of texts using supervised contrastive learning, aimed at improving the separability of author classes in the feature space. The proposed approach integrates transformer-based encoders with a contrastive metric learning module that explicitly optimizes embedding geometry by minimizing intra-class variance and maximizing inter-class distances. Positive and negative text pairs are constructed from author labels, and a contrastive loss function is applied to enforce discriminative representation learning. The method comprises the stages of text preprocessing, contextual embedding extraction, pair construction, contrastive optimization, and author-level aggregation, followed by classification.
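The pair construction and contrastive optimization stages can be illustrated with a minimal sketch. The abstract does not specify the exact loss, so the margin-based pairwise formulation below is an assumption chosen for illustration, as are the function names `build_pairs` and `contrastive_loss`, the `margin` value, and the two-dimensional toy vectors that stand in for transformer embeddings.

```python
import math
from itertools import combinations


def build_pairs(author_labels):
    """Return (i, j, same_author) for every unordered pair of texts.

    A pair is positive when both texts share an author label,
    negative otherwise.
    """
    return [
        (i, j, author_labels[i] == author_labels[j])
        for i, j in combinations(range(len(author_labels)), 2)
    ]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(embeddings, author_labels, margin=1.0):
    """Mean margin-based pairwise contrastive loss (illustrative assumption).

    Positive pairs are penalized by their squared distance, which pulls
    texts of the same author together. Negative pairs are penalized only
    when they are closer than `margin`, which pushes authors apart.
    """
    pairs = build_pairs(author_labels)
    total = 0.0
    for i, j, same_author in pairs:
        d = euclidean(embeddings[i], embeddings[j])
        total += d ** 2 if same_author else max(0.0, margin - d) ** 2
    return total / len(pairs)


# Toy embeddings: two tight groups far apart from each other.
embeddings = [[0.0, 0.0], [0.0, 0.1], [1.0, 1.0], [1.0, 0.9]]

# Labels that agree with the geometry yield a low loss.
well_separated = contrastive_loss(embeddings, ["a", "a", "b", "b"])

# Labels that cut across the groups yield a much higher loss.
mixed = contrastive_loss(embeddings, ["a", "b", "a", "b"])

print(well_separated < mixed)
```

In a training loop this quantity would be minimized with respect to the encoder parameters; the sketch only evaluates it on fixed vectors to show how author labels determine which pairs are pulled together and which are pushed apart.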
Experimental evaluation was conducted on benchmark authorship attribution datasets, including PAN-2019, IMDB62, and the Blog Authorship Corpus. The proposed method was compared with baseline transformer classifiers without contrastive optimization. The results demonstrate a consistent improvement in classification accuracy, macro-averaged F1-score, and clustering quality metrics. The contrastive framework makes the embeddings of texts by the same author more compact while increasing the distances between clusters of different authors.
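For readers unfamiliar with the macro-averaged F1-score mentioned above, the following sketch shows how it weights every author equally regardless of how many texts each has. The function name `macro_f1` and the toy labels are hypothetical and are not taken from the paper's experiments.

```python
def macro_f1(true_authors, predicted_authors):
    """Unweighted mean of per-author F1 scores."""
    classes = sorted(set(true_authors) | set(predicted_authors))
    pairs = list(zip(true_authors, predicted_authors))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with an imbalanced author distribution.
true_authors = ["a", "a", "a", "b"]
predicted_authors = ["a", "a", "b", "b"]

score = macro_f1(true_authors, predicted_authors)
print(round(score, 4))
```

Here author "a" has an F1 of 0.8 and author "b" an F1 of about 0.667, so the macro average is about 0.733, slightly below the plain accuracy of 0.75.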
The scientific contribution of this study lies in the development of a supervised contrastive learning framework specifically tailored to forming authorial representations. The practical significance of the results is the improved reliability of automated authorship attribution systems, which enables their application in digital forensics, plagiarism detection, cybersecurity monitoring, and large-scale text analytics. The proposed method can be extended to multilingual and cross-domain scenarios, forming a foundation for further research in discriminative author modeling and metric learning in natural language processing.
License
Copyright (c) 2026 ВІКТОРІЯ БАДЗЬ, ВАСИЛЬ ТЕСЛЮК (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.