AUTOMATED TRACEABILITY LINK RECOVERY BETWEEN REQUIREMENTS AND SOURCE CODE USING LARGE LANGUAGE MODELS

Authors

DOI:

https://doi.org/10.31891/2307-5732-2026-361-38

Keywords:

requirement tracing, large language models, CodeBERT, semantic similarity, program code, embeddings, automation

Abstract

Keeping software requirements consistent with their implementation in source code becomes critical as modern software systems grow in scale and complexity. Without reliable traceability links, requirements are often implemented incompletely, code is harder to maintain, and system correctness is harder to verify. Building traceability matrices manually is labour-intensive and error-prone, especially in large projects. Large language models open up new opportunities for automating this process, because they can capture deep semantic relationships between textual requirements and source code fragments.

The article proposes a method for identifying traceability links between requirements and source code using transformer-based large language models. The approach converts textual artefacts into vector representations using CodeBERT, SBERT, and TF-IDF, then computes the semantic similarity between these representations to identify candidate links automatically. The method comprises four stages: data preparation, embedding generation, retrieval of relevant code fragments, and evaluation of the results.
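The retrieval stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the TF-IDF representation, because that can be written with the Python standard library, and every name in it (`tokenize`, `tfidf_vectors`, `top_k_links`, the requirement and code identifiers, the sample texts) is a hypothetical choice made for illustration. With CodeBERT or SBERT, the `tfidf_vectors` step would presumably be replaced by dense embeddings from the model, while the cosine-similarity ranking would stay the same.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Data preparation: split camelCase so identifiers such as
    # "resetPassword" yield the words "reset" and "password".
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


def tfidf_vectors(docs):
    # Vector generation: sparse TF-IDF vectors with a smoothed IDF.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def top_k_links(requirements, code_fragments, k=2):
    # Retrieval: for each requirement, rank all code fragments by
    # similarity and keep the k best candidates as potential links.
    req_ids = list(requirements)
    code_ids = list(code_fragments)
    vectors = tfidf_vectors(
        [requirements[r] for r in req_ids]
        + [code_fragments[c] for c in code_ids]
    )
    req_vecs = vectors[: len(req_ids)]
    code_vecs = vectors[len(req_ids):]
    links = {}
    for rid, rv in zip(req_ids, req_vecs):
        scored = sorted(
            ((cosine(rv, cv), cid) for cid, cv in zip(code_ids, code_vecs)),
            key=lambda pair: (-pair[0], pair[1]),
        )
        links[rid] = [(cid, round(score, 3)) for score, cid in scored[:k]]
    return links


# Hypothetical toy artefacts, not taken from the MSR-2021 dataset.
requirements = {
    "REQ-1": "The user shall be able to reset the password via email",
    "REQ-2": "The system shall export the report as a PDF file",
}
code_fragments = {
    "AuthService.resetPassword":
        "def resetPassword(user, email): send_reset_email(user, email)",
    "ReportExporter.exportPdf":
        "def exportPdf(report): write_pdf_file(report)",
    "Cache.evict": "def evict(key): del store[key]",
}

links = top_k_links(requirements, code_fragments, k=2)
for req_id, candidates in links.items():
    print(req_id, candidates)
```

In this toy example each requirement ranks the fragment that shares its vocabulary first. A purely lexical representation like this one fails when a requirement and its implementation use different words, which is the gap that semantic embeddings are meant to close.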

The experiments were conducted on the MSR-2021 dataset, which contains real traceability links for several projects. CodeBERT outperformed the baseline approaches (TF-IDF, SBERT): depending on the search depth, the method achieves an accuracy of up to 0.85 and an F1-score of up to 0.50, a strong result for automated information retrieval and ranking tasks. The study also confirmed the importance of accounting for the structural context of the code and showed how the Top-K parameter affects the balance between recall and precision. The results indicate that integrating LLM-based models significantly improves the automation, consistency, and quality of requirements traceability in modern software development environments.
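The effect of the Top-K parameter on precision and recall can be shown with a small worked example. The sketch below assumes micro-averaged metrics over all requirements; the article's exact evaluation protocol is not given here, so this is one plausible formulation. The function name, file names, and ground-truth links are hypothetical and are not drawn from the MSR-2021 dataset or from the reported results.

```python
def precision_recall_f1_at_k(ranked, gold, k):
    # Micro-averaged metrics: count hits over all requirements,
    # keeping only the k highest-ranked candidates for each one.
    true_positives = retrieved = relevant = 0
    for req_id, candidates in ranked.items():
        predicted = set(candidates[:k])
        truth = gold.get(req_id, set())
        true_positives += len(predicted & truth)
        retrieved += len(predicted)
        relevant += len(truth)
    precision = true_positives / retrieved if retrieved else 0.0
    recall = true_positives / relevant if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical ranked candidates (best first) and ground-truth links.
ranked = {
    "REQ-1": ["a.py", "b.py", "c.py"],
    "REQ-2": ["d.py", "a.py", "e.py"],
}
gold = {
    "REQ-1": {"a.py", "c.py"},
    "REQ-2": {"d.py"},
}

results = {k: precision_recall_f1_at_k(ranked, gold, k) for k in (1, 2, 3)}
for k, (p, r, f1) in results.items():
    print(f"K={k}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

On this toy data, raising K from 1 to 3 lifts recall from 0.67 to 1.00 while precision drops from 1.00 to 0.50: a deeper search recovers more true links at the cost of more false candidates.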

Published

2026-01-29

How to Cite

SKRYPNIUK, O., BAHRII, R., MANZIUK, E., & SKRYPNYK, T. (2026). AUTOMATED TRACEABILITY LINK RECOVERY BETWEEN REQUIREMENTS AND SOURCE CODE USING LARGE LANGUAGE MODELS. Herald of Khmelnytskyi National University. Technical Sciences, 361(1), 268-275. https://doi.org/10.31891/2307-5732-2026-361-38