A METHOD FOR AUTOMATED CLASSIFICATION OF UML DIAGRAM TYPES BASED ON ANALYSIS OF PLANTUML INTERMEDIATE REPRESENTATIONS

VOLODYMYR POLISHCHUK; VLADYSLAVA SKIDAN

doi:10.31891/2307-5732-2026-365-78

Authors

VOLODYMYR POLISHCHUK, Kyiv National University of Technologies and Design, https://orcid.org/0009-0000-2161-4560
VLADYSLAVA SKIDAN, Kyiv National University of Technologies and Design, https://orcid.org/0000-0002-8358-9759

Keywords:

UML diagrams, PlantUML, diagram type classification, intermediate representation, heuristic classification, semantic-syntactic agreement

Abstract

PlantUML is one of the most widely used UML formats in open-source software repositories, yet all nine standard UML diagram types share a single @startuml entry point with no explicit type declaration. Existing automated classification approaches (image-based neural networks and LLM-based methods) require rendered images, training data, or external inference services, which limits their applicability to deterministic batch processing of large textual corpora. This paper asks to what extent PlantUML's syntax encodes the diagram type. It proposes a heuristic classifier that operates on PlantUML's compiled intermediate representation using a three-tier dispatch strategy that combines diagram class inspection, entity-level type analysis, and symbol-based majority counting. Because the classifier works directly on the compiler's output, it requires no training data, rendered images, or external APIs. Evaluation on 143,258 diagrams from the UML-in-the-Wild dataset, a corpus of PlantUML files mined from open-source repositories on GitHub and Bitbucket, yielded 95.96% overall semantic-syntactic agreement between the proposed syntactic classifier and published LLM-based semantic labels, with 93–99.7% agreement for seven of the nine types. Manual inspection of 110 disagreement files revealed that neither classifier is uniformly superior: in 47% of disagreements the compiler-based label is correct, in 45% the LLM label is correct, and 8% are ambiguous. The component/deployment boundary, where agreement drops to 48–62%, is attributed to two structural mechanisms: keyword semantic overloading (deployment-associated keywords such as database and node used for non-infrastructure concepts) and container-vs-leaf asymmetry (node containers invisible to entity-level inspection). These represent limitations of PlantUML's expressiveness rather than deficiencies of the classifier.
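The three-tier dispatch described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `DiagramIR` structure, the three lookup tables, and all class, entity, and symbol names in them are hypothetical stand-ins for whatever PlantUML's compiled intermediate representation actually exposes, and the tier-2 rule (accept only when all recognised entity types agree) is an assumption made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional

# All three tables are illustrative assumptions, not PlantUML's real names.
# Tier 1: diagram classes that determine the UML type on their own.
DIAGRAM_CLASS_TO_TYPE = {
    "SequenceDiagram": "sequence",
    "ActivityDiagram": "activity",
    "StateDiagram": "state",
}
# Tier 2: entity types that point to a UML type.
ENTITY_TYPE_TO_TYPE = {
    "class": "class",
    "interface": "class",
    "usecase": "usecase",
    "object": "object",
}
# Tier 3: symbols that each cast one vote for a UML type.
SYMBOL_TO_TYPE = {
    "component": "component",
    "node": "deployment",
    "database": "deployment",
}


@dataclass
class DiagramIR:
    """Hypothetical, simplified view of a compiled diagram."""
    diagram_class: str
    entity_types: List[str] = field(default_factory=list)
    symbols: List[str] = field(default_factory=list)


def classify(ir: DiagramIR) -> Optional[str]:
    # Tier 1: diagram class inspection.
    if ir.diagram_class in DIAGRAM_CLASS_TO_TYPE:
        return DIAGRAM_CLASS_TO_TYPE[ir.diagram_class]

    # Tier 2: entity-level type analysis. Assumed rule: accept only when
    # every recognised entity type maps to the same UML type.
    entity_votes = {
        ENTITY_TYPE_TO_TYPE[t] for t in ir.entity_types if t in ENTITY_TYPE_TO_TYPE
    }
    if len(entity_votes) == 1:
        return next(iter(entity_votes))

    # Tier 3: symbol-based majority counting. Ties resolve to the type
    # that was counted first (Counter.most_common keeps insertion order).
    symbol_votes = Counter(
        SYMBOL_TO_TYPE[s] for s in ir.symbols if s in SYMBOL_TO_TYPE
    )
    if symbol_votes:
        return symbol_votes.most_common(1)[0][0]
    return None  # nothing recognised; leave the diagram unclassified


mixed = DiagramIR("DescriptionDiagram", symbols=["component", "component", "database"])
label = classify(mixed)
```

In this sketch, the `mixed` diagram is labelled as a component diagram because two `component` symbols outvote one `database` symbol. Reverse the counts and the label flips to deployment, even if `database` denotes a non-infrastructure concept. This is the keyword overloading that the abstract identifies at the component/deployment boundary.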
