A METHOD FOR AUTOMATED CLASSIFICATION OF UML DIAGRAM TYPES BASED ON ANALYSIS OF PLANTUML INTERMEDIATE REPRESENTATIONS
DOI:
https://doi.org/10.31891/2307-5732-2026-365-78
Keywords:
UML diagrams, PlantUML, diagram type classification, intermediate representation, heuristic classification, semantic-syntactic agreement
Abstract
PlantUML is one of the most widely used UML formats in open-source software repositories, yet all nine standard UML diagram types share a single @startuml entry point with no explicit type declaration. Existing automated classification approaches (image-based neural networks and LLM-based methods) require rendered images, training data, or external inference services, which limits their applicability to deterministic batch processing of large textual corpora. This paper asks to what extent PlantUML's syntax encodes the diagram type. A heuristic classifier is proposed that operates on PlantUML's compiled intermediate representation using a three-tier dispatch strategy combining diagram class inspection, entity-level type analysis, and symbol-based majority counting. Because the classifier works directly on the compiler's output, it requires no training data, rendered images, or external APIs. Evaluation on 143,258 diagrams from the UML-in-the-Wild dataset, a corpus of PlantUML files mined from open-source repositories on GitHub and Bitbucket, yielded 95.96% overall semantic-syntactic agreement between the proposed syntactic classifier and published LLM-based semantic labels, with 93–99.7% agreement for seven of nine types. Manual inspection of 110 disagreement files revealed that neither classifier is uniformly superior: in 47% of disagreements the compiler-based label is correct, in 45% the LLM label is correct, and 8% are ambiguous. The component/deployment boundary, where agreement drops to 48–62%, is attributed to two structural mechanisms: keyword semantic overloading (deployment-associated keywords such as database and node used for non-infrastructure concepts) and container-vs-leaf asymmetry (node containers invisible to entity-level inspection). These represent limitations of PlantUML's expressiveness rather than classifier deficiencies.
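The three-tier dispatch described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `DiagramIR` structure, the three lookup tables, and every key in them are hypothetical placeholders, since the abstract does not give the actual mappings or the shape of the compiler's intermediate representation. The sketch only shows the control flow, assuming a tier falls through to the next one when it cannot decide.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical lookup tables (assumptions for illustration only).
DIAGRAM_CLASS_TO_TYPE = {
    "SequenceDiagram": "sequence",
    "StateDiagram": "state",
    "ActivityDiagram": "activity",
}
ENTITY_TYPE_TO_TYPE = {
    "actor": "usecase",
    "usecase": "usecase",
    "class": "class",
    "object": "object",
}
SYMBOL_TO_TYPE = {
    "component": "component",
    "node": "deployment",
    "database": "deployment",
    "artifact": "deployment",
}


@dataclass
class DiagramIR:
    """Hypothetical stand-in for a compiled PlantUML diagram."""
    diagram_class: str
    entity_types: list = field(default_factory=list)
    symbols: list = field(default_factory=list)


def classify(ir: DiagramIR) -> str:
    # Tier 1: the diagram class alone decides when it maps to one type.
    if ir.diagram_class in DIAGRAM_CLASS_TO_TYPE:
        return DIAGRAM_CLASS_TO_TYPE[ir.diagram_class]

    # Tier 2: entity-level inspection decides when all recognised
    # entity types point to the same diagram type.
    candidates = {
        ENTITY_TYPE_TO_TYPE[e] for e in ir.entity_types if e in ENTITY_TYPE_TO_TYPE
    }
    if len(candidates) == 1:
        return candidates.pop()

    # Tier 3: majority count over symbols; ties go to the type seen first.
    votes = Counter(SYMBOL_TO_TYPE[s] for s in ir.symbols if s in SYMBOL_TO_TYPE)
    if votes:
        return votes.most_common(1)[0][0]
    return "unknown"
```

Under these assumed tables, a diagram whose symbols are `["component", "database", "database"]` is labelled `deployment` by the majority count even if its author meant a component diagram that happens to include data stores. That is the kind of keyword overloading the abstract identifies at the component/deployment boundary.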
License
Copyright (c) 2026 ВОЛОДИМИР ПОЛІЩУК, ВЛАДИСЛАВА СКІДАН (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.