CLUSTER ANALYSIS OF SMALL SAMPLE MULTIDIMENSIONAL DATA ON THE EXAMPLE OF RECOVERY TIMES IN COVID-19 DISEASE

Authors

DOI:

https://doi.org/10.31891/2307-5732-2026-361-16

Keywords:

small sample, coronavirus, cluster analysis, Lance-Williams method, dendrogram, multiple regression

Abstract

The article examines the results of hierarchical agglomerative cluster analysis of a small sample of multidimensional data on patients who recovered from COVID-19. The condition of patients at the time of recovery is described by medical and physiological signs. The aim of the study is to identify the relationship between the duration of recovery, i.e. the time spent in a medical institution, and the physiological state of the patients. To study this relationship more precisely, the group of patients is divided into separate subgroups (clusters). The clusters were selected using a dendrogram, which yielded three clusters of patients. The distances between sign values are measured with the Euclidean metric. The medical and physiological signs used for cluster analysis were the values of the following indicators: physical (age, height, weight) and those of the cardiovascular, respiratory, immune and circulatory systems. The final result should be linear regression models of the relationship between the recovery time and the physiological state of patients. In other words, patients should be represented by separate groups (clusters), and a linear multiple regression model should be constructed for each cluster. To achieve this goal, the following methods were used: correlation analysis of the “patient – symptom” table, elimination of multicollinearity, and hierarchical agglomerative cluster analysis according to the Lance-Williams method with a flexible strategy and the Euclidean distance between normalized symptom values. After the patients were combined into clusters, linear multiple regression models of the multidimensional data were constructed.
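The clustering step named above can be illustrated with a minimal sketch. It assumes the common form of the Lance-Williams flexible update, d(k, i∪j) = α·d(k,i) + α·d(k,j) + β·d(i,j) with α = (1 − β)/2, and it assumes z-score normalization and β = −0.25; the abstract states neither choice. The function name `flexible_clustering` and the synthetic data are hypothetical and are not taken from the article.

```python
import numpy as np

def flexible_clustering(X, n_clusters=3, beta=-0.25):
    """Agglomerative clustering with the Lance-Williams flexible update.

    Assumptions (not from the article): z-score normalization, beta = -0.25,
    and no constant feature columns (their standard deviation would be zero).
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)      # normalize each feature
    diff = Z[:, None, :] - Z[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))          # Euclidean proximity matrix
    np.fill_diagonal(D, np.inf)                   # never merge a cluster with itself
    clusters = {i: [i] for i in range(len(Z))}
    alpha = (1.0 - beta) / 2.0
    while len(clusters) > n_clusters:
        active = sorted(clusters)
        sub = D[np.ix_(active, active)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        i, j = active[a], active[b]               # closest pair of clusters
        d_ij = D[i, j]
        for k in active:
            if k != i and k != j:
                # Lance-Williams flexible update for the merged cluster (stored in slot i)
                new = alpha * D[k, i] + alpha * D[k, j] + beta * d_ij
                D[k, i] = D[i, k] = new
        D[j, :] = np.inf                          # retire slot j
        D[:, j] = np.inf
        clusters[i].extend(clusters.pop(j))
    labels = np.empty(len(Z), dtype=int)
    for label, members in enumerate(clusters.values()):
        labels[members] = label
    return labels

if __name__ == "__main__":
    # Synthetic "patient x sign" table, for illustration only.
    rng = np.random.default_rng(0)
    table = rng.standard_normal((12, 4))
    print(flexible_clustering(table, n_clusters=3))
```

The loop stops once the requested number of clusters remains, which corresponds to cutting the dendrogram at a fixed level; recording the merge order and merge distances would be needed to draw the dendrogram itself.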
The analysis produced the following results: correlation analysis showed that the relationship between recovery time and symptoms is insignificant; the distances in the proximity matrix differ little from one another; and the quality of the clustering itself is low. Nevertheless, the multiple regression models are adequate for each of the three clusters, with a coefficient of determination close to unity. In addition, regression analysis led to three features being discarded because they had no influence on the recovery period. Thus, despite the low quality of clustering, the models adequately describe the relationship between recovery time and individual medical and physiological indicators. The results can be used in medical and biological research, in school pedagogical practice and in other studies.
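The per-cluster regression step and its coefficient of determination can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: it uses ordinary least squares with an intercept, and the function name `fit_per_cluster` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_per_cluster(X, y, labels):
    """Fit y ~ intercept + X by ordinary least squares within each cluster.

    Assumes each cluster has more rows than coefficients and a non-constant
    response; with too few rows the fit is exact and R^2 is trivially 1.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    labels = np.asarray(labels)
    results = {}
    for c in np.unique(labels):
        mask = labels == c
        A = np.column_stack([np.ones(mask.sum()), X[mask]])   # design matrix with intercept
        coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        resid = y[mask] - A @ coef
        ss_res = float(resid @ resid)                         # residual sum of squares
        ss_tot = float(((y[mask] - y[mask].mean()) ** 2).sum())  # total sum of squares
        results[int(c)] = {"coef": coef, "r2": 1.0 - ss_res / ss_tot}
    return results

if __name__ == "__main__":
    # Synthetic signs and recovery times, for illustration only.
    rng = np.random.default_rng(1)
    signs = rng.standard_normal((30, 3))
    cluster_labels = np.repeat([0, 1, 2], 10)
    recovery_days = 14.0 + signs @ np.array([1.5, -0.5, 2.0]) + 0.1 * rng.standard_normal(30)
    for c, res in fit_per_cluster(signs, recovery_days, cluster_labels).items():
        print(c, res["coef"], res["r2"])
```

Each entry of the returned dictionary holds the intercept followed by one coefficient per sign, together with the R² of that cluster's model.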

Published

2026-01-29

How to Cite

KAMINSKY, R., SHAKHOVSKA, N., & DMYTRIV, G. (2026). CLUSTER ANALYSIS OF SMALL SAMPLE MULTIDIMENSIONAL DATA ON THE EXAMPLE OF RECOVERY TIMES IN COVID-19 DISEASE. Herald of Khmelnytskyi National University. Technical Sciences, 361(1), 127-134. https://doi.org/10.31891/2307-5732-2026-361-16