OPTIMIZATION OF RESOURCE UTILIZATION IN PROCESSING LARGE VOLUMES OF SEMI-STRUCTURED DATA
DOI:
https://doi.org/10.31891/2307-5732-2026-361-8
Keywords:
semi-structured data, resource optimization, IoT, PySpark, Kubernetes, Big Data
Abstract
In the era of digital transformation and the rapid proliferation of IoT devices, organizations increasingly face the challenge of efficiently processing massive volumes of semi-structured data in real time. Such data, originating from sensors, smart devices, and distributed systems, often lacks a consistent structure, which makes its processing computationally expensive and resource-intensive. This paper presents a practical approach to optimizing resource utilization during stream processing of semi-structured IoT data by combining Apache Spark Structured Streaming with Kubernetes-based orchestration.
A synthetic dataset simulating 10,000 sensor readings of various types (temperature, humidity, pressure) was generated to replicate a real-world industrial IoT environment. Apache Spark was employed for the real-time aggregation and analysis of the data stream, while Kubernetes was utilized to dynamically allocate computing resources via the Horizontal Pod Autoscaler (HPA). The proposed method was evaluated using key performance metrics, including average CPU and memory usage, system latency, and processing time per iteration.
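To make the setup above more concrete, the following is a minimal sketch of how such a synthetic stream could be generated and aggregated per sensor type. It is written in plain Python as a stand-in for the Spark aggregation and is not the author's code: the field names (`sensor_id`, `type`, `value`, `timestamp`, `meta`), the value ranges, and the 30% share of records carrying the optional `meta` field are all illustrative assumptions.

```python
import json
import random
from collections import defaultdict
from statistics import mean

# Hypothetical value ranges per sensor type (not taken from the paper).
SENSOR_RANGES = {
    "temperature": (-20.0, 60.0),   # degrees Celsius
    "humidity": (0.0, 100.0),       # percent
    "pressure": (950.0, 1050.0),    # hPa
}


def generate_readings(n=10_000, seed=42):
    """Return n JSON strings simulating semi-structured sensor readings."""
    rng = random.Random(seed)
    sensor_types = list(SENSOR_RANGES)
    readings = []
    for i in range(n):
        sensor_type = rng.choice(sensor_types)
        low, high = SENSOR_RANGES[sensor_type]
        record = {
            "sensor_id": f"sensor-{rng.randint(1, 100)}",
            "type": sensor_type,
            "value": round(rng.uniform(low, high), 2),
            "timestamp": 1_700_000_000 + i,
        }
        # Semi-structured: an optional nested field appears on only some records.
        if rng.random() < 0.3:
            record["meta"] = {"battery": rng.randint(0, 100)}
        readings.append(json.dumps(record))
    return readings


def aggregate_by_type(json_lines):
    """Group readings by sensor type and compute count and average value."""
    values = defaultdict(list)
    for line in json_lines:
        record = json.loads(line)
        values[record["type"]].append(record["value"])
    return {t: {"count": len(v), "avg": mean(v)} for t, v in values.items()}


readings = generate_readings()
summary = aggregate_by_type(readings)
for sensor_type, stats in sorted(summary.items()):
    print(sensor_type, stats["count"], round(stats["avg"], 2))
```

In a Spark Structured Streaming job the same grouping would typically be expressed as a `groupBy` on the parsed `type` column followed by an aggregation. The plain-Python version is used here only so that the logic can be run without a cluster.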
The results demonstrate a significant improvement in performance and efficiency. After applying Kubernetes HPA, average CPU usage decreased from 85% to 55%, memory usage dropped from 80% to 50%, and processing latency was reduced by 25%. A comparative table and performance graphs are included to visualize the effectiveness of the optimization approach.
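The reported figures can be read in two ways, and the short sketch below makes the arithmetic explicit. A drop in CPU usage from 85% to 55% is 30 percentage points, or roughly 35% in relative terms. The sketch also includes the replica-count rule documented for the Kubernetes HPA, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), in simplified form without the controller's tolerance and stabilization behavior. The abstract does not state the starting replica count or the HPA target utilization, so the values used here (2 replicas, a 55% target) are purely hypothetical.

```python
import math


def relative_reduction(before, after):
    """Relative reduction in percent, e.g. 80 -> 50 is a 37.5% reduction."""
    return (before - after) / before * 100.0


def hpa_desired_replicas(current_replicas, current_metric, target_metric):
    """Simplified HPA scaling rule (ignores tolerance and stabilization)."""
    return math.ceil(current_replicas * current_metric / target_metric)


# Figures reported in the abstract.
cpu_drop = relative_reduction(85, 55)
mem_drop = relative_reduction(80, 50)

# Hypothetical scenario: 2 replicas at 85% CPU with an assumed 55% target.
replicas = hpa_desired_replicas(current_replicas=2, current_metric=85, target_metric=55)

print(f"CPU: 30 points, {cpu_drop:.1f}% relative")
print(f"Memory: 30 points, {mem_drop:.1f}% relative")
print(f"Desired replicas: {replicas}")
```

Under these assumed inputs the rule asks for 4 replicas, which illustrates how spreading load across more pods can lower average per-pod utilization.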
This work highlights the value of integrating cloud-native orchestration tools with big data streaming engines to enhance system scalability and responsiveness. The findings indicate that even relatively simple infrastructure configurations, when combined strategically, can yield substantial improvements without resorting to overly complex architectures. Future directions include applying predictive scaling based on machine learning models and further optimizing system configurations for different types and volumes of semi-structured data.
License
Copyright (c) 2026 ВОЛОДИМИР МЕЛЬНИК (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.