ALGORITHMS AND CHALLENGES IN STREAMING DATA PROCESSING
DOI:
https://doi.org/10.31891/2307-5732-2023-327-5-42-42
Keywords:
streaming data, stream processing, online data analysis, message queues
Abstract
The basic methods and tools of data analysis are considered in the context of data streams rather than batches. The fundamental principles and algorithms are the same in both cases, but streaming data imposes significant constraints on memory and time and therefore requires additional methods for accumulation, filtering, and preprocessing. These methods are mostly applied to raw data, which is now produced continuously in many areas: sports analytics, medical analytics, patient monitoring, real-time stock market analysis, website visitor analytics, infrastructure monitoring, and predictive maintenance, as well as scientific research projects that gather vast amounts of data.
This paper provides a comparative analysis of the main types of algorithms and discusses current applied problems in stream processing and online data analysis. Specifically, Stream DBSCAN, DGIM, HyperLogLog, the Bloom filter, and Count-Min Sketch are described and compared in terms of their applications and computational complexity. The Kafka message broker and the Spark Streaming framework are briefly described, although the number of available tools and frameworks is constantly expanding. Such tools support windowing, event-time processing, and state management, offer machine learning libraries, and enable advanced analytics on streaming data. They also address scalability and provide the throughput needed to handle large volumes of data.
From a technical standpoint, two factors are equally important for streaming data analysis: the choice of the technology stack and the choice of the algorithm. It is argued that the most important tasks are obtaining raw streaming data, selecting the optimal analysis algorithm, and accounting for the specifics of the data. A further challenge for future research is combining different stream processing algorithms in a multi-stage distributed architecture to achieve a higher-quality resulting model.
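To make the memory constraint discussed above concrete, the following is a minimal sketch of one of the named algorithms, Count-Min Sketch, which estimates item frequencies in a stream using a fixed-size table. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the class name `CountMinSketch`, the `width` and `depth` defaults, and the use of salted BLAKE2b hashes are all choices made here for the example.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter backed by a fixed depth x width table.

    Hypothetical illustration: parameter names and defaults are assumptions.
    """

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        # Memory use is width * depth counters, independent of stream length.
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, obtained by salting with the row number.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum over rows
        # is the tightest available estimate and never undercounts.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


# Usage: count page views in a (made-up) stream of website visits.
sketch = CountMinSketch(width=256, depth=4)
for page in ["/home", "/cart", "/home", "/checkout", "/home", "/cart"]:
    sketch.add(page)
print(sketch.estimate("/home"))  # at least 3; exact unless hashes collide
```

The design trade-off is the one the abstract points to: a wider or deeper table lowers the overestimation error but costs more memory, while the per-item update time depends only on `depth`.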