COMPARATIVE STUDY OF MACHINE LEARNING METHODS FOR STREAMING DATA PROCESSING

IVAN KHAMAR; IHOR OLENYCH

doi:10.31891/2307-5732-2026-361-54

Authors

IVAN KHAMAR, Ivan Franko National University of Lviv, https://orcid.org/0009-0000-0514-903X
IHOR OLENYCH, Ivan Franko National University of Lviv, https://orcid.org/0000-0002-6642-0222


Keywords:

machine learning, LightGBM, XGBoost, Kafka, forecasting

Abstract

The object of this research is the analysis of large-scale streaming cryptocurrency data with machine learning algorithms. Handling terabyte-scale, high-velocity data streams is a critical challenge because classical machine learning methods are limited in both computational capacity and accuracy when faced with the volume and complexity of millions of temporal records. The principal result is a distributed processing pipeline built around a Feature Store architecture. With this pipeline, LightGBM and XGBoost achieved superior predictive performance (R² of 0.9998 and 0.9997, respectively) while processing 1.33 million streaming records across 100 cryptocurrency pairs. The methodology included a comprehensive feature engineering phase that extracted temporal, statistical, and technical indicators, such as rolling means, volatility measures, and lagged price values, which are crucial for capturing dependencies in big data. The performance advantage is attributed to the architectural capabilities of gradient boosting algorithms. The proposed pipeline shifts processing from conventional linear approaches to tree-based ensemble methods with optimized memory management, demonstrating that gradient boosting algorithms offer computational efficiency and pattern recognition capabilities that Decision Tree, Random Forest, and Regression methods lack. In practice, the findings provide clear guidelines for big data practitioners. The Feature Store architecture with temporal stratified sampling is a scalable framework that achieves a 5.7x data reduction and memory savings of nearly 82%. For production systems handling high-velocity streaming data, gradient boosting algorithms (particularly LightGBM, with a training time of 0.63 s) are preferable to traditional methods for achieving both accuracy and computational efficiency.
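The feature engineering step named in the abstract (rolling means, volatility measures, lagged price values) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name build_features, the window size, the lag set, and the definition of volatility as the standard deviation of simple returns are all assumptions made for illustration, and only the Python standard library is used.

```python
from statistics import mean, pstdev


def build_features(prices, window=3, lags=(1, 2)):
    """Hypothetical helper: one feature row per time step with enough history.

    Every feature for step t is computed from prices strictly before t,
    so the target (the price at t) never leaks into its own features.
    """
    if window < 2:
        raise ValueError("window must be at least 2 to compute a return")
    rows = []
    start = max(window, max(lags))  # first step with a full window and all lags
    for t in range(start, len(prices)):
        past = prices[t - window:t]  # the `window` prices preceding step t
        # Simple returns between consecutive prices inside the window.
        returns = [past[i] / past[i - 1] - 1.0 for i in range(1, len(past))]
        row = {
            "rolling_mean": mean(past),
            "volatility": pstdev(returns),  # assumed definition of volatility
            "target": prices[t],
        }
        for k in lags:
            row[f"lag_{k}"] = prices[t - k]  # price k steps before t
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Made-up prices for a single pair, purely for demonstration.
    sample = [100.0, 102.0, 101.0, 103.0, 104.0]
    for feature_row in build_features(sample):
        print(feature_row)
```

Rows of this shape could then be passed to a gradient boosting regressor such as LightGBM or XGBoost. In a streaming setting the same quantities would normally be updated incrementally for each trading pair instead of being recomputed over the full history.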
