USING A MICROSERVICE APPROACH IN THE PROCESS OF WEB SCRAPING OF LARGE VOLUMES OF DATA FOR WEBSITES WITH DYNAMIC CONTENT

Authors

DOI:

https://doi.org/10.31891/2307-5732-2023-327-5-243-248

Keywords:

microservice, web scraping, data

Abstract

One of the main challenges of web scraping is handling dynamic content. Modern websites often use technologies such as AJAX and JavaScript to update content without reloading the page. The growing complexity of such pages complicates data collection, because standard HTTP request methods cannot retrieve the full content of the page. Microservice architecture can help address this problem because it allows tasks to be distributed among small, independent services. An analysis of existing research and publications shows that commonly used web scraping techniques can be time-consuming when scanning large amounts of data. Various approaches are used to address this, such as the fast XPath selector engine. The article aims to study the features of the microservice approach in web scraping and to consider the main advantages of microservice architecture. It examines the peculiarities of different approaches to accessing website elements, focusing on CSS selectors, Regex, and XPath. In the performance measurements, XPath averaged 96% successful requests, rising to 98% when microservices were used, which gives it higher reliability and resilience than the other methods. The CSS selector method used the least bandwidth, and the Regex method had the lowest CPU and memory usage. The study found that microservice architecture can improve system performance and can provide higher reliability and resilience when parsing large amounts of data, but it can lead to longer execution and turnaround times.
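
To make the difference between two of the element access methods named in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It extracts the same values once with a Regex and once with an XPath query, using only the Python standard library. The page fragment, class names, and function names are hypothetical, and `xml.etree.ElementTree` supports only a limited XPath subset and requires well-formed markup. CSS selectors are omitted because they would need a third-party parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for an already rendered page.
PAGE = """
<html>
  <body>
    <div class="product"><span class="name">Widget</span><span class="price">9.99</span></div>
    <div class="product"><span class="name">Gadget</span><span class="price">19.50</span></div>
  </body>
</html>
""".strip()


def prices_with_regex(markup):
    """Regex: match a pattern on the raw text; no document tree is built."""
    pattern = r'<span class="price">([\d.]+)</span>'
    return [float(value) for value in re.findall(pattern, markup)]


def prices_with_xpath(markup):
    """XPath: parse the markup into a tree, then select nodes by structure."""
    root = ET.fromstring(markup)
    nodes = root.findall(".//div[@class='product']/span[@class='price']")
    return [float(node.text) for node in nodes]


if __name__ == "__main__":
    print(prices_with_regex(PAGE))
    print(prices_with_xpath(PAGE))
```

The Regex version skips parsing altogether, which is one plausible reason for a lower CPU and memory footprint, but it depends on the exact text of the markup. The XPath version pays for building a tree and in return tolerates changes in attribute order or whitespace.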

Published

2023-10-31

How to Cite

SUSHYNSKYI, O., KOTSUN, V., SKLIARENKO, O., & LYTVYNENKO, L. (2023). USING A MICROSERVICE APPROACH IN THE PROCESS OF WEB SCRAPING OF LARGE VOLUMES OF DATA FOR WEBSITES WITH DYNAMIC CONTENT. Herald of Khmelnytskyi National University. Technical Sciences, 327(5(2)), 243-248. https://doi.org/10.31891/2307-5732-2023-327-5-243-248