METHODOLOGY OF WEB PAGE PARSING FOR AUTOMATED COLLECTION OF HETEROGENEOUS DATA
DOI: https://doi.org/10.31891/

Keywords: parsing, structured data, unstructured data, web scraping, information collection automation, websites, content analysis, BeautifulSoup, Scrapy, semantic analysis, text processing, information systems

Abstract
The article examines the methodology of web page parsing as a comprehensive and systematic framework of methods and tools that enables efficient extraction of both structured data (tables, lists, JSON, XML) and unstructured data (text blocks, articles, news reports) from websites. A generalized architecture of the parsing process is proposed, encompassing the stages of data source analysis, tool selection (e.g., BeautifulSoup, Scrapy, Selenium, Puppeteer), construction of parsing rules, handling of exceptional cases (captchas, JavaScript rendering, dynamically loaded content), and storage of results in formats suitable for subsequent large-scale analysis and integration into information systems. The methodology includes a thorough comparative evaluation of methods designed for processing structured and unstructured data. The analysis focuses on approaches such as semantic interpretation, the application of regular expressions, advanced natural language processing (NLP) technologies, and algorithmic solutions based on machine learning. The study identifies key challenges in automated information retrieval, such as access policy restrictions, legal and ethical considerations, the complexity of multimodal content processing, and frequent changes in the underlying structure of web pages. Practical recommendations are offered to improve the robustness and adaptability of parsers against modifications of online resources while maintaining compliance with data ethics standards. The effectiveness of the proposed methodology is validated by the development of an automated news monitoring system that performs regular extraction of publication texts, their preprocessing, and structured storage in a database for subsequent thematic classification and knowledge discovery.
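The core stages the abstract describes (constructing parsing rules and extracting structured fields from a news page for storage) can be sketched with BeautifulSoup, one of the tools the article names. This is a minimal illustrative sketch, not the authors' implementation: the HTML sample, field names, and CSS selectors are assumptions chosen for demonstration.

```python
# Illustrative sketch of the extraction stage: parsing rules are expressed
# as CSS selectors, and the result is a record ready for database storage.
# The sample HTML and field names are hypothetical, not from the article.
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<html><body>
  <article>
    <h1>Sample headline</h1>
    <time datetime="2025-01-15">15 Jan 2025</time>
    <div class="body"><p>First paragraph.</p><p>Second paragraph.</p></div>
  </article>
</body></html>
"""

def parse_article(html: str) -> dict:
    """Apply selector-based parsing rules to one news page."""
    soup = BeautifulSoup(html, "html.parser")
    article = soup.select_one("article")
    return {
        "title": article.select_one("h1").get_text(strip=True),
        "date": article.select_one("time")["datetime"],
        "text": " ".join(p.get_text(strip=True)
                         for p in article.select("div.body p")),
    }

record = parse_article(SAMPLE_HTML)
```

In a full pipeline of the kind the article validates, such a record would then be preprocessed and inserted into a database for thematic classification; dynamic pages requiring JavaScript rendering would instead be fetched with Selenium or Puppeteer before this parsing step.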
The results of the study can be applied not only in the design of information systems that require continuous and scalable data collection, but also in scientific research domains where objective, reproducible, and high-volume information extraction from the web environment is a key factor.
Copyright (c) 2025 DENYS IVANOV, IHOR HARKUSHA (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.