METHOD FOR DETECTING SUBJECT AREA TARGET OBJECTS IN TEXT CONTENT
DOI:
https://doi.org/10.31891/2307-5732-2024-343-6-23
Keywords:
machine learning, NLP, NER, named entities, target objects
Abstract
A method for detecting subject area target objects is proposed. It uses machine learning algorithms for adaptive object recognition that takes the specifics of the subject area into account, which significantly reduces data processing time and lowers the risk of losing important information. The method transforms input data, namely the text under study and a pre-processed, balanced corpus of texts from the subject area, into output data: a set of target objects for the text under study. This set combines the keywords found by several methods, with repetitions removed, and the set of named entities recognized by NER and grouped by lemma. The proposed method differs from existing ones in that it takes into account both keywords and named entities of the subject area; including the named entities made it possible to increase the accuracy of detecting subject area target objects.
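The combination step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two keyword extractors, the toy lemma table, the stopword list, and all function names are hypothetical stand-ins, and the entity mentions are assumed to come from some external NER model. The sketch only shows how keyword lists from several methods could be merged without repetitions and how entity mentions could be grouped by lemma.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy lemma table; a real pipeline would call a proper lemmatizer.
TOY_LEMMAS = {"networks": "network", "models": "model"}
STOPWORDS = {"the", "a", "of", "and", "in", "are", "is", "for"}


def lemmatize(token):
    """Lowercase a token and map it to its lemma via the toy table."""
    token = token.lower()
    return TOY_LEMMAS.get(token, token)


def tokenize(text):
    """Split text into runs of letters (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text)


def frequent_keywords(text, top_n=3):
    """Stand-in keyword method 1: the most frequent non-stopword lemmas."""
    lemmas = (lemmatize(t) for t in tokenize(text))
    counts = Counter(l for l in lemmas if l not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]


def corpus_keywords(text, corpus):
    """Stand-in keyword method 2: lemmas of the text that also occur in the corpus."""
    vocab = {lemmatize(t) for doc in corpus for t in tokenize(doc)} - STOPWORDS
    lemmas = (lemmatize(t) for t in tokenize(text))
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(l for l in lemmas if l in vocab))


def merge_unique(*keyword_lists):
    """Union of several keyword lists without repetitions, in first-seen order."""
    return list(dict.fromkeys(k for lst in keyword_lists for k in lst))


def group_entities(mentions):
    """Group entity mentions (surface forms) under a shared lemmatized key."""
    groups = defaultdict(list)
    for mention in mentions:
        key = " ".join(lemmatize(t) for t in tokenize(mention))
        if mention not in groups[key]:
            groups[key].append(mention)
    return dict(groups)


def detect_target_objects(text, corpus, entity_mentions):
    """Combine merged keywords and lemma-grouped entities into one result."""
    keywords = merge_unique(frequent_keywords(text), corpus_keywords(text, corpus))
    return {"keywords": keywords, "entities": group_entities(entity_mentions)}


if __name__ == "__main__":
    text = "Neural networks and language models. The network is large."
    corpus = ["A model of language", "Training neural networks"]
    mentions = ["Neural Networks", "neural network", "Kyiv"]  # assumed NER output
    print(detect_target_objects(text, corpus, mentions))
```

Keeping the keyword methods as separate functions means any of them can be swapped for a real extractor without touching the merging or grouping logic.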
To study the effectiveness of the developed method, a training dataset of 400 texts in the Ukrainian language was formed. To validate the proposed method, software was developed that converts the text content of files from the test sample into a set of subject area target objects. Separate console software was created to process the obtained lists of target objects for the studied texts together with dictionaries of the subject areas selected according to the dataset. The study showed that the target objects found by the method are suitable for the subsequent task of classification: in terms of Euclidean distance, texts of the same category group together, while the distance between texts of different categories increases. This determines the beneficial effect and scope of application of the developed method.
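The Euclidean-distance check mentioned above can be sketched as follows. The abstract does not say how texts are turned into vectors, so this sketch assumes that each text is represented by the number of its target objects found in each subject-area dictionary. The dictionaries, texts, and function names are toy, hypothetical values for illustration only, not data from the study.

```python
import math
from itertools import combinations


def to_vector(target_objects, dictionaries):
    """Count how many target objects of a text occur in each subject-area dictionary."""
    return [len(set(target_objects) & terms) for terms in dictionaries.values()]


def mean_distances(vectors, labels):
    """Return (mean same-category distance, mean cross-category distance).

    Assumes at least one same-category pair and one cross-category pair exist.
    """
    intra, inter = [], []
    for (i, u), (j, v) in combinations(enumerate(vectors), 2):
        distance = math.dist(u, v)  # Euclidean distance between two points
        (intra if labels[i] == labels[j] else inter).append(distance)
    return sum(intra) / len(intra), sum(inter) / len(inter)


if __name__ == "__main__":
    # Toy subject-area dictionaries (hypothetical).
    dictionaries = {
        "medicine": {"patient", "therapy", "diagnosis"},
        "law": {"court", "contract", "statute"},
    }
    # Toy target-object sets with their category labels (hypothetical).
    texts = [
        ({"patient", "therapy", "clinic"}, "medicine"),
        ({"diagnosis", "patient", "contract"}, "medicine"),
        ({"court", "contract"}, "law"),
        ({"statute", "court", "judge"}, "law"),
    ]
    vectors = [to_vector(objs, dictionaries) for objs, _ in texts]
    labels = [label for _, label in texts]
    intra, inter = mean_distances(vectors, labels)
    print(f"mean same-category distance:  {intra:.3f}")
    print(f"mean cross-category distance: {inter:.3f}")
```

A mean same-category distance below the mean cross-category distance is the kind of grouping the abstract reports; on real data the gap would depend on the quality of the dictionaries and the detected target objects.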