IMPROVEMENT OF THE DETERMINISTIC METHOD OF THE TEXT DATA CORPORA GENERATION
DOI:
https://doi.org/10.31891/2307-5732-2024-333-2-69
Keywords:
natural language processing, software quality assurance, text data corpora, corpora generation, CorDeGen method
Abstract
This paper addresses the generation of text data corpora for use in solving software engineering problems that arise when developing information systems for natural language processing (NLP). One method intended for this purpose is the basic CorDeGen method; however, analysis revealed certain disadvantages. One of them is that NLP methods can, at the pre-processing stage, remove some of the terms generated by this method from the texts, treating them as stop words of a certain language. Removing these terms distorts the distribution of terms between texts predicted by the CorDeGen method, so the result of processing the corpus with a given NLP method differs significantly from the expected one.
To address this disadvantage, the paper proposes a new modified method, CorDeGen+, which introduces an additional, language-dependent stage that checks each generated term for admissibility and, if necessary, replaces it with another one. The proposed method preserves all the advantages of the basic CorDeGen method, and also retains its other possible disadvantages apart from the corrected one. The paper examines language variations of the proposed method for the four most common European languages, as well as a variation for languages that use non-Latin letters.
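As a rough illustration of the admissibility stage described above, the following minimal Python sketch checks each generated term against a stop-word list and replaces it if needed. It is not the CorDeGen+ algorithm itself: the stop-word set, the function names `make_admissible` and `admissible_terms`, and the replacement rule (repeating the last character) are all assumptions made for this example.

```python
from typing import Iterable, List, Set

# Hypothetical, deliberately tiny stop-word list for English.
# A real pipeline would use the list of the NLP library under test.
ENGLISH_STOP_WORDS: Set[str] = {"a", "an", "be", "he", "me", "the"}


def make_admissible(term: str, stop_words: Set[str], max_attempts: int = 10) -> str:
    """Return `term` unchanged if it is admissible, otherwise a replacement.

    The replacement rule is an assumption for illustration only: the last
    character is repeated until the candidate is no longer a stop word.
    The rule is deterministic, so the same input always yields the same term.
    It does not check for collisions with other generated terms.
    """
    candidate = term
    for _ in range(max_attempts):
        if candidate.lower() not in stop_words:
            return candidate
        candidate += candidate[-1]
    raise ValueError(f"no admissible replacement found for {term!r}")


def admissible_terms(terms: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Apply the admissibility check to every generated term, keeping order."""
    return [make_admissible(term, stop_words) for term in terms]


if __name__ == "__main__":
    # "a" and "be" would be removed as stop words; "cafe" is left as is.
    print(admissible_terms(["a", "cafe", "be"], ENGLISH_STOP_WORDS))
```

Because the check depends only on the stop-word set passed in, a language variation in this sketch amounts to supplying a different set; how CorDeGen+ actually selects replacement terms is defined in the paper, not here.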
An experimental test confirmed that the CorDeGen+ method corrects the described disadvantage of the basic CorDeGen method. The test also showed that the slowdown of corpus generation caused by the additional stage depends on the corpus size. For micro-corpora (100 unique terms), the slowdown reaches 39%, but it drops sharply as the corpus size increases, and for super-large corpora (312,500 unique terms) it is at most 6.8%.