OPTIMIZATION OF MEMORY USE BY IMPLEMENTATIONS OF THE BASIC CORDEGEN METHOD

Authors

DOI:

https://doi.org/10.31891/2307-5732-2025-353-3

Keywords:

natural language processing, corpora generation, CorDeGen method, memory consumption optimization

Abstract

This paper addresses the task of generating text corpora “on demand” as input data for software engineering problems that arise when developing information systems that process such corpora. One method for this task is the basic CorDeGen method. However, as the analysis showed, practically none of the existing studies consider optimizing practical metrics of software implementations of this method, such as memory usage. Only a few papers propose pre-allocating memory for generated texts “with excess” to simplify and speed up generation by removing unnecessary checks and repeated memory allocation. This approach, implemented as a fast heuristic formula, leads to increased memory usage in most cases.

To address this shortcoming, the paper proposes a formula that gives an exact estimate of the length of each text from its ordinal index and the input parameter of the basic CorDeGen method (the number of unique terms). The formula accounts for the set of terms that occur in a particular text, their lengths, the number of occurrences of each term, and the lengths of the separators between occurrences of the same term and between different terms.
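The ingredients listed above can be illustrated with a minimal sketch. It is not the formula proposed in the paper: it does not derive the set of terms or their occurrence counts from the text index and the number of unique terms, but takes them as hypothetical input. The function names, the `same_sep` and `diff_sep` parameters, and the sample terms are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Hypothetical input: (term, number of occurrences) pairs for a single text.
TermCounts = Sequence[Tuple[str, int]]


def exact_text_length(terms: TermCounts, same_sep: str = " ", diff_sep: str = " ") -> int:
    """Exact character count of a text in which each term is repeated
    `count` times (joined by `same_sep`) and the resulting groups of
    different terms are joined by `diff_sep`."""
    groups = [(term, count) for term, count in terms if count > 0]
    if not groups:
        return 0
    # Characters contributed by the term occurrences themselves.
    term_chars = sum(len(term) * count for term, count in groups)
    # Separators between occurrences of the same term.
    same_chars = sum(len(same_sep) * (count - 1) for _, count in groups)
    # Separators between groups of different terms.
    diff_chars = len(diff_sep) * (len(groups) - 1)
    return term_chars + same_chars + diff_chars


def build_text(terms: TermCounts, same_sep: str = " ", diff_sep: str = " ") -> str:
    """Builds the text itself, used here only to check the length estimate."""
    return diff_sep.join(
        same_sep.join([term] * count) for term, count in terms if count > 0
    )


# Illustrative term set; real CorDeGen terms and counts may differ.
sample = [("t0", 3), ("t17", 2)]
assert exact_text_length(sample) == len(build_text(sample))
```

With the length known in advance, an implementation can allocate a buffer of exactly that size before generation instead of reserving memory “with excess”.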

The experimental verification showed that using the proposed formula for exact text length estimation reduces the memory consumption of the reference implementation of the basic CorDeGen method and of its parallel modifications. Compared with the existing fast heuristic formula, the reduction grows with corpus size, from 3% for small corpora (2500 unique terms) to 10% for very large corpora (312500 unique terms). At the same time, the slowdown of the generation process decreases as corpus size increases, from 17% to 6% over the same range of sizes. In practice, trading a longer “net” generation time for lower memory consumption can be especially useful on systems with little or limited available memory, where it helps avoid memory overuse.

Published

2025-06-16

How to Cite

YUSYN, Y. (2025). OPTIMIZATION OF MEMORY USE BY IMPLEMENTATIONS OF THE BASIC CORDEGEN METHOD. Herald of Khmelnytskyi National University. Technical Sciences, 353(3.2), 29-35. https://doi.org/10.31891/2307-5732-2025-353-3