INFLUENCE OF THE MORPHOLOGY OF TEXT AND IMAGE VECTOR TRANSFORMATION LAYERS ON THE ACCURACY OF THE CLIP MODEL
DOI:
https://doi.org/10.31891/2307-5732-2023-329-6-181-182
Keywords:
neural networks, CLIP, image description, vector transformations
Abstract
Establishing relationships between images and text is difficult because identical objects can appear in a vast range of variations, forms, and representations in both media. Since the introduction of the CLIP model in 2021, the field has grown rapidly, and new CLIP-based models have been developed. These models are widely used for text-to-image generation, image inpainting, and image description. The significance of this research lies in improving methods for analyzing the interplay between text and visual data in advanced AI models such as CLIP, which combine multiple neural networks. Such improvement is crucial for accurate and efficient information processing, particularly in computer vision and natural language processing. The primary aim of this study is to explore how modifications to the transformation layers of the CLIP model, which adjust the lengths of the text and image vectors, affect its accuracy. The experiments used image encoders based on ResNet-50 and ViT-B/32, the BERT text encoder, and various combinations and types of hidden neural network layers. The results demonstrate that using multiple linear layers with a normalization layer and progressively shortening the data vectors can improve the CLIP model's accuracy by 10-15%, depending on the loss function and the image encoder used in training. However, reducing the vector lengths for textual and visual data too sharply, or using too many neural layers for processing, can degrade the model's accuracy. The architectural solutions proposed in this research are tailored to address these challenges. They focus on optimizing the morphology of the transformation layers and carefully adjusting the vector sizes so that the model retains enough information for accurate analysis without being burdened by unnecessary data or complexity.
The study not only contributes to the ongoing development of more accurate and efficient AI models for handling complex text and image relationships but also provides insights into the importance of balance and precision in AI architecture design.
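The transformation layers described in the abstract (several linear layers that progressively shorten the vector, followed by a normalization layer) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the layer widths (768, 384, 192 for text; 2048, 512, 192 for images), the ReLU activation between layers, the random weights, and all function names are assumptions chosen only for illustration. Random vectors stand in for the encoder outputs.

```python
import math
import random


def make_linear(in_dim, out_dim, rng):
    """Return (weights, bias) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(x, layer):
    """Compute W @ x + b for one input vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def l2_normalize(x):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / norm for v in x]


def make_head(dims, rng):
    """Build a stack of linear layers whose widths follow `dims`."""
    return [make_linear(a, b, rng) for a, b in zip(dims, dims[1:])]


def project(x, head):
    """Apply the linear layers, then layer normalization and L2 scaling."""
    for i, layer in enumerate(head):
        x = linear(x, layer)
        if i < len(head) - 1:
            x = [max(0.0, v) for v in x]  # ReLU: an assumed activation
    return l2_normalize(layer_norm(x))


rng = random.Random(0)

# Hypothetical widths: each head shortens its input step by step
# until both modalities share a 192-dimensional space.
text_head = make_head([768, 384, 192], rng)
image_head = make_head([2048, 512, 192], rng)

# Random stand-ins for a text encoder output and an image encoder output.
text_vec = [rng.gauss(0, 1) for _ in range(768)]
image_vec = [rng.gauss(0, 1) for _ in range(2048)]

text_emb = project(text_vec, text_head)
image_emb = project(image_vec, image_head)

# Cosine similarity of the two unit-length embeddings.
similarity = sum(t * i for t, i in zip(text_emb, image_emb))
print(len(text_emb), len(image_emb), round(similarity, 4))
```

Changing the lists passed to make_head is one way to experiment with the trade-off the abstract reports: a gentle reduction keeps more information in the shared space, whereas a very short final width or a long stack of layers corresponds to the configurations the study found to hurt accuracy.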