A HYBRID APPROACH TO VISUALLY ORIENTED GENERATION OF CULINARY RECIPES BASED ON CONVOLUTIONAL NEURAL NETWORKS AND LARGE LANGUAGE MODELS

DOI:

https://doi.org/10.31891/2307-5732-2026-363-57

Keywords:

Convolutional neural networks, large language models, classification, culinary food, ingredients, recipe, generation, image

Abstract

This article presents a hybrid approach to visually grounded recipe generation that combines computer vision and natural language processing. By integrating a multi-label Convolutional Neural Network with a Large Language Model, the architecture addresses the semantic gap that arises when pixel-level features must be mapped onto culinary text. To resolve the mismatch between coarse whole-dish categorization and fine-grained ingredient composition, the study first analyzed the limitations of conventional single-label classification and then re-engineered the DenseNet-121 architecture with a multi-label output head for concurrent ingredient detection. The vision module, built on transfer learning and trained on the Food-101 dataset, applies cost-sensitive optimization to improve detection accuracy. Text generation is performed by the Llama 3.1 8B model, prompted via In-Context Learning and evaluated with BLEU, ROUGE, and Cosine Similarity metrics. Experimental results confirm the framework's effectiveness: the refined detector achieved a Recall of 0.91, and incorporating visual context into structured prompts raised the mean Cosine Similarity to 0.765, a marked improvement over established baselines in capturing dish-specific variation. The proposed hybrid approach thus bridges the semantic gap between visual data and textual generation: explicitly injecting detected ingredients into the LLM context yields instance-specific recipes rather than template-based outputs, significantly mitigating AI hallucinations and increasing the relevance of the results.
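The multi-label reformulation described above can be illustrated with a minimal sketch. Unlike softmax single-label classification, each ingredient receives an independent sigmoid probability and is reported when it crosses a threshold. The vocabulary, logit values, and threshold below are illustrative assumptions, not values from the paper:

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function mapping a raw logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def detect_ingredients(logits, vocab, threshold=0.5):
    """Multi-label decoding: each ingredient gets an independent sigmoid
    probability, so several ingredients can be detected in one image
    (in contrast to mutually exclusive softmax classes)."""
    return [name for z, name in zip(logits, vocab) if sigmoid(z) > threshold]

# Hypothetical logits for a three-ingredient vocabulary
print(detect_ingredients([2.0, -3.0, 1.0], ["egg", "beef", "noodles"]))
```

In the actual system this decoding sits on top of the modified DenseNet-121 head; the paper's cost-sensitive training (weighting rare ingredients more heavily in the loss) would shift the effective operating point per class.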
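The "structured prompt" step — injecting the detected ingredients into the LLM context via In-Context Learning — can be sketched as follows. The prompt wording, field names, and example format are hypothetical; the paper does not publish its exact template:

```python
def build_recipe_prompt(dish_class: str, ingredients: list[str], example: str) -> str:
    """Assemble an in-context-learning prompt that grounds generation in the
    CNN's output: the dish class and the detected ingredient list are injected
    explicitly so the LLM produces an instance-specific recipe rather than a
    generic template (hypothetical prompt format)."""
    sections = [
        "You are a culinary assistant. Generate a recipe for the dish described below.",
        "Example of the expected output format:\n" + example,
        "Dish class: " + dish_class,
        "Detected ingredients: " + ", ".join(ingredients),
        "Recipe:",
    ]
    return "\n\n".join(sections)

prompt = build_recipe_prompt(
    "pad_thai",
    ["rice noodles", "egg", "peanuts", "lime"],
    "Title: ...\nIngredients: ...\nSteps: ...",
)
print(prompt)
```

Conditioning on the detected ingredient list is what the abstract credits with reducing hallucinated ingredients: the model is asked to write around evidence from the image instead of sampling a canonical recipe for the dish class.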
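The Cosine Similarity metric used for evaluation (the reported mean of 0.765) compares embedding vectors of generated and reference recipes. A minimal stdlib-only sketch of the metric itself; the embedding model that would produce the vectors is outside this snippet:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors:
    dot(a, b) / (||a|| * ||b||). Ranges from -1 to 1; 1 means the
    generated and reference recipe embeddings point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical embeddings score 1.0; orthogonal ones score 0.0
print(cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Unlike BLEU and ROUGE, which reward n-gram overlap, this semantic metric can credit a generated recipe that paraphrases the reference — which is why the abstract leans on it to demonstrate that visual grounding captures dish-specific variation.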

Published

2026-03-26

How to Cite

MINUKHIN, S., & SHAPOSHNYK, M. (2026). A HYBRID APPROACH TO VISUALLY ORIENTED GENERATION OF CULINARY RECIPES BASED ON CONVOLUTIONAL NEURAL NETWORKS AND LARGE LANGUAGE MODELS. Herald of Khmelnytskyi National University. Technical Sciences, 363(2), 418-434. https://doi.org/10.31891/2307-5732-2026-363-57