NEURAL NETWORKS FOR OBJECT RECOGNITION IN IMAGES: A COMPARATIVE ANALYSIS
DOI: https://doi.org/10.31891/2307-5732-2025-357-11

Keywords: R-CNN, YOLO, object recognition, computer vision

Abstract
This article presents a systematic analysis of the architectural evolution in object detection networks, tracing pivotal advancements from the pioneering Region-based Convolutional Neural Network (R-CNN) framework to contemporary YOLOv8 implementations. Early two-stage methodologies, including Fast R-CNN and Faster R-CNN, established foundational region-proposal paradigms but faced inherent computational inefficiencies. The paradigm shift toward unified, single-shot detection was catalyzed by the Single Shot Detector (SSD) and You Only Look Once (YOLO) architectures, which enabled real-time inference through end-to-end spatial grid processing.
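To make the spatial-grid formulation concrete, the following minimal sketch decodes one grid cell's raw outputs into an image-space bounding box. It follows the YOLOv2/v3-style parameterization rather than any single release; all tensor values, the anchor size, and the stride are illustrative placeholders.

```python
# Minimal sketch of single-shot grid decoding (YOLOv2/v3-style).
# Values are illustrative, not taken from any specific trained model.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_cell(t, cell_x, cell_y, anchor_w, anchor_h, stride):
    """Decode raw outputs t = (tx, ty, tw, th) for one grid cell.

    The center offsets are squashed into [0, 1] so the box center stays
    inside its cell; width and height scale a prior anchor exponentially.
    """
    tx, ty, tw, th = t
    bx = (sigmoid(tx) + cell_x) * stride  # center x in pixels
    by = (sigmoid(ty) + cell_y) * stride  # center y in pixels
    bw = anchor_w * np.exp(tw)            # box width in pixels
    bh = anchor_h * np.exp(th)            # box height in pixels
    return bx, by, bw, bh

# Example: cell (7, 5) on a 13x13 grid over a 416x416 image (stride 32),
# with a hypothetical 116x90-pixel anchor.
print(decode_cell(np.array([0.2, -0.1, 0.05, 0.3]), 7, 5, 116, 90, 32))
```

The sigmoid keeps each predicted center inside its own cell, which is what lets every cell independently own the objects whose centers fall within it, enabling a single forward pass over the whole grid.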
Subsequent iterations of YOLO demonstrate progressive resolution of critical limitations. YOLOv1-v3 introduced multi-scale feature hierarchies via Feature Pyramid Networks (FPN) and anchor-based localization, enhancing small-object detection. YOLOv4 integrated cross-stage partial networks (CSPDarknet) and path aggregation (PANet) to optimize gradient flow, while advanced augmentation strategies improved robustness. YOLOv5/v6 refined these principles with hardware-aware optimizations, supporting streamlined deployment via ONNX and TensorRT runtimes.
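As an illustration of that deployment path, the sketch below exports a pretrained checkpoint to ONNX and runs the resulting graph with ONNX Runtime. It assumes the Ultralytics Python package and onnxruntime are installed (shown with a YOLOv8 checkpoint; YOLOv5 ships an equivalent export.py script), and the checkpoint name and input size are examples.

```python
# Minimal sketch of the ONNX deployment path; assumes
# `pip install ultralytics onnxruntime`. Checkpoint name is an example.
import numpy as np
import onnxruntime as ort
from ultralytics import YOLO

# Export the PyTorch checkpoint to a framework-neutral ONNX graph.
model = YOLO("yolov8n.pt")               # pretrained nano model
onnx_path = model.export(format="onnx")  # writes yolov8n.onnx

# Run the exported graph with ONNX Runtime.
session = ort.InferenceSession(onnx_path)
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 640, 640).astype(np.float32)  # NCHW input
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)  # raw predictions, e.g. (1, 84, 8400) for COCO
```

TensorRT consumes the same ONNX file through its parser, so a single export serves both runtimes.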
Later innovations address persistent challenges: YOLOv7's extended efficient layer aggregation network (E-ELAN) enhanced parameter utilization, whereas YOLOv8's anchor-free mechanism and Distribution Focal Loss (DFL) significantly improved bounding box precision. Throughout this evolution, techniques such as decoupled heads and post-training quantization have progressively mitigated constraints related to computational latency, model size, and edge-device compatibility.
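A minimal sketch of DFL, as defined in the Generalized Focal Loss paper that YOLOv8 adopts: each box edge is predicted as a discrete distribution over integer bins, and the loss concentrates probability mass on the two bins bracketing the continuous target. The bin count and input values below are illustrative.

```python
# Minimal numpy sketch of Distribution Focal Loss (DFL) for one box edge.
# Bin count and values are illustrative.
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def dfl(logits, target):
    """DFL for one box edge.

    logits: raw scores over the discrete bins 0..n-1
    target: continuous distance-to-edge in bin units (0 <= target < n-1)
    """
    probs = softmax(logits)
    lo = int(np.floor(target))  # left integer bin  y_i
    hi = lo + 1                 # right integer bin y_{i+1}
    w_lo = hi - target          # weight (y_{i+1} - y)
    w_hi = target - lo          # weight (y - y_i)
    return -(w_lo * np.log(probs[lo]) + w_hi * np.log(probs[hi]))

# Example: 16 bins (YOLOv8's default reg_max + 1), true edge offset 7.3.
logits = np.random.randn(16)
print(dfl(logits, 7.3))
```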
Quantitative advances in mean average precision (mAP) and frames-per-second (FPS) metrics are contextualized against real-world applications in autonomous navigation, industrial automation, and surveillance. The study concludes by identifying emergent research trajectories, including lightweight model distillation, 3D scene understanding, and multimodal vision-language integration.
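For reference, mAP averages per-class average precision over all classes (COCO-style evaluation additionally averages over IoU thresholds from 0.5 to 0.95). The sketch below, using toy boxes and match flags, shows the two primitives involved: IoU matching and all-point-interpolated AP at a single threshold.

```python
# Minimal sketch of the metrics underlying mAP. Boxes and flags are toy data.
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def average_precision(tp, n_gt):
    """AP from detections sorted by descending confidence; tp[i] = 1 if
    detection i matched an unclaimed ground truth at the IoU threshold."""
    tp = np.asarray(tp, dtype=float)
    recall = np.cumsum(tp) / n_gt
    precision = np.cumsum(tp) / (np.arange(len(tp)) + 1)
    # Pad with sentinels, take the precision envelope, then step-integrate.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))         # ~0.143
print(average_precision([1, 1, 0, 1, 0], n_gt=4))  # toy AP ~ 0.69
```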
License
Copyright (c) 2025 YAROSLAV HOZAK, SERHII PALII (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.