MICRO-EXPRESSION RECOGNITION USING TRANSFORMER ARCHITECTURES

Authors

YAREMCHENKO, O., PUKACH, P.
DOI:

https://doi.org/10.31891/2307-5732-2025-353-4

Keywords:

Hierarchical transformer, Micro-expression recognition, Deep learning, Facial muscle movement, Local self-attention

Abstract

Facial expressions are closely linked to the movements and contractions of facial muscles, where distinct muscle activations reflect different emotional states. In the case of micro-expressions—brief, involuntary facial expressions—these muscle movements are extremely subtle and fleeting, often lasting less than half a second. This subtlety poses a significant challenge for current facial emotion recognition algorithms, many of which are designed for more overt and prolonged expressions. As a result, existing models often struggle with micro-expression recognition due to the low intensity and short duration of the facial cues.

Many state-of-the-art approaches employ self-attention mechanisms to model relationships between tokens in a temporal sequence. However, these models typically overlook the intrinsic spatial relationships among facial landmarks, which are essential for understanding the fine-grained muscle movements involved in micro-expressions. This lack of spatial awareness can lead to suboptimal performance, especially when detecting minimal and localized muscle activity.

To address this, we propose a novel Hierarchical Transformer Network (HTNet), specifically designed to enhance the recognition of micro-expressions by learning localized facial muscle dynamics more effectively. HTNet consists of two core components: a transformer layer that captures local temporal features and an aggregation layer that extracts both local and global semantic representations of facial activity. The model partitions the face into four key regions: left lip, right lip, left eye, and right eye. Each region is processed independently by the transformer layer using localized self-attention to focus on minor muscle movements. The aggregation layer then learns the inter-regional interactions, especially between the eye and lip areas.
Our experiments, conducted on four widely used micro-expression datasets, demonstrate that HTNet significantly outperforms existing methods, establishing a new benchmark for accuracy and robustness in micro-expression recognition tasks.
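The two-stage scheme described in the abstract—self-attention restricted to each of the four facial regions, followed by an aggregation stage over region-level summaries—can be sketched as follows. This is a minimal, illustrative pure-Python sketch, not the authors' implementation: learned query/key/value projection matrices and multi-head splitting are omitted, mean pooling stands in for the aggregation layer's learned summarization, and the region feature values are placeholders.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def local_self_attention(tokens):
    # Scaled dot-product self-attention confined to one token group
    # (here, one facial region). tokens: list of equal-length float vectors.
    # Identity projections for simplicity; a real layer learns W_q, W_k, W_v.
    d_k = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens))
                    for j in range(d_k)])
    return out

def mean_pool(tokens):
    # Collapse a region's attended tokens into one summary vector.
    n = len(tokens)
    return [sum(t[j] for t in tokens) / n for j in range(len(tokens[0]))]

# Hypothetical per-region token features: 3 tokens of dimension 4 each.
regions = {
    "left_eye":  [[0.1 * (i + j) for j in range(4)] for i in range(3)],
    "right_eye": [[0.2 * (i + j) for j in range(4)] for i in range(3)],
    "left_lip":  [[0.3 * (i + j) for j in range(4)] for i in range(3)],
    "right_lip": [[0.4 * (i + j) for j in range(4)] for i in range(3)],
}

# Local stage: attention sees only tokens from the same region,
# so it models subtle muscle movement within that region.
local_out = {name: local_self_attention(toks) for name, toks in regions.items()}

# Aggregation stage: pool each region, then attend across the four
# summaries to capture eye-lip (inter-regional) interactions.
summaries = [mean_pool(local_out[name]) for name in regions]
global_out = local_self_attention(summaries)
```

The key design point the sketch illustrates is the restricted attention scope: because each call to `local_self_attention` receives only one region's tokens, no attention weight is ever spent relating, say, a left-eye token to a right-lip token until the aggregation stage, which operates on compact region summaries instead.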

Published

2025-06-16

How to Cite

YAREMCHENKO, O., & PUKACH, P. (2025). MICRO-EXPRESSION RECOGNITION USING TRANSFORMER ARCHITECTURES. Herald of Khmelnytskyi National University. Technical Sciences, 353(3.2), 36-50. https://doi.org/10.31891/2307-5732-2025-353-4