DEVELOPMENT OF THE MULTIMODAL HANDLING INTERFACE BASED ON GOOGLE API

Authors

DOI:

https://doi.org/10.31891/2307-5732-2024-335-3-2

Keywords:

speech-to-text conversion, speech recognition, Sequence-to-Sequence, machine learning, artificial intelligence

Abstract

The article demonstrates the potential of Google APIs for developing multimodal interfaces that can improve user experience and productivity in various fields.

Objective. The purpose of the work is to create an architectural approach to the processing and analysis of multimodal data.

Methods. This paper discusses the design and implementation of the interface, including the integration of several Google APIs for speech recognition, natural language processing, and gesture recognition. It describes the technical details of the interface development, including the software and hardware components used, and gives examples of its use in real-world scenarios. Existing approaches were compared in order to select the most advanced and reliable ones for building multimodal audio-to-text systems and for conducting further research. The article also discusses the key stages of building a machine translation strategy based on the Google API and weighs the advantages and disadvantages of rule-based, statistical, and neural network methodologies. The most appropriate software technique and organizational structure for developing solutions that evaluate multimodal data are identified, and two methods for numbering unstructured data are considered in terms of their software architecture and design. A possible next step for this research is to implement the recommended architectural approach in a suitable system and bring it to market.

Results. The proposed system uses Google Cloud Platform services such as Speech-to-Text, Natural Language Processing, and AutoML to provide a reliable way to combine data from different sources and summarize it into relevant output with a success rate.
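
The idea of combining data from different sources into one output can be illustrated with a minimal sketch. It is not the authors' implementation: the names `ModalityResult`, `MultimodalPipeline`, `fake_speech`, and `fake_gesture`, as well as the confidence threshold, are hypothetical and introduced here only for illustration. In a real system, each recognizer would presumably wrap a call to a Google Cloud service such as Speech-to-Text; here the recognizers are local stand-ins so that the example runs without credentials.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ModalityResult:
    """Hypothetical container for the output of one recognizer."""
    modality: str      # e.g. "speech" or "gesture"
    text: str          # textual interpretation of the raw input
    confidence: float  # recognizer confidence in [0, 1]


# A recognizer maps a raw payload to a ModalityResult.
Recognizer = Callable[[bytes], ModalityResult]


class MultimodalPipeline:
    """Routes each input to its recognizer and merges the confident results."""

    def __init__(self, min_confidence: float = 0.5) -> None:
        self._recognizers: Dict[str, Recognizer] = {}
        self._min_confidence = min_confidence  # assumed threshold

    def register(self, modality: str, recognizer: Recognizer) -> None:
        self._recognizers[modality] = recognizer

    def process(self, inputs: Dict[str, bytes]) -> List[ModalityResult]:
        results = []
        for modality, payload in inputs.items():
            recognizer = self._recognizers.get(modality)
            if recognizer is None:
                continue  # no handler registered for this modality
            result = recognizer(payload)
            if result.confidence >= self._min_confidence:
                results.append(result)
        # Most confident interpretation first.
        return sorted(results, key=lambda r: r.confidence, reverse=True)

    def summarize(self, inputs: Dict[str, bytes]) -> str:
        return " | ".join(f"{r.modality}: {r.text}" for r in self.process(inputs))


def fake_speech(audio: bytes) -> ModalityResult:
    # Stand-in for a cloud speech recognizer: treats the bytes as the transcript.
    return ModalityResult("speech", audio.decode("utf-8"), 0.9)


def fake_gesture(frame: bytes) -> ModalityResult:
    # Stand-in for a gesture recognizer that returns a low-confidence guess.
    return ModalityResult("gesture", "swipe_left", 0.4)


pipeline = MultimodalPipeline(min_confidence=0.5)
pipeline.register("speech", fake_speech)
pipeline.register("gesture", fake_gesture)
summary = pipeline.summarize({"speech": b"open chart", "gesture": b"\x00"})
print(summary)
```

In this sketch the low-confidence gesture result is filtered out, so only the speech transcript reaches the summary. Keeping recognizers behind a common callable type is one way to let individual services be swapped without changing the fusion step.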

Conclusions. The experiments confirmed the feasibility of a multimodal data processing interface based on the Google API, and its architectural solution was described. The research aimed to develop an effective and intuitive interface that combines several modalities, such as voice, touch, and gesture recognition, to improve the user experience when interacting with a computing device. The results can be used in the future to create a range of speech-to-text models for specific medical fields, which would improve speech translation, reduce workload, and help medical professionals use their time more efficiently.

Published

2024-05-30

How to Cite

DEVELOPMENT OF THE MULTIMODAL HANDLING INTERFACE BASED ON GOOGLE API. (2024). Herald of Khmelnytskyi National University. Technical Sciences, 335(3(1)), 15-18. https://doi.org/10.31891/2307-5732-2024-335-3-2