ASSESSMENT OF ADEQUACY OF CONTENT BY CONTEXT USING BERT MODEL ENSEMBLE METHODS
DOI:
https://doi.org/10.31891/2307-5732-2023-323-4-118-121
Keywords:
neural networks, BERT, RoBERTa, CatBoost, XGBoost
Abstract
Over the past 20 years, text has dominated the Internet as a means of communicating information. Every day new people are born, and every day new people sign up for social media. Through lack of education or attention, people who use social media to express their thoughts and present a virtual version of themselves sometimes forget that there is another person on the other side of the monitor. This leads to the careless, and sometimes intentional, generation of content that may violate the rules of the communities where it is posted. A primitive and non-scalable way of dealing with uncontrolled generation of social content is manual moderation. It makes sense in private, closed channels of communication with an audience, where one human moderator has enough capacity to control the information space effectively. Although human moderators are reliable, they are lifelong learners shaped by what they read and see, so their moderation may be biased from the point of view of other people in the same information space. Interest in assessing the adequacy and ethics of text is growing as the amount of information generated in social networks increases. The problem is that modern methods of text evaluation cannot work with differently contextualized data: a model trained on one dataset is tied to the context of the environment in which that dataset was collected. The method developed in this paper allows separate models to be trained on different contexts and independent datasets, with each model in the ensemble voting for its own version of the truth according to its local context. We develop a binary message adequacy/normality classifier based on this ensemble approach.
The methods I have chosen are optimal in terms of execution time, training time, RAM usage, and, accordingly, accuracy. CatBoost and XGBoost were chosen for classification, and BERT and its variant RoBERTa were chosen for extracting features and context. I conduct a corresponding analysis and experiment on these methods to verify that the approach is effective.
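The voting idea described in the abstract can be illustrated with a minimal Python sketch. It is not the implementation from the paper. It assumes one binary classifier per context and simple majority voting, with ties resolved as "adequate". The names `ContextModel`, `keyword_scorer`, and `ensemble_predict`, the context labels, and the word lists are all hypothetical. A toy keyword scorer stands in for each BERT or RoBERTa feature extractor followed by a CatBoost or XGBoost classifier.

```python
from dataclasses import dataclass
from typing import Callable, List, Set

# A scorer maps a message to the estimated probability that it is inadequate.
Scorer = Callable[[str], float]


@dataclass
class ContextModel:
    """One ensemble member, trained (hypothetically) on a single context."""
    name: str
    scorer: Scorer
    threshold: float = 0.5

    def vote(self, message: str) -> int:
        # 1 = inadequate, 0 = adequate, judged by this model's own context.
        return 1 if self.scorer(message) >= self.threshold else 0


def keyword_scorer(flagged: Set[str]) -> Scorer:
    """Toy stand-in (assumption) for an encoder plus boosted-tree pipeline."""
    def score(message: str) -> float:
        tokens = [t.strip(".,!?") for t in message.lower().split()]
        return 1.0 if any(t in flagged for t in tokens) else 0.0
    return score


def ensemble_predict(models: List[ContextModel], message: str) -> int:
    """Majority vote over the per-context models; a tie counts as adequate."""
    votes = [m.vote(message) for m in models]
    return 1 if 2 * sum(votes) > len(votes) else 0


# Hypothetical contexts, each with its own notion of what is unacceptable.
models = [
    ContextModel("forum", keyword_scorer({"idiot", "stupid"})),
    ContextModel("gaming", keyword_scorer({"noob", "idiot"})),
    ContextModel("news", keyword_scorer({"stupid", "liar"})),
]

print(ensemble_predict(models, "You are an idiot!"))  # two of three models flag it
print(ensemble_predict(models, "What a noob"))        # only one model flags it
```

In this sketch a message is labeled inadequate only when more than half of the context models flag it, so a term that is objectionable in a single context does not decide the outcome on its own. Weighted or probability-averaged voting would be an equally plausible reading of the abstract.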