ASSESSMENT OF THE EFFICIENCY OF THE SUCCESS PREDICTION MODEL USING MACHINE LEARNING METHODS
DOI:
https://doi.org/10.31891/2307-5732-2024-335-3-47
Keywords:
Logistic Regression, SVM, Random Forest, Machine Learning, Python, Scikit-learn
Abstract
The paper compares how accurately three machine learning methods (Logistic Regression, SVM, and Random Forest) predict student success from attendance. Classification quality was characterized by four metrics: sensitivity, specificity, accuracy, and balanced accuracy. Based on the obtained results, ROC curves were plotted to assess each classifier's ability to correctly recognize the positive class and reject the negative class as the decision threshold changes. For forecasting, data on user grades and attendance for the first and second semesters of the 2021-2022 academic year were exported from the Moodle database. In total, 1,308,262 records of subject grades and student attendance were exported to files. To process these files, an application was written in C# in the Microsoft Visual Studio development environment, using the CsvHelper library to process table columns. The predictive models were built in Python in the PyCharm development environment using the Scikit-learn library.
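As an illustration of the four metrics named above, the following minimal sketch derives them from a binary confusion matrix with Scikit-learn. The helper name `classification_metrics` and the toy labels are assumptions made for this example; they are not taken from the study's code or data.

```python
from sklearn.metrics import confusion_matrix


def classification_metrics(y_true, y_pred):
    """Return the four metrics for binary labels, where 1 is the positive class.

    Hypothetical helper for illustration. It assumes both classes occur in
    y_true; otherwise one of the divisions below would be by zero.
    """
    # For labels=[0, 1] the flattened matrix is ordered as tn, fp, fn, tp.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)   # share of actual positives recognized
    specificity = tn / (tn + fp)   # share of actual negatives rejected
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    balanced_accuracy = (sensitivity + specificity) / 2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
    }


# Toy labels invented for this example (1 = successful, 0 = not successful).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Balanced accuracy averages sensitivity and specificity, so it is less affected than plain accuracy when one class is much more frequent than the other.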
The calculations showed that, on the input data, the random forest method predicts success best and achieves the highest accuracy of the three methods. Random forest is an ensemble method that usually performs better when the relationships between the features and the target classes are complex or non-linear, or when there are many features. It estimates feature importance automatically and can make better predictions than linear models on complex data.
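A comparison of this kind can be sketched with Scikit-learn as shown below. This is not the study's pipeline: the data are synthetic (generated with `make_classification`), and the hyperparameters, the scaling step, and the train/test split are assumptions chosen only to keep the example small and runnable. The scores it prints say nothing about the Moodle data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real feature table (assumption, not study data).
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The three methods compared in the paper; all settings here are assumed.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
    }
    print(name, scores[name])

# Impurity-based importances of the fitted forest; they sum to 1.
importances = models["Random Forest"].feature_importances_
for index, importance in enumerate(importances):
    print(f"feature {index}: {importance:.3f}")
```

The linear model and the SVM are wrapped in a pipeline with `StandardScaler` because both are sensitive to feature scale, whereas tree ensembles are not.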
The obtained accuracy of 80% and AUC of 73% indicate the quality of the classification model and its good discriminating power. The ROC curve for the random forest method bows up and to the left and has a clearly defined area under the curve, which shows the effectiveness of the model. This result is a good enough starting point, but further research may require additional refinement to improve these indicators. It is also important to increase the amount of data and to add new features to obtain a more accurate result. Additional features can be obtained by creating additional plugins for Moodle.
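To make the relation between the ROC curve and the AUC concrete, the minimal sketch below computes both with Scikit-learn. The four labels and scores are toy values invented for this example, not results from the study; in practice the scores would be the predicted probabilities of the positive class, for instance from a classifier's `predict_proba`.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy ground-truth labels and predicted positive-class scores (assumed values).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Each threshold yields one (false positive rate, true positive rate) point.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# AUC computed directly from the scores, and again by integrating the curve.
auc_value = roc_auc_score(y_true, y_score)
auc_from_curve = auc(fpr, tpr)

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
print(f"AUC = {auc_value:.2f}")  # 0.75 for these toy values
```

An AUC of 0.5 corresponds to the diagonal of a random classifier and 1.0 to perfect separation, so the closer the curve bends toward the upper-left corner, the larger the area.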