A BERT-Based Classification Model: The Case of Russian Fairy Tales

Keywords: BERT model, fairy tales, text classification, neural networks

Abstract

Introduction: Automatic profiling and genre classification are crucial for text suitability assessment and as such have been in high demand in education, information retrieval, sentiment analysis, and machine translation for over a decade. Of all genres, fairy tales are one of the most challenging and valuable objects of study due to their heterogeneity and wide range of implicit idiosyncrasies. Traditional classification methods, including stylometric and parametric algorithms, are not only labour-intensive and time-consuming but also struggle to identify the corresponding classifying discriminants. Research in the area is scarce, and its findings remain controversial and debatable.

Purpose: Our study aims to fill this void by offering an algorithm that sorts Russian fairy tales into classes based on pre-set parameters. We present a BERT-based classification model for Russian fairy tales, test the hypothesis that BERT has potential for classifying Russian texts, and verify it on a representative corpus of 743 Russian fairy tales.

Method: We pre-train BERT on a collection of documents of three classes and fine-tune it for the specific application task. Focusing on tokenization and embedding design as the key components of BERT's text processing, the study also evaluates the standard benchmarks used to train classification models and analyzes complex cases, possible errors, and improvement algorithms, thus raising the classification models' accuracy. Model performance is evaluated using the loss function, prediction accuracy, precision, and recall.
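The tokenization mechanism mentioned above can be illustrated with a minimal sketch of BERT-style WordPiece tokenization (greedy longest-match over a subword vocabulary). The toy vocabulary below is hypothetical and for illustration only; real models such as cointegrated/rubert-tiny ship vocabularies of tens of thousands of subwords.

```python
# Minimal sketch of BERT-style WordPiece tokenization (greedy longest-match).
# VOCAB is a hypothetical toy vocabulary, not taken from any real model.
VOCAB = {"[UNK]", "сказ", "##ка", "##ки", "вол", "##шеб", "##ная"}

def wordpiece(word: str, vocab=VOCAB) -> list[str]:
    """Split a word into the longest subwords present in the vocabulary."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub          # continuation pieces carry '##'
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:                   # no subword matched: unknown token
            return ["[UNK]"]
        pieces.append(cur)
        start = end
    return pieces

print(wordpiece("сказка"))     # ['сказ', '##ка']
print(wordpiece("волшебная"))  # ['вол', '##шеб', '##ная']
```

In the full BERT pipeline these subword tokens are then mapped to vocabulary indices and combined with positional and segment embeddings before entering the transformer layers.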

Results: We validated BERT's potential for Russian text classification and its ability to enhance the performance and quality of existing NLP models. Our experiments with cointegrated/rubert-tiny, ai-forever/ruBert-base, and DeepPavlov/rubert-base-cased-sentence on different classification tasks demonstrate that our models achieve state-of-the-art results, with the best accuracy of 95.9% for cointegrated/rubert-tiny, outperforming the other two models by a good margin. The classification accuracy achieved is high enough to compete with human expertise.
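The evaluation metrics used to compare the models can be sketched in a few lines. The class names and predictions below are hypothetical, chosen only to show how accuracy, per-class precision, and per-class recall are computed for a three-class classifier.

```python
# Hedged sketch: accuracy, precision, and recall for a three-class classifier.
# The labels and predictions are hypothetical illustration data.
def evaluate(y_true, y_pred, labels):
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    prec, rec = {}, {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec[c] = tp / (tp + fp) if tp + fp else 0.0  # of predicted c, how many correct
        rec[c] = tp / (tp + fn) if tp + fn else 0.0   # of true c, how many found
    return accuracy, prec, rec

labels = ["animal", "magic", "everyday"]          # hypothetical class names
y_true = ["magic", "magic", "animal", "everyday", "animal"]
y_pred = ["magic", "animal", "animal", "everyday", "animal"]
acc, prec, rec = evaluate(y_true, y_pred, labels)
print(acc)              # 0.8
print(prec["animal"])   # 2/3, one false positive
print(rec["magic"])     # 0.5, one magic tale missed
```

Reporting precision and recall per class alongside overall accuracy matters here because fairy-tale classes can be imbalanced, and a high accuracy alone can mask poor recall on a minority class.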

Conclusion: The findings highlight the importance of fine-tuning for classification models. BERT demonstrates great potential for improving NLP technologies, contributing to the quality of automatic text analysis, and offering new opportunities for research and application in a wide range of areas, including the identification and arrangement of content-relevant texts of all types, thus supporting decision making. The designed and validated algorithm can be scaled to discourse as complex and ambiguous as fiction, improving our understanding of text-specific categories. Considerably bigger datasets are required for these purposes.


Published
2024-12-30
How to Cite
Solovyev, V., Solnyshkina, M., Ten, A., & Prokopyev, N. (2024). A BERT-Based Classification Model: The Case of Russian Fairy Tales. Journal of Language and Education, 10(4), 98-111. https://doi.org/10.17323/jle.2024.24030