A BERT-Based Classification Model: The Case of Russian Fairy Tales

Valery Solovyev; Marina Solnyshkina; Андрей Тен; Nikolai Prokopyev

doi:10.17323/jle.2024.24030

Valery Solovyev, Kazan Federal University, Kazan, Russia, https://orcid.org/0000-0003-4692-2564
Marina Solnyshkina, Kazan Federal University, Kazan, Russia, https://orcid.org/0000-0003-1885-3039
Андрей Тен, Nobilis.Team, Kazan, Russia
Nikolai Prokopyev, TAS Institute of Applied Semiotics, Kazan, Russia, https://orcid.org/0000-0003-0066-7465

Keywords: BERT model, fairy tales, text classification, neural networks

Abstract

Introduction: Automatic profiling and genre classification are crucial for assessing text suitability and have been in high demand in education, information retrieval, sentiment analysis, and machine translation for over a decade. Among genres, fairy tales are one of the most challenging and valuable objects of study because of their heterogeneity and wide range of implicit idiosyncrasies. Traditional classification methods, including stylometric and parametric algorithms, are not only labour-intensive and time-consuming but also struggle to identify suitable classifying discriminants. Research in the area is scarce, and its findings remain controversial and debatable.

Purpose: Our study aims to fill this gap and offers an algorithm for sorting Russian fairy tales into classes based on pre-set parameters. We present a new BERT-based classification model for Russian fairy tales, test the hypothesis that BERT has potential for classifying Russian texts, and verify it on a representative corpus of 743 Russian fairy tales.

Method: We pre-train BERT on a collection of three classes of documents and fine-tune it for a specific application task. Focusing on tokenization and embedding design as the key components of BERT’s text processing, the research also evaluates the standard benchmarks used to train classification models and analyzes complex cases, possible errors, and improvement algorithms, thus raising the classification models' accuracy. The models' performance is evaluated using the loss function, prediction accuracy, precision, and recall.
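The four evaluation measures named above can be illustrated with a minimal sketch in plain Python. The function names, the toy labels, and the choice of cross-entropy as the loss are assumptions made for illustration; the abstract does not state which loss function or averaging scheme the authors used.

```python
import math


def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true class (assumed loss function)."""
    eps = 1e-12  # guards against log(0)
    total = sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels))
    return -total / len(labels)


def accuracy(preds, labels):
    """Share of documents whose predicted class equals the true class."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def precision_recall(preds, labels, num_classes):
    """Per-class (precision, recall) pairs, computed one-vs-rest."""
    per_class = []
    for c in range(num_classes):
        tp = sum(p == c and y == c for p, y in zip(preds, labels))
        fp = sum(p == c and y != c for p, y in zip(preds, labels))
        fn = sum(p != c and y == c for p, y in zip(preds, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        per_class.append((precision, recall))
    return per_class


# Hypothetical toy data: six documents, three classes (0, 1, 2).
labels = [0, 0, 1, 1, 2, 2]
preds = [0, 0, 1, 2, 2, 2]

acc = accuracy(preds, labels)
per_class = precision_recall(preds, labels, num_classes=3)
```

With three classes, precision and recall are computed per class and can then be averaged (for example, a macro average) to give a single figure per model.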

Results: We validated BERT’s potential for Russian text classification and its ability to enhance the performance and quality of existing NLP models. Our experiments with cointegrated/rubert-tiny, ai-forever/ruBert-base, and DeepPavlov/rubert-base-cased-sentence on different classification tasks demonstrate that our models achieve state-of-the-art results, with the best accuracy of 95.9% obtained by cointegrated/rubert-tiny, which outperforms the other two models by a good margin. The classification accuracy achieved is high enough to compete with that of human experts.
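As a rough sketch of how the three checkpoints might be loaded for three-class classification, the following uses the Hugging Face `transformers` library. The helper names `load_classifier` and `predict_class`, the `max_length` value, and the hub identifier `ai-forever/ruBert-base` are assumptions; the abstract does not describe the authors' actual training code or hyperparameters, and the classification head created here is untrained until fine-tuned.

```python
# Checkpoints named in the abstract (hub identifiers assumed).
MODEL_NAMES = [
    "cointegrated/rubert-tiny",
    "ai-forever/ruBert-base",
    "DeepPavlov/rubert-base-cased-sentence",
]
NUM_LABELS = 3  # the abstract mentions three classes of documents


def load_classifier(model_name, num_labels=NUM_LABELS):
    """Load a tokenizer and a BERT encoder with a fresh classification head."""
    # Imported lazily so the constants above can be used without transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )
    return tokenizer, model


def predict_class(text, tokenizer, model, max_length=512):
    """Return the index of the highest-scoring class for one text."""
    import torch

    # Long tales are truncated to the model's input window (assumed 512 tokens).
    inputs = tokenizer(
        text, truncation=True, max_length=max_length, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1).item())
```

Calling `load_classifier(MODEL_NAMES[0])` downloads the smallest checkpoint; predictions become meaningful only after the head has been fine-tuned on labelled tales.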

Conclusion: The findings highlight the importance of fine-tuning for classification models. BERT demonstrates great potential for improving NLP technologies and the quality of automatic text analysis, and it offers new opportunities for research and application in a wide range of areas, including the identification and arrangement of all types of content-relevant texts, thereby supporting decision making. The designed and validated algorithm can be scaled to classify discourse as complex and ambiguous as fiction, thus improving our understanding of text-specific categories. Considerably larger datasets are required for these purposes.
