Automatic Morpheme Segmentation for Russian: Can an Algorithm Replace Experts?

Dmitry Morozov; Timur Garipov; Olga Lyashevskaya; Svetlana Savchuk; Boris Iomdin; Anna Glazkova

doi:10.17323/jle.2024.22237

Dmitry Morozov, Novosibirsk State University, Novosibirsk, Russia, https://orcid.org/0000-0003-4464-1355
Timur Garipov, Novosibirsk State University, Novosibirsk, Russia, https://orcid.org/0009-0008-4527-2268
Olga Lyashevskaya, HSE University, Moscow, Russia; Vinogradov Russian Language Institute, Russian Academy of Sciences, Moscow, Russia, https://orcid.org/0000-0001-8374-423X
Svetlana Savchuk, Vinogradov Russian Language Institute, Russian Academy of Sciences, Moscow, Russia, https://orcid.org/0000-0003-0464-7269
Boris Iomdin, independent researcher, https://orcid.org/0000-0002-1767-5480
Anna Glazkova, University of Tyumen, Tyumen, Russia, https://orcid.org/0000-0001-8409-6457


Keywords: automatic morpheme segmentation, Russian language morphology, dictionary expansion, morphological analysis, natural language processing, expert-level performance, convolutional neural networks

Abstract

Introduction: Numerous algorithms have been proposed for automatic morpheme segmentation of Russian words. Because they differ in task formulation and in the datasets used, their quality is difficult to compare. It is also unclear whether model errors stem from weaknesses of the algorithms themselves or from errors and inconsistencies in the morpheme dictionaries. It therefore remains uncertain whether any algorithm can be used to expand existing morpheme dictionaries automatically.

Purpose: To compare existing morpheme segmentation algorithms for Russian and to assess their applicability to the automatic expansion of existing morpheme dictionaries.

Results: We compared several state-of-the-art machine learning algorithms on three datasets built around different segmentation paradigms. Two experiments were carried out, each using five-fold cross-validation. In the first experiment, we randomly partitioned the dataset into five subsets. In the second, we excluded words containing multiple roots and partitioned the remaining words so that all words sharing a root fell into the same subset. In each fold, models were trained on four subsets and evaluated on the remaining one. In both experiments, the algorithms based on ensembles of convolutional neural networks consistently performed best. However, accuracy declined notably on words containing unfamiliar roots. We also found that, on a randomly selected set of words, the performance of these algorithms was comparable to that of human experts.
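The two partitioning schemes described above can be illustrated with a minimal sketch. It is not the authors' code: the entry format `(word, [(morpheme, type), ...])`, the `ROOT`/`SUFF`/`END`/`LINK` labels, the toy word list, and the function names are all assumptions made for illustration, and the greedy balancing of root groups is just one simple way to keep fold sizes similar.

```python
import random
from collections import defaultdict

# Hypothetical toy entries: (word, [(morpheme, type), ...]).
# The segmentations and labels are illustrative, not taken from any dictionary.
ENTRIES = [
    ("вода", [("вод", "ROOT"), ("а", "END")]),
    ("водный", [("вод", "ROOT"), ("н", "SUFF"), ("ый", "END")]),
    ("лес", [("лес", "ROOT")]),
    ("лесной", [("лес", "ROOT"), ("н", "SUFF"), ("ой", "END")]),
    ("дом", [("дом", "ROOT")]),
    ("домик", [("дом", "ROOT"), ("ик", "SUFF")]),
    ("сад", [("сад", "ROOT")]),
    ("садик", [("сад", "ROOT"), ("ик", "SUFF")]),
    ("гора", [("гор", "ROOT"), ("а", "END")]),
    ("горный", [("гор", "ROOT"), ("н", "SUFF"), ("ый", "END")]),
    ("водопад", [("вод", "ROOT"), ("о", "LINK"), ("пад", "ROOT")]),
]


def random_folds(entries, k=5, seed=0):
    """First scheme: shuffle the entries and deal them into k folds."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def root_grouped_folds(entries, k=5, seed=0):
    """Second scheme: words sharing a root go to the same fold.

    Words with more than one root are excluded, as in the abstract.
    """
    groups = defaultdict(list)
    for word, morphemes in entries:
        roots = [m for m, kind in morphemes if kind == "ROOT"]
        if len(roots) == 1:
            groups[roots[0]].append((word, morphemes))
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    folds = [[] for _ in range(k)]
    for key in keys:
        # Greedy balancing (an assumption): add the group to the smallest fold.
        min(folds, key=len).extend(groups[key])
    return folds


def train_test_splits(folds):
    """Yield (train, test) pairs: train on k-1 folds, test on the held-out one."""
    for i, test in enumerate(folds):
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in train_test_splits(root_grouped_folds(ENTRIES)):
        print(len(train), "train /", len(test), "test:", [w for w, _ in test])
```

With the root-grouped scheme, every test word has a root the model never saw during training, which is the setting in which the abstract reports a drop in accuracy. Where scikit-learn is available, its `GroupKFold` splitter provides a comparable grouped split.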

Conclusion: Our results indicate that although automatic methods have, on average, reached a quality close to expert level, they do not take semantics into account and therefore cannot be used for automatic dictionary expansion without expert validation. Further research should address the two key issues identified: poor performance on words with unknown roots and on acronyms. When the test dataset can be assumed to contain few unfamiliar roots, an ensemble of convolutional neural networks is the recommended choice. The presented results can be used in the development of morpheme-oriented tokenizers and of systems for analyzing text complexity.


