Wrong Answers Only: Distractor Generation for Russian Reading Comprehension Questions Using a Translated Dataset

Nikita Vyacheslavovich Login

doi:10.17323/jle.2024.22244

Nikita Vyacheslavovich Login, HSE University, Moscow, Russia, https://orcid.org/0009-0007-2480-8708

Keywords: automatic distractor generation, multiple-choice questions, reading comprehension, large language model, dataset translation

Abstract

Introduction: Reading comprehension questions play a key role in language learning. Multiple-choice questions are a convenient format for assessing reading comprehension because they can easily be graded automatically. The availability of large reading comprehension datasets also makes it possible to generate such questions automatically by fine-tuning language models, which reduces the cost of developing test item banks. While reading comprehension datasets are abundant for English, this is not the case for other languages, including Russian. Distractor generation is a challenging task, since it requires producing several plausible but incorrect answer options.

Purpose: This work aims to develop an effective distractor generation solution for exam-style reading comprehension questions in Russian and to determine whether a dataset translated from English can provide a basis for such a solution.

Method: We fine-tuned two pre-trained Russian language models, RuT5 and RuGPT3 (Zmitrovich et al., 2024), to generate distractors for two types of questions extracted from a large multiple-choice dataset that was automatically translated from English into Russian. The first type asks for the best title for a given text; the second asks the reader to select true or false statements. The models were evaluated automatically on the test and validation subsets, and the models that generate distractors for true-statement questions were additionally tested on an independent set of questions from the Russian Unified State Exam (ЕГЭ).
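
As a rough illustration of the data preparation this setup implies, the sketch below serializes one multiple-choice item into a source/target text pair of the kind a text-to-text model such as RuT5 could be fine-tuned on. The `McqItem` fields, the `to_seq2seq_pair` helper, the prompt layout, and the separator are all assumptions made for this example; the abstract does not specify the actual input format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class McqItem:
    """One multiple-choice item; all field names are hypothetical."""
    passage: str
    question: str
    answer: str
    distractors: List[str]


def to_seq2seq_pair(item: McqItem, sep: str = " ; ") -> Tuple[str, str]:
    """Build a (source, target) text pair for fine-tuning a text-to-text model.

    The prompt layout and the separator are assumptions, not the paper's format.
    """
    source = (
        f"question: {item.question} "
        f"answer: {item.answer} "
        f"context: {item.passage}"
    )
    # All distractors are joined into a single target string.
    target = sep.join(item.distractors)
    return source, target


# Toy item invented for illustration (title-selection question type).
item = McqItem(
    passage="The city opened a new library with a reading room for children.",
    question="What is the best title for the text?",
    answer="A new library opens",
    distractors=["A school closes", "A history of printing", "Children and sports"],
)
source, target = to_seq2seq_pair(item)
print(source)
print(target)
```

A decoder-only model such as RuGPT3 would more plausibly be trained on the source and target concatenated into one sequence; that variant is not shown here.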

Results: The fine-tuned models outperformed the non-fine-tuned baseline, and RuT5 performed better than RuGPT3. The models handled true-statement selection questions better than title selection questions. Models trained on the translated dataset achieved higher quality than models trained on an existing Russian-language distractor dataset. The T5-based model also beat a baseline built from the translated output of an existing English distractor generation model.

Conclusion: The results demonstrate that a translated dataset can be used for distractor generation and highlight the importance of matching both the domain (language exam) and the question type in the input data.

References

Alsubait, T. M. (2015). Ontology-based multiple-choice question generation [Unpublished PhD thesis]. University of Manchester.

Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgements. In J. Goldstein, A. Lavie, C.-Y. Lin, & C. Voss (Eds.), Proceedings of the ACL Workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization (pp. 65-72). Association for Computational Linguistics.

Belyanova, M. A., Andreev, A. M., & Gapanyuk, Y. E. (2022). Neural text question generation for Russian language using hybrid intelligent information systems approach. In B. Kryzhanovsky, W. Dunin-Barkowski, V. Redko, Y. Tiumentsev, & V. V. Klimov (Eds.), Advances in neural computation, machine learning, and cognitive research V (vol. 1008, pp. 217-223). Springer International Publishing. DOI: https://doi.org/10.1007/978-3-030-91581-0_29

Bitew, S. K., Hadifar, A., Sterckx, L., Deleu, J., Develder, C., & Demeester, T. (2022). Learning to reuse distractors to support multiple choice question generation in education. IEEE Transactions on Learning Technologies, 17, 375-390. IEEE Computer Society Press. DOI: https://doi.org/10.1109/TLT.2022.3226523

Bitew, S. K., Deleu, J., Develder, C., & Demeester, T. (2023). Distractor generation for multiple-choice questions with predictive prompting and large language models (Version 1). arXiv. DOI: https://doi.org/10.48550/arXiv.2307.16338

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., & Amodei, D. (2020). Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.), Advances in Neural Information Processing Systems (vol. 33, pp. 1877-1901). Curran Associates, Inc. DOI: https://doi.org/10.48550/arXiv.2005.14165

Chung, H.-L., Chan, Y.-H., & Fan, Y.-C. (2020). A BERT-based distractor generation scheme with multi-tasking and negative answer training strategies. In T. Cohn, Y. He, & Y. Liu (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 4390-4400). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/2020.findings-emnlp.393

De-Fitero-Dominguez, D., Garcia-Lopez, E., Garcia-Cabot, A., Del-Hoyo-Gabaldon, J.-A., & Moreno-Cediel, A. (2024). Distractor generation through text-to-text transformer models. IEEE Access, 12, 25580-25589. DOI: https://doi.org/10.1109/ACCESS.2024.3361673

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, & T. Solorio (Eds.), Proceedings of the 2019 conference of the North American Chapter of the Association for Computational Linguistics: Human language technologies (vol. 1: Long and Short Papers, pp. 4171-4186). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/N19-1423

Efimov, P., Chertok, A., Boytsov, L., & Braslavski, P. (2020). SberQuAD - Russian reading comprehension dataset: Description and analysis. In A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, H. Joho, C. Lioma, C. Eickhoff, A. Névéol, L. Cappellato, & N. Ferro (Eds.), Experimental IR meets multilinguality, multimodality, and interaction (vol. 12260, pp. 3-15). Springer International Publishing. DOI: https://doi.org/10.1007/978-3-030-58219-7_1

Elkins, S., Kochmar, E., Serban, I., & Cheung, J. C. K. (2023). How useful are educational questions generated by large language models? In N. Wang, G. Rebolledo-Mendez, V. Dimitrova, N. Matsuda, & O. C. Santos (Eds.), Artificial intelligence in education. Posters and late breaking results, workshops and tutorials, industry and innovation tracks, practitioners, doctoral consortium and blue sky (vol. 1831, pp. 536-542). Springer Nature Switzerland. DOI: https://doi.org/10.1007/978-3-031-36336-8_83

Fenogenova, A., Mikhailov, V., & Shevelev, D. (2020). Read and reason with MuSeRC and RuCoS: Datasets for machine reading comprehension for Russian. In D. Scott, N. Bel, & C. Zong (Eds.), Proceedings of the 28th International Conference on Computational Linguistics (pp. 6481-6497). International Committee on Computational Linguistics. DOI: https://doi.org/10.18653/v1/2020.coling-main.570

Gao, Y., Bing, L., Li, P., King, I., & Lyu, M. R. (2019). Generating distractors for reading comprehension questions from real examinations. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01), 6423-6430. DOI: https://doi.org/10.1609/aaai.v33i01.33016423

Ghanem, B., & Fyshe, A. (2024). DISTO: Textual distractors for multiple choice reading comprehension questions using negative sampling. In M. Marras & M. Ueno (Eds.), Proceedings of the 17th International Conference on Educational Data Mining (pp. 23-34). International Educational Data Mining Society. DOI: https://doi.org/10.5281/ZENODO.12729766

Glushkova, T., Machnev, A., Fenogenova, A., Shavrina, T., Artemova, E., & Ignatov, D. I. (2021). DaNetQA: A yes/no question answering dataset for the Russian language. In W. M. P. Van Der Aalst, V. Batagelj, D. I. Ignatov, M. Khachay, O. Koltsova, A. Kutuzov, S. O. Kuznetsov, I. A. Lomazova, N. Loukachevitch, A. Napoli, A. Panchenko, P. M. Pardalos, M. Pelillo, A. V. Savchenko, & E. Tutubalina (Eds.), Analysis of Images, Social Networks and Texts (vol. 12602, pp. 57-68). Springer International Publishing. DOI: https://doi.org/10.1007/978-3-030-72610-2_4

Hadifar, A., Bitew, S. K., Deleu, J., Develder, C., & Demeester, T. (2023). EduQG: A multi-format multiple-choice dataset for the educational domain. IEEE Access, 11, 20885-20896. DOI: https://doi.org/10.1109/ACCESS.2023.3248790

Huang, L., Le Bras, R., Bhagavatula, C., & Choi, Y. (2019). CosmosQA: Machine reading comprehension with contextual commonsense reasoning. In K. Inui, J. Jiang, V. Ng, & X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 2391-2401). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D19-1243

Joshi, M., Choi, E., Weld, D., & Zettlemoyer, L. (2017). TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In R. Barzilay & M.-Y. Kan (Eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (vol. 1: Long Papers, pp. 1601-1611). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P17-1147

Kurdi, G., Leo, J., Parsia, B., Sattler, U., & Al-Emari, S. (2020). A systematic review of automatic question generation for educational purposes. International Journal of Artificial Intelligence in Education, 30(1), 121-204. DOI: https://doi.org/10.1007/s40593-019-00186-y

Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., Toutanova, K., Jones, L., Kelcey, M., Chang, M.-W., Dai, A. M., Uszkoreit, J., Le, Q., & Petrov, S. (2019). Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7, 453-466. DOI: https://doi.org/10.1162/tacl_a_00276

Lai, G., Xie, Q., Liu, H., Yang, Y., & Hovy, E. (2017). RACE: Large-scale reading comprehension dataset from examinations. In M. Palmer, R. Hwa, & S. Riedel (Eds.), Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 785-794). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D17-1082

Lee, D. B., Lee, S., Jeong, W. T., Kim, D., & Hwang, S. J. (2020). Generating diverse and consistent QA pairs from contexts with information-maximizing hierarchical conditional VAEs. In D. Jurafsky, J. Chai, N. Schluter, & J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 208-224). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/2020.acl-main.20

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., & Zettlemoyer, L. (2020). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In D. Jurafsky, J. Chai, N. Schluter, & J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 7871-7880). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/2020.acl-main.703

Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text summarization branches out (pp. 74-81). Association for Computational Linguistics. https://aclanthology.org/W04-1013.

Lu, X., West, P., Zellers, R., Bras, R. L., Bhagavatula, C., & Choi, Y. (2021). NeuroLogic decoding: (Un)supervised neural text generation with predicate logic constraints. In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, & Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human language technologies (pp. 4288-4299). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/2021.naacl-main.339

Maity, S., Deroy, A., & Sarkar, S. (2024). A novel multi-stage prompting approach for language agnostic MCQ generation using GPT. In N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, & I. Ounis (Eds.), Advances in information retrieval (vol. 14610, pp. 268-277). Springer Nature Switzerland. DOI: https://doi.org/10.1007/978-3-031-56063-7_18

Makhnytkina, O., Matveev, A., Svischev, A., Korobova, P., Zubok, D., Mamaev, N., & Tchirkovskii, A. (2020). Conversational question generation in Russian. In S. Balandin, L. Turchet, & T. Tyutina (Eds.), 2020 27th Conference of Open Innovations Association (FRUCT) (pp. 1-8). IEEE. DOI: https://doi.org/10.23919/FRUCT49677.2020.9211056

Manakul, P., Liusie, A., & Gales, M. (2023). MQAG: Multiple-choice question answering and generation for assessing information consistency in summarization. In J. C. Park, Y. Arase, B. Hu, W. Lu, D. Wijaya, A. Purwarianti, & A. A. Krisnadhi (Eds.), Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific chapter of the Association for Computational Linguistics (vol. 1: Long Papers, pp. 39-53). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/2023.ijcnlp-main.4

Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In The 40th Annual Meeting on Association for Computational Linguistics-ACL '02 (pp. 311-318). Association for Computational Linguistics. DOI: https://doi.org/10.3115/1073083.1073135

Paris, A. H., & Paris, S. G. (2003). Assessing narrative comprehension in young children. Reading Research Quarterly, 38(1), 36-76. DOI: https://doi.org/10.1598/RRQ.38.1.3

Qiu, Z., Wu, X., & Fan, W. (2020). Automatic distractor generation for multiple choice questions in standard tests. In D. Scott, N. Bel, & C. Zong (Eds.), Proceedings of the 28th International Conference on Computational Linguistics (pp. 2096-2106). International Committee on Computational Linguistics. DOI: https://doi.org/10.18653/v1/2020.coling-main.189

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1), 5485-5551. https://dl.acm.org/doi/abs/. DOI: https://doi.org/10.5555/3455716.3455856

Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. In J. Su, K. Duh, & X. Carreras (Eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 2383-2392). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D16-1264

Reddy, S., Chen, D., & Manning, C. D. (2019). CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7, 249-266. DOI: https://doi.org/10.1162/tacl_a_00266

Rybin, I., Korablinov, V., Efimov, P., & Braslavski, P. (2021). RuBQ 2.0: An innovated Russian question answering dataset. In R. Verborgh, K. Hose, H. Paulheim, P.-A. Champin, M. Maleshkova, O. Corcho, P. Ristoski, & M. Alam (Eds.), The Semantic Web (vol. 12731, pp. 532-547). Springer International Publishing. DOI: https://doi.org/10.1007/978-3-030-77385-4_32

Sekulić, I., Aliannejadi, M., & Crestani, F. (2021). Towards facet-driven generation of clarifying questions for conversational search. In Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval (pp. 167-175). Association for Computing Machinery. DOI: https://doi.org/10.1145/3471158.3472257

Shavrina, T., Emelyanov, A., Fenogenova, A., Fomin, V., Mikhailov, V., Evlampiev, A., Malykh, V., Larin, V., Natekin, A., Vatulin, A., Romov, P., Anastasiev, D., Zinov, N., & Chertok, A. (2020, May). Humans keep it one hundred: An overview of AI Journey. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 2276-2284). European Language Resources Association. https://aclanthology.org/2020.lrec-1.277.

Tiedemann, J., & Thottingal, S. (2020). OPUS-MT - Building open translation services for the world. In A. Martins, H. Moniz, S. Fumega, B. Martins, F. Batista, L. Coheur, C. Parra, I. Trancoso, M. Turchi, A. Bisazza, J. Moorkens, A. Guerberof, M. Nurminen, L. Marg, & M. L. Forcada (Eds.), Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (pp. 479-480). European Association for Machine Translation. https://aclanthology.org/2020.eamt-1.61.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing Systems (vol. 30, pp. 6000-6010). Curran Associates, Inc. https://dl.acm.org/doi/. DOI: https://doi.org/10.5555/3295222.3295349

Welbl, J., Liu, N. F., & Gardner, M. (2017). Crowdsourcing multiple choice science questions. In L. Derczynski, W. Xu, A. Ritter, & T. Baldwin (Eds.), Proceedings of the 3rd Workshop on Noisy User-generated Text (pp. 94-106). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W17-4413

Xiao, D., Zhang, H., Li, Y., Sun, Y., Tian, H., Wu, H., & Wang, H. (2020). ERNIE-GEN: An enhanced multi-flow pre-training and fine-tuning framework for natural language generation. In C. Bessiere (Ed.), Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (pp. 3997-4003). International Joint Conferences on Artificial Intelligence Organization. DOI: https://doi.org/10.24963/ijcai.2020/553

Xu, Y., Wang, D., Yu, M., Ritchie, D., Yao, B., Wu, T., Zhang, Z., Li, T., Bradford, N., Sun, B., Hoang, T., Sang, Y., Hou, Y., Ma, X., Yang, D., Peng, N., Yu, Z., & Warschauer, M. (2022). Fantastic questions and where to find them: FairytaleQA - An authentic dataset for narrative comprehension. In S. Muresan, P. Nakov, & A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (vol. 1: Long Papers, pp. 447-460). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/2022.acl-long.34

Xue, L., Constant, N., Roberts, A., Kale, N., Al-Rfou, R., Siddhant, A., Barua, A., & Raffel, C. (2020). mT5: A massively multilingual pre-trained text-to-text transformer (Version 3). arXiv. DOI: https://doi.org/10.48550/arXiv.2010.11934

Zhang, C. (2023). Automatic generation of multiple-choice questions (Version 1). arXiv. DOI: https://doi.org/10.48550/ARXIV.2303.14576

Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2019). BERTScore: Evaluating text generation with BERT (Version 3). arXiv. DOI: https://doi.org/10.48550/ARXIV.1904.09675

Zmitrovich, D., Abramov, A., Kalmykov, A., Tikhonova, M., Taktasheva, E., Astafurov, D., Baushenko, M., Snegirev, A., Kadulin, V., Markov, S., Shavrina, T., Mikhailov, V., & Fenogenova, A. (2024). A family of pretrained transformer language models for Russian. In N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, & N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 507-524). ELRA Language Resource Association. DOI: https://doi.org/10.48550/arXiv.2309.10931