Topic Modeling for Text Structure Assessment: The case of Russian Academic Texts

Valery Solovyev; Marina Solnyshkina; Elena Tutubalina

doi:10.17323/jle.2023.16604

Valery Solovyev, Kazan Federal University, Kazan, Russia, https://orcid.org/0000-0002-7780-9823
Marina Solnyshkina, Kazan Federal University, Kazan, Russia, https://orcid.org/0000-0003-1885-3039
Elena Tutubalina, Ivannikov Institute for System Programming of the RAS, Moscow, Russia, https://orcid.org/0000-0001-7936-0284

Keywords: text structure, topic modeling, school textbooks, text complexity, Russian language

Abstract

Background: Automatic assessment of text complexity is an important task, primarily in education. Existing methods of computing text complexity rely on simple surface properties of a text and neglect the complexity of its content and structure. The current paradigm of complexity studies therefore cannot meet the challenge of automatically evaluating text structure.

Purpose: The aim of the paper is twofold. (1) It introduces a new notion, the complexity of a text's topical structure, which we define as a quantifiable measure combining four parameters: number of topics, topic coherence, topic distribution, and topic weight. We hypothesize that these parameters depend on text complexity and are aligned with grade level. (2) It seeks to justify the applicability of recently developed topic modeling methods to measuring the complexity of a text's topical structure.

Method: To test this hypothesis, we use the Russian Academic Corpus, which comprises school textbooks, texts for Russian as a foreign language, and fiction recommended for reading in different grades, in three versions: (i) Full Texts Corpus, (ii) Corpus of Segments, and (iii) Corpus of Paragraphs. The software tools we use are LDA (Latent Dirichlet Allocation), OnlineLDA, and Additive Regularization of Topic Models (ARTM), together with a Word2vec-based metric and Normalized Pointwise Mutual Information (NPMI).
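
As an illustration of the mutual-information measure named in the Method paragraph, the following is a minimal sketch of a topic coherence score based on normalized pointwise mutual information (NPMI). The abstract does not say how the authors estimate probabilities, so document-level co-occurrence counts are an assumption here, and the function name `npmi_coherence` and the toy corpus are hypothetical.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Average NPMI over all pairs of a topic's top words.

    Assumption (not from the paper): p(w) is the share of documents
    containing w, and p(w1, w2) is the share containing both words.
    """
    if len(top_words) < 2:
        raise ValueError("need at least two top words")
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        hits = sum(1 for d in doc_sets if all(w in d for w in words))
        return hits / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0.0:
            # The words never co-occur: NPMI takes its lower bound.
            scores.append(-1.0)
        elif p12 == 1.0:
            # Both words occur in every document, so NPMI is 0/0;
            # scoring the pair as 0 is a convention chosen for this sketch.
            scores.append(0.0)
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Toy tokenized corpus (hypothetical data, for illustration only).
docs = [
    ["cell", "nucleus", "membrane"],
    ["cell", "nucleus"],
    ["river", "delta"],
    ["river", "delta", "cell"],
]
print(npmi_coherence(["river", "delta"], docs))
print(npmi_coherence(["nucleus", "river"], docs))
```

Scores lie between -1 and 1, with higher values for topics whose top words tend to occur in the same documents. In practice the counts would come from a large reference corpus, often with a sliding window rather than whole documents.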

Results: Our findings are as follows: the optimal number of topics in educational texts is around 20; topic coherence and topic distribution are functions of grade-level complexity; and text complexity can be estimated from structural organization parameters, an approach that complements the classical assessment of text complexity based on linguistic features.

Conclusion: The results reported and discussed in the article strongly suggest that the theoretical framework and the analytic algorithms used in the study can be fruitfully applied in education and provide a basis for assessing the complexity of academic texts.
