Topic Modeling for Text Structure Assessment: The Case of Russian Academic Texts
Background: Automatic assessment of text complexity is an important task, primarily in education. Existing methods for computing text complexity rely on simple surface properties of a text and neglect the complexity of its content and structure. The current paradigm of complexity studies therefore cannot meet the challenge of automatically evaluating text structure.
Purpose: The aim of the paper is twofold. (1) It introduces a new notion, the complexity of a text's topical structure, which we define as a quantifiable measure combining four parameters: number of topics, topic coherence, topic distribution, and topic weight. We hypothesize that these parameters vary with text complexity and are aligned with grade level. (2) It seeks to justify the applicability of recently developed topic modeling methods to measuring the complexity of a text's topical structure.
Method: To test this hypothesis, we use the Russian Academic Corpus, which comprises school textbooks, texts for learners of Russian as a foreign language, and fiction recommended for reading in different grades. We use the corpus in three versions: (i) Full Texts Corpus, (ii) Corpus of Segments, (iii) Corpus of Paragraphs. The topic modeling tools we apply are LDA (Latent Dirichlet Allocation), OnlineLDA, and Additive Regularization of Topic Models (ARTM), together with a Word2vec-based metric and Normalized Pointwise Mutual Information (NPMI).
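As an illustration of the mutual-information metric named in the Method (NPMI, commonly expanded as normalized pointwise mutual information), the following is a minimal sketch of how the coherence of one topic can be scored. It is not the implementation used in the study: the function name `npmi_coherence`, the toy corpus, and the choice of document-level co-occurrence (rather than a sliding window) are assumptions made for illustration only.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Average NPMI over all pairs of a topic's top words.

    Assumption: probabilities are estimated from document-level
    co-occurrence, i.e. the share of documents containing the word(s).
    """
    if len(top_words) < 2:
        raise ValueError("need at least two top words")
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Share of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p12 = prob(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)  # convention: the words never co-occur
        elif p12 == 1.0:
            scores.append(1.0)   # limiting value: the words always co-occur
        else:
            pmi = math.log(p12 / (prob(w1) * prob(w2)))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Hypothetical toy corpus: each document is a list of tokens.
toy_docs = [["cell", "plant"], ["cell", "plant"], ["war"], ["war"]]
coherent = npmi_coherence(["cell", "plant"], toy_docs)
incoherent = npmi_coherence(["cell", "war"], toy_docs)
```

Scores lie between -1 and 1. Words that always appear together push the score toward 1, and words that never share a document push it toward -1, so a higher average indicates a more coherent topic.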
Results: Our findings are as follows: the optimal number of topics in educational texts is around 20; topic coherence and topic distribution are functions of grade-level complexity; and text complexity can be estimated from structural organization parameters, an approach that complements the classical assessment of text complexity based on linguistic features.
Conclusion: The results strongly suggest that the theoretical framework and the analytic algorithms used in the study could be fruitfully applied in education and provide a basis for assessing the complexity of academic texts.
Copyright (c) 2023 National Research University Higher School of Economics
This work is licensed under a Creative Commons Attribution 4.0 International License.