Predictions of Multilevel Linguistic Features to Readability of Hong Kong Primary School Textbooks: A Machine Learning Based Exploration
Abstract
Introduction: Readability formulas are crucial for identifying suitable texts for children's reading development. Traditional formulas, however, are linear models designed for alphabetic languages and struggle with numerous predictors.
Purpose: To develop advanced readability formulas for Chinese texts using machine-learning algorithms that can handle hundreds of predictors. These are also the first readability formulas developed in Hong Kong.
Method: The corpus comprised 723 texts from 72 Chinese language arts textbooks used in public primary schools. The study considered 274 linguistic features at the character, word, syntax, and discourse levels as predictor variables. The outcome variables were the publisher-assigned semester scale and the teacher-rated readability level. Models were trained on fifteen combinations of these linguistic features using Support Vector Machine (SVM) and Random Forest (RF) algorithms. Model performance was evaluated by prediction accuracy and by the mean absolute error between predicted and actual readability. For both publisher-assigned and teacher-rated readability, the all-level-feature-RF and character-level-feature-RF models performed the best. The top 10 predictive features of these two optimal models were analyzed.
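The abstract does not state how the fifteen feature combinations were formed. One reading consistent with the count is every non-empty subset of the four feature levels (2^4 - 1 = 15). The minimal Python sketch below enumerates those subsets under that assumption and shows the two evaluation measures named in the Method, accuracy and mean absolute error. The function names and the five sample readability levels are hypothetical and are not taken from the study.

```python
from itertools import combinations

# The four feature levels named in the abstract.
LEVELS = ["character", "word", "syntax", "discourse"]


def feature_level_combinations(levels):
    """Return every non-empty subset of the feature levels, smallest first.

    Assumption: the study's fifteen combinations are these subsets.
    """
    return [
        combo
        for size in range(1, len(levels) + 1)
        for combo in combinations(levels, size)
    ]


def accuracy(actual, predicted):
    """Share of texts whose predicted level equals the actual level."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)


def mean_absolute_error(actual, predicted):
    """Average distance, in levels, between predicted and actual readability."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


combos = feature_level_combinations(LEVELS)
print(len(combos))  # 15

# Hypothetical readability levels for five texts (illustrative only).
actual = [1, 2, 3, 4, 5]
predicted = [1, 2, 4, 4, 3]
print(accuracy(actual, predicted))             # 0.6
print(mean_absolute_error(actual, predicted))  # 0.6
```

Reporting both measures is useful because accuracy treats every miss alike, whereas mean absolute error shows how many levels off a wrong prediction is.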
Results: For both the publisher-assigned and the teacher-rated (subjective) readability measures, the all-RF and character-RF models performed the best. The feature importance analyses of these two optimal models highlight the significance of character learning sequences, character frequency, and word frequency in estimating text readability in the Hong Kong Chinese context. In addition, the findings suggest that publishers might rely on diverse information sources to assign semesters, whereas teachers likely prefer indices that can be derived directly from the texts themselves to gauge readability levels.
Conclusion: The findings highlight the importance of character-level features, particularly the timing of a character's introduction in the textbook, in predicting text readability in the Hong Kong Chinese context.
Copyright (c) 2024 National Research University Higher School of Economics
This work is licensed under a Creative Commons Attribution 4.0 International License.