Fighting Evaluation Inflation: Concentrated Datasets for Grammatical Error Correction

Vladimir Starchenko; Darya Kharlamova; Elizaveta Klykova; Anastasia Shavrina; Aleksey Starchenko; Olga Vinogradova; Olga Lyashevskaya

doi:10.17323/jle.2024.22272

Vladimir Starchenko, HSE University, Moscow, Russia. ORCID: https://orcid.org/0009-0004-6638-9124
Darya Kharlamova, HSE University, Moscow, Russia. ORCID: https://orcid.org/0009-0007-5747-9525
Elizaveta Klykova, independent researcher. ORCID: https://orcid.org/0009-0005-9160-2553
Anastasia Shavrina, HSE University, Moscow, Russia. ORCID: https://orcid.org/0009-0002-2435-7314
Aleksey Starchenko, HSE University, Moscow, Russia. ORCID: https://orcid.org/0000-0003-1650-7597
Olga Vinogradova, independent researcher. ORCID: https://orcid.org/0000-0001-5928-1482
Olga Lyashevskaya, HSE University, Moscow, Russia; Vinogradov Russian Language Institute, Russian Academy of Sciences, Moscow, Russia. ORCID: https://orcid.org/0000-0001-8374-423X

Keywords: Grammatical error correction, L2 errors, ESL, concentrated datasets, cross-sentence GEC

Abstract

Background: Grammatical error correction (GEC) systems have advanced considerably over the past decade. According to common metrics, they often match or surpass human experts. Nevertheless, they perform poorly on several kinds of errors that humans correct effortlessly. Having reached their resolution limit, current evaluation algorithms and datasets can no longer guide further improvement of GEC systems.

Purpose: To address the resolution limit in GEC evaluation. The suggested approach is to evaluate systems on concentrated datasets, which contain a higher density of errors that modern GEC systems find difficult to handle.

Method: To test the suggested solution, we look at distant-context-sensitive errors, which have been acknowledged as challenging for GEC systems. We create a concentrated dataset for English with a higher density of errors of various types by semi-manually aggregating pre-annotated examples from four existing datasets and further expanding the annotation of distant-context-sensitive errors. Two GEC systems are evaluated on this dataset using both traditional scoring algorithms and a novel approach modified for longer contexts.
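
As a rough illustration of what concentrating a dataset could mean in practice, the Python sketch below filters a pool of annotated examples down to those that contain at least one target error type and compares the density of such errors before and after. The tag names, the `Example` class, and the toy sentences are hypothetical; the selection described in the paper was semi-manual and is not reproduced here.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical tag names chosen for illustration; the paper's tag set may differ.
TARGET_TYPES = {"pronoun", "verb_tense", "punctuation",
                "referential_device", "linking_device"}

@dataclass
class Example:
    text: str
    error_types: List[str]  # one tag per annotated error in the example

def target_density(examples: List[Example]) -> float:
    """Share of all annotated errors that belong to the target types."""
    total = sum(len(ex.error_types) for ex in examples)
    hits = sum(1 for ex in examples for t in ex.error_types if t in TARGET_TYPES)
    return hits / total if total else 0.0

def concentrate(examples: List[Example]) -> List[Example]:
    """Keep only examples that contain at least one target-type error."""
    return [ex for ex in examples
            if any(t in TARGET_TYPES for t in ex.error_types)]

# Toy pool of annotated learner sentences (invented for this sketch).
pool = [
    Example("He go home.", ["agreement"]),
    Example("Mary left. He was tired.", ["pronoun"]),
    Example("I like a apples, however I eat it.", ["article", "linking_device"]),
    Example("She have cat.", ["agreement", "article"]),
]
subset = concentrate(pool)
print(f"pool density:   {target_density(pool):.2f}")
print(f"subset density: {target_density(subset):.2f}")
```

The filter raises the share of target errors without changing any individual annotation, which is the property a concentrated evaluation set relies on.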

Results: The concentrated dataset includes 1,014 examples sampled manually from FCE, CoNLL-2014, BEA-2019, and REALEC. It is annotated for types of context-sensitive errors such as pronouns, verb tense, punctuation, referential device, and linking device. GEC systems show lower scores when evaluated on the dataset with a higher density of challenging errors, compared to a random dataset with otherwise the same parameters.
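
The comparison between a concentrated and a random dataset can be sketched with an edit-level F0.5 score, the metric commonly reported in GEC shared tasks such as CoNLL-2014 and BEA-2019. The abstract does not state which metrics were used, so the choice of F0.5, the edit representation `(start, end, replacement)`, and all data below are assumptions made for illustration only.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta from edit counts; beta=0.5 weights precision twice as much as recall."""
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def score(hyp_edits, gold_edits) -> float:
    """Corpus-level F0.5 over per-sentence sets of (start, end, replacement) edits."""
    tp = fp = fn = 0
    for hyp, gold in zip(hyp_edits, gold_edits):
        tp += len(hyp & gold)   # edits the system proposed that match the gold
        fp += len(hyp - gold)   # proposed edits not in the gold
        fn += len(gold - hyp)   # gold edits the system missed
    return f_beta(tp, fp, fn)

# Invented toy data: the system fixes both local errors in the random set
# but misses a cross-sentence pronoun error in the concentrated set.
random_gold = [{(1, 2, "goes")}, {(0, 1, "The")}]
random_hyp = [{(1, 2, "goes")}, {(0, 1, "The")}]
concentrated_gold = [{(3, 4, "She")}, {(5, 6, "had")}]
concentrated_hyp = [set(), {(5, 6, "had")}]

print(f"random:       {score(random_hyp, random_gold):.3f}")
print(f"concentrated: {score(concentrated_hyp, concentrated_gold):.3f}")
```

In this toy setup the same system scores lower on the concentrated set only because that set contains more of the errors it cannot handle, which mirrors the effect reported above.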

Conclusion: The lower scores registered on concentrated datasets confirm that such datasets provide a way to drive further improvement of GEC models. The dataset can be used in further studies focusing on distant-context-sensitive GEC.

References

Bentley, J. (1985). Programming pearls: A spelling checker. Communications of the ACM, 28(5), 456-462. DOI: https://doi.org/10.1145/3532.315102

Brockett, C., Dolan, B., & Gamon, M. (2006). Correcting ESL errors using phrasal SMT techniques. 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL (pp. 249-256). Association for Computational Linguistics. DOI: https://doi.org/10.3115/1220175.1220207

Bryant, C., Felice, M., Andersen, Ø. E., & Briscoe, T. (2019). The BEA-2019 shared task on grammatical error correction. Proceedings of the fourteenth workshop on innovative use of NLP for building educational applications (pp. 52-75). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W19-4406

Bryant, C., Felice, M., & Briscoe, T. (2017). Automatic annotation and evaluation of error types for grammatical error correction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (vol. 1: Long Papers, pp. 793-805). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P17-1074

Bryant, C., Yuan, Z., Qorib, M. R., Cao, H., Ng, H. T., & Briscoe, T. (2023). Grammatical error correction: A survey of the state of the art. Computational Linguistics, 49(3), 643-701. DOI: https://doi.org/10.1162/coli_a_00478

Bryant, C., & Ng, H. T. (2015). How far are we from fully automatic high quality grammatical error correction? Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (vol. 1: Long Papers, pp. 697-707). Association for Computational Linguistics. DOI: https://doi.org/10.3115/v1/P15-1068

Burstein, J., Chodorow, M., & Leacock, C. (2003). CriterionSM online essay evaluation: An application for automated evaluation of student essays. Proceedings of the Fifteenth Conference on Innovative Applications of Artificial Intelligence (pp. 3-10). American Association for Artificial Intelligence.

Cargill, T. A. (1980). The design of a spelling checker's user interface. ACM SIGOA Newsletter, 1(3), 3-4. DOI: https://doi.org/10.1145/1017923.1017924

Chollampatt, S., & Ng, H. T. (2018). A multilayer convolutional encoder-decoder neural network for grammatical error correction. Proceedings of the AAAI conference on artificial intelligence (vol. 32(1), pp. 5755-5762). Association for the Advancement of Artificial Intelligence. DOI: https://doi.org/10.1609/aaai.v32i1.12069

Chollampatt, S., Wang, W., & Ng, H. T. (2019). Cross-sentence grammatical error correction. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 435-445). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P19-1042

Dahlmeier, D., & Ng, H. T. (2011). Grammatical error correction with alternating structure optimization. Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies (pp. 915-923). Association for Computational Linguistics.

Dahlmeier, D., Ng, H. T., & Wu, S. M. (2013). Building a large annotated corpus of learner English: The NUS corpus of learner English. Proceedings of the eighth workshop on innovative use of NLP for building educational applications (pp. 22-31). Association for Computational Linguistics.

Dale, R., Anisimoff, I., & Narroway, G. (2012). HOO 2012: A report on the preposition and determiner error correction shared task. Proceedings of the Seventh Workshop on Building Educational Applications Using NLP (pp. 54-62). Association for Computational Linguistics.

Du, Z., & Hashimoto, K. (2023). Sentence-level revision with neural reinforcement learning. Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023) (pp. 202-209). Association for Computational Linguistics.

Grundkiewicz, R., Junczys-Dowmunt, M., & Gillian, E. (2015). Human evaluation of grammatical error correction systems. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 461-470). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D15-1052

Edunov, S., Ott, M., Auli, M., & Grangier, D. (2018). Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 489-500). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1045

Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378-382. DOI: https://doi.org/10.1037/h0031619

Jensen, K., Heidorn, G., Miller, L., & Ravin, Y. (1993). Parse fitting and prose fixing. Natural Language Processing: The PLNLP Approach (pp. 53-64). Springer. DOI: https://doi.org/10.1007/978-1-4615-3170-8_5

Katsumata, S., & Komachi, M. (2020). Stronger baselines for grammatical error correction using a pretrained encoder-decoder model. Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing (pp. 827-832). Association for Computational Linguistics.

Krippendorff, K. (2011). Computing Krippendorff's Alpha-Reliability. https://repository.upenn.edu/asc_papers/43.

Kwasny, S. C., & Sondheimer, N. K. (1981). Relaxation techniques for parsing grammatically ill-formed input in natural language understanding systems. American Journal of Computational Linguistics, 7(2), 99-108.

Lee, J. S. (2004). Automatic article restoration. Proceedings of the Student Research Workshop at HLT-NAACL 2004 (pp. 31-36). Association for Computational Linguistics.

Nangia, N., Vania, C., Bhalerao, R., & Bowman, S. (2020). CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1953-1967). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/2020.emnlp-main.154

Omelianchuk, K., Atrasevych, V., Chernodub, A., & Skurzhanskyi, O. (2020). GECToR-Grammatical Error Correction: Tag, not Rewrite. Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 163-170). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/2020.bea-1.16

Leacock, C., Gamon, M., & Brockett, C. (2009). User input and interactions on Microsoft Research ESL assistant. Proceedings of the Fourth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 73-81). Association for Computational Linguistics.

Leacock, C., Chodorow, M., Gamon, M., & Tetreault, J. (2014). Automated grammatical error detection for language learners (2nd ed.). Morgan & Claypool Publishers. DOI: https://doi.org/10.1007/978-3-031-02153-4

Li, W., & Wang, H. (2024). Detection-correction structure via general language model for grammatical error correction. arXiv preprint arXiv:2405.17804. DOI: https://doi.org/10.48550/arXiv.2405.17804

Marzi, G., Balzano, M., & Marchiori, D. (2024). K-Alpha Calculator-Krippendorff's Alpha Calculator: A user-friendly tool for computing Krippendorff's Alpha inter-rater reliability coefficient. MethodsX, 12, 102545. DOI: https://doi.org/10.1016/j.mex.2023.102545

Napoles, C., Sakaguchi, K., Post, M., & Tetreault, J. (2015). Ground truth for grammatical error correction metrics. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Vol. 2: Short Papers, pp. 588-593). Association for Computational Linguistics. DOI: https://doi.org/10.3115/v1/P15-2097

Ng, H. T., Wu, S. M., Briscoe, T., Hadiwinoto, C., Susanto, R. H., & Bryant, C. (2014). The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the eighteenth conference on computational natural language learning: Shared task (pp. 1-14). Association for Computational Linguistics. DOI: https://doi.org/10.3115/v1/W14-1701

Qorib, M. R., & Ng, H. T. (2022). Grammatical error correction: Are we there yet? In Proceedings of the 29th International Conference on Computational Linguistics (pp. 2794-2800). International Committee on Computational Linguistics.

Randolph, J. J. (2005). Free-marginal multirater kappa (multirater K[free]): An alternative to Fleiss' fixed-marginal multirater kappa. Presented at the Joensuu Learning and Instruction Symposium 2005 (October 14-15, 2005). http://files.eric.ed.gov/fulltext/ED490661.pdf.

Rothe, S., Mallinson, J., Malmi, E., Krause, S., & Severyn, A. (2021). A simple recipe for multilingual grammatical error correction. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (vol. 2: Short Papers, pp. 702-707). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/2021.acl-short.89

Rozovskaya, A., & Roth, D. (2010). Training paradigms for correcting errors in grammar and usage. Human language technologies: The 2010 annual conference of the north american chapter of the association for computational linguistics (pp. 154-162). Association for Computational Linguistics.

Rozovskaya, A., & Roth, D. (2021). How good (really) are grammatical error correction systems? Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (pp. 2686-2698). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/2021.eacl-main.231

Sakaguchi, K., Napoles, C., Post, M., & Tetreault, J. (2016). Reassessing the goals of grammatical error correction: Fluency instead of grammaticality. Transactions of the Association for Computational Linguistics, 4, 169-182. DOI: https://doi.org/10.18653/v1/P18-1020

Starchenko, V. M., & Starchenko, A. M. (2023). Here we go again: modern GEC models need help with spelling. Proceedings of ISP RAS, 35(5), 215-228. DOI: https://doi.org/10.15514/ISPRAS-2022-35(5)-14

Starchenko, V. M. (2024). No need to get wasteful: The way to train a lightweight competitive spelling checker. Computación y Sistemas, 28(3), 1-12. DOI: https://doi.org/10.13053/CyS-28-4-5068

Vinogradova, O., & Lyashevskaya, O. (2022). Review of practices of collecting and annotating texts in the learner corpus REALEC. International Conference on Text, Speech, and Dialogue (pp. 77-88). Springer International Publishing. DOI: https://doi.org/10.1007/978-3-031-16270-1_7

Volodina, E., Bryant, C., Caines, A., De Clercq, O., Frey, J., Ershova, E., Rosen, A., & Vinogradova, O. (2023). MultiGED-2023 shared task at NLP4CALL: Multilingual grammatical error detection. Linköping Electronic Conference Proceedings (pp. 1-16). LiU Electronic Press. DOI: https://doi.org/10.3384/ecp197001

Wang, C., Li, R., & Lin, H. (2017). Deep context model for grammatical error correction. SLaTE (pp. 167-171). International Speech Communication Association. DOI: https://doi.org/10.21437/SLaTE.2017-29

Wang, Y., Xia, Y., He, T., Tian, F., Qin, T., Zhai, C., & Liu, T. Y. (2019). Multi-agent dual learning. Proceedings of the International Conference on Learning Representations (ICLR). International Conference on Learning Representations.

Wang, Y., Wang, Y., Dang, K., Liu, J., & Liu, Z. (2021). A comprehensive survey of grammatical error correction. ACM Transactions on Intelligent Systems and Technology, 12(5), 1-51. DOI: https://doi.org/10.1145/3474840

Warrens, M. J. (2010). Inequalities between multi-rater kappas. Advances in Data Analysis and Classification, 4(4), 271-286. DOI: https://doi.org/10.1007/s11634-010-0073-4

Xie, Z., Avati, A., Arivazhagan, N., Jurafsky, D., & Ng, A. Y. (2016). Neural language correction with character-based attention. arXiv preprint arXiv:1603.09727. DOI: https://doi.org/10.48550/arXiv.1603.09727

Yannakoudakis, H., Briscoe, T., & Medlock, B. (2011). A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (pp. 180-189). Association for Computational Linguistics.

Yuan, Z., & Bryant, C. (2021). Document-level grammatical error correction. Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications (pp. 75-84). Association for Computational Linguistics.

Yuan, Z., & Felice, M. (2013). Constrained grammatical error correction using statistical machine translation. Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task (pp. 52-61). Association for Computational Linguistics.

Yuan, Z., Briscoe, T., & Felice, M. (2016). Candidate re-ranking for SMT-based grammatical error correction. Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications (pp. 256-266). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W16-0530

Zeng, M., Kuang, J., Qiu, M., Song, J., & Park, J. (2024). Evaluating prompting strategies for grammatical error correction based on language proficiency. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 6426-6430). ELRA and ICCL. DOI: https://doi.org/10.48550/arXiv.2402.15930

Zhang, Y., Zhang, B., Li, Z., Bao, Z., Li, C., & Zhang, M. (2022). SynGEC: Syntax-enhanced grammatical error correction with a tailored GEC-oriented parser. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (pp. 2518-2531). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/2022.emnlp-main.162

Zhao, J., Fang, M., Pan, S., Yin, W., & Pechenizkiy, M. (2023). GPTBIAS: A comprehensive framework for evaluating bias in large language models. arXiv preprint arXiv:2312.06315. DOI: https://doi.org/10.48550/arXiv.2312.06315

Zhou, H., Liu, Y., Li, Z., Zhang, M., Zhang, B., Li, C., Zhang, J., & Huang, F. (2023). Improving Seq2Seq grammatical error correction via decoding interventions. Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 7393-7405). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/2023.findings-emnlp.495
