Assessing Speech Proficiency in Persian: A Comparative Analysis of Artificial Intelligence Capabilities and the Saadi Foundation Reference Standard

Document Type: Research Paper

Authors

1 Islamic Language and Education Center, Al-Mustafa International University, Qom, Iran.

2 Al-Mustafa International University

3 Persian Language Instructor at Al-Mustafa International University

Abstract

The remarkable progress in Artificial Intelligence (AI), particularly in Large Language Models (LLMs), has significantly transformed the landscape of automated language proficiency assessment. While this technology has achieved success in evaluating formal and quantitative linguistic components, its adaptability to standardized frameworks that emphasize communicative competence, content accuracy, and social interaction remains an unresolved theoretical and technical challenge, particularly within the context of the Persian language. This study aims to delineate the epistemological and technical gaps of AI in conforming to the stringent requirements of the Saadi Foundation Reference Standard and to establish the boundaries of machine competence. This applied research employed a critical documentary analysis methodology. Functional statements and speaking skill descriptors across the seven levels of the Saadi Standard (ratified in 2016) were extracted. Subsequently, these expectations were subjected to qualitative analysis through a comparative matrix, juxtaposing them against the inherent technical architecture and limitations of Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) algorithms in low-resource languages. The theoretical arguments were supported by analogical data on Word Error Rate (WER) in comparable L2 contexts and observational data from LLM interactions. The analysis reveals a significant inverse correlation between "skill-level complexity" and "machine-assessment validity." At the Novice and Elementary levels, the machine demonstrates valid and substitutable performance due to the formal, quantitative, and static nature of the indicators (e.g., pronunciation accuracy, basic vocabulary). However, a deep performance gap emerges at the Intermediate (which the Saadi Standard separates into three sub-levels, unlike the CEFR) and Advanced levels.
Findings indicate that the machine's inability to interpret compensatory strategies, its blindness to cultural background knowledge, its failure to detect affective tone, its bias against non-standard accents, and its deficiency in evaluating content accuracy (stemming from the phenomenon of AI hallucination at higher levels) pose a serious threat to construct validity. Exclusive reliance on machine assessment for high-stakes testing at advanced levels leads to a negative washback effect, reducing language education to mechanical, quantifiable patterns. Consequently, the study proposes a Hierarchical Hybrid Assessment Model (Machine for Form / Human for Meaning), featuring revised human-referral criteria based on sub-score inconsistency, as an optimal and scientifically grounded solution.
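
The abstract relies on Word Error Rate (WER) figures without stating how the metric is computed. As a reminder of the standard definition, WER is the word-level edit distance between a reference transcript and the ASR output (substitutions, deletions, and insertions), divided by the number of reference words. The sketch below is a minimal illustration under that standard definition; the function name `word_error_rate` and the transliterated sample sentences are illustrative assumptions and do not come from the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper for illustration; tokenization is plain
    whitespace splitting, which is an assumption, not the study's method.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative transliterated learner utterance: one reference word is dropped.
reference = "man be madrese miravam"
hypothesis = "man madrese miravam"
print(word_error_rate(reference, hypothesis))  # 1 deletion / 4 words = 0.25
```

Because the denominator is the reference length, WER can exceed 1.0 when the recognizer inserts many extra words, which is worth keeping in mind when comparing figures across L2 contexts.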


