Testing
Narjes Khodaparast; Nasim Ghanbari; Abbas Abbasi
Abstract
Among the different factors affecting writing assessment, the rater and the rating scale are two influential variables that determine the outcome of assessment. Taking this into account, this study attempted to identify and classify raters’ behaviors in the Iranian EFL context when using analytic and holistic rating scales. To this end, nine expert raters were asked to verbalize their thoughts while rating student essays, using the analytic ESL Composition Profile and the holistic IELTS scale. Qualitative analysis of the think-aloud protocols (TAPs) yielded two themes that captured the raters’ behaviors when applying the rating scales. The findings further showed that when using the holistic scale, the raters first read the text to form an overall impression, then assessed it against their own criteria, next referred to the scale for scoring, and finally provided evidence for their scores. When applying the analytic scale, by contrast, the raters first scanned the text for surface features, then read it for an initial impression, next read each scale component and its descriptor for scoring, and finally attempted to provide evidence for their scores. Beyond being identified, these behaviors were also classified. The findings imply that diagnosing rater-rating scale interactions can unveil the strengths and weaknesses of the EFL rating process, which in turn can inform higher-quality rater training.
Testing
Elham Banisaeed; Mohammad Hashamdar; Kobra Tavassoli
Abstract
Classroom-based assessment (CBA), as one of the constructs of formative assessment, has been considered highly significant in recent years. Consequently, various tools have been designed to investigate teachers’ CBA needs and deficiencies, yet these instruments ignore the different levels of teachers’ CBA literacy. Thus, in the present study the researchers developed and validated a classroom-based assessment literacy questionnaire (CALQ) to determine teachers’ levels of CBAL. To do so, a comprehensive review of the literature was carried out to retrieve the major themes and components of CBAL, and a series of interviews was then conducted with five assessment experts and 13 experienced EFL teachers, drawing on Pill and Harding’s (2013) model of LAL and Hill and McNamara’s (2012) scope and dimensions of CBA, in addition to teachers’ assessment literacy beliefs. Accordingly, a 41-item questionnaire (CALQ) was developed. To examine the reliability and validity of the CALQ, 318 EFL teachers, selected through non-probability convenience sampling, were asked to complete the questionnaire. Cronbach’s alpha demonstrated an acceptable reliability index, and factor analysis showed that the items loaded on six factors: illiteracy (6 items), nominal literacy (11 items), functional literacy (6 items), procedural and conceptual literacy (6 items), multidimensional literacy (6 items), and assessment literacy beliefs (6 items). Structural equation modeling (SEM) further showed that the model enjoyed good psychometric properties, so the CALQ can be considered advantageous for assessing teachers’ CBAL and for facilitating materials preparation in designing instructional courses that develop EFL teachers’ CBAL.
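As a rough illustration of the reliability step reported above, the sketch below computes Cronbach’s alpha for a respondents-by-items matrix. It is not the authors’ analysis code; the simulated data merely mirror the figures in the abstract (318 respondents, 41 items on a 5-point scale).

```python
# Illustrative sketch (not the authors' code): Cronbach's alpha for a
# Likert questionnaire such as the 41-item CALQ described above.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix of numeric responses."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=(318, 41)).astype(float)  # simulated data
# Random data gives alpha near zero; real item responses would be correlated.
print(f"alpha = {cronbach_alpha(responses):.3f}")
```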
Testing
Shohreh Bahrami Qalenoee; Jamileh Rahemi
Abstract
Over the past decades, writing assessment research has concentrated on alternative methods with a socially oriented view of assessment, including dynamic assessment (DA). Given the lack of research juxtaposing the interventionist and interactionist DA frameworks in the area of narrative writing, this study sought to compare the effectiveness of Brown’s graduated prompts model and Poehner’s model in developing the grammatical accuracy of one-paragraph narrative essays. The study followed a quasi-experimental design, with 15 Iranian EFL learners selected via convenience sampling from among the female students of a language institute in Tehran. The participants were then randomly divided into three groups: an interventionist group, in which mediation was based on Brown’s model in the sandwich format; an interactionist group, where mediation followed Poehner’s model in the cake format; and a non-dynamic assessment (NDA) control group with no mediation. The research consisted of three pilot sessions and eleven main-phase sessions. Both descriptive and non-parametric inferential statistics were run to analyze the data. The results confirmed the superiority of both DA approaches over NDA, whereas no significant difference was observed between the two DA groups in their overall performance on the narrative tasks. However, analysis of the number and types of mediational moves required over the DA sessions indicated the superiority of the interactionist model over the interventionist framework in developing grammatical accuracy in narrative paragraphs. The study offers some theoretical and pedagogical implications for educators, curriculum designers, and L2 teachers.
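The abstract reports “non-parametric inferential statistics” without naming the tests. With three small independent groups, one standard choice is a Kruskal-Wallis test followed by pairwise Mann-Whitney U tests, sketched below on invented scores; treat the choice of tests as an assumption, not the authors’ procedure.

```python
# Hypothetical illustration: a Kruskal-Wallis omnibus test across the
# three groups (n = 5 each), then a pairwise Mann-Whitney U follow-up.
# The accuracy scores below are invented for demonstration only.
from scipy import stats

interventionist = [14, 16, 15, 17, 13]
interactionist  = [15, 17, 16, 18, 14]
nda_control     = [10, 11,  9, 12, 10]

h, p = stats.kruskal(interventionist, interactionist, nda_control)
print(f"Kruskal-Wallis H = {h:.2f}, p = {p:.4f}")

u, p_pair = stats.mannwhitneyu(interventionist, interactionist)
print(f"interventionist vs. interactionist: U = {u:.1f}, p = {p_pair:.4f}")
```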
Testing
Mohammad Reza khodashenas; Hossein Khodabakhshzadeh; Purya Baghaei; Khalil Motallebzadeh
Abstract
Language assessment literacy has been addressed in a wealth of research. However, although many studies have attempted to measure teachers’ assessment literacy, there is still a gap that prompted us to investigate the area from the perspective of EFL teachers’ assessment literacy needs. To this end, and in line with the changes in classroom assessment over the past decades, this study set out to develop and validate an inventory of Teachers’ Assessment Literacy Needs (TALNs), drawing on Fulcher’s (2012) assessment literacy framework as the analytic model guiding the study. In the first stage, a set of items was generated through an extensive review of the relevant studies. In the quantitative phase, the developed inventory was administered to 159 English as a foreign language teachers selected through convenience sampling. An inventory construction and validation framework consisting of exploratory analyses was used to examine the construct validity of the proposed inventory. The results indicated that the inventory is best explained by four components: knowledge of language assessment literacy, consequences of language assessment literacy, processes of language assessment literacy, and teachers’ expectations of Continuing Professional Development (CPD) programs. The TALNs inventory developed in this study is intended to help practitioners and researchers investigate teachers’ needs in assessment literacy.
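A minimal sketch of the exploratory step described above, assuming a varimax-rotated factor analysis with four retained factors; the simulated responses and the item count are placeholders, not the authors’ instrument.

```python
# Sketch of an exploratory factor analysis like the one reported above
# (four components for the TALNs inventory). Data are simulated; the
# 20-item matrix is a hypothetical stand-in for the real inventory.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
responses = rng.integers(1, 6, size=(159, 20)).astype(float)  # 159 teachers

fa = FactorAnalysis(n_components=4, rotation="varimax")
fa.fit(responses)
loadings = fa.components_.T            # items x factors loading matrix
print(np.round(loadings[:5], 2))       # loadings of the first five items
```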
Testing
Ali Mohammadi Darabad; Gholam-Reza Abbasian; Bahram Mowlaie; Ali Asghar Rostami Abusaeedi
Abstract
Classroom performance assessment has gained prominence parallel to the multiplicity of purposes assessment is expected to serve. Of these, the major controversy, which was the motive behind this study, is the incorporation of L1-based elicitation as a valid measure of second language (L2) performance assessment. To shed empirical light on this issue, this explanatory sequential mixed-methods study employed 87 Iranian intermediate EFL learners, whose L2 classroom performance was assessed through L1-based elicitation techniques. To validate this mechanism, a multi-method mono-trait model (namely, Pearson correlation, structural equation modeling, exploratory and confirmatory factor analysis, composite reliability, and convergent validity), as suggested by Henning and by Messick’s unitary concept of validity, was applied. The results from these multiple sources of evidence lend support to the consensus that L1-based elicitation techniques are valid measures of L2 performance assessment. The findings thus support the educational implications of L1-based mechanisms in both L2 instruction and assessment.
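Two of the indices named above, composite reliability (CR) and average variance extracted (AVE) for convergent validity, are conventionally computed from standardized factor loadings with the Fornell-Larcker formulas. The loadings below are invented for demonstration; this is not the article’s data.

```python
# Illustrative computation (not the authors' code) of composite
# reliability and average variance extracted from standardized CFA
# loadings. CR = (sum(l))^2 / ((sum(l))^2 + sum(e)); AVE = mean(l^2).
import numpy as np

loadings = np.array([0.72, 0.68, 0.81, 0.75, 0.70])  # invented loadings
errors = 1 - loadings**2                              # error variances

cr = loadings.sum()**2 / (loadings.sum()**2 + errors.sum())
ave = (loadings**2).mean()

print(f"CR  = {cr:.3f}  (>= .70 is conventionally adequate)")
print(f"AVE = {ave:.3f}  (>= .50 suggests convergent validity)")
```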
Testing
Sayyed Mohammad Alavi; Hossein Karami; Mohammad Hossein Kouhpaeenejad
Abstract
Measurement has been ubiquitous in all areas of education for at least a century. Various methods have been suggested to examine the fairness of educational tests, especially in high-stakes contexts. The present study adopted the newly proposed ecological approach to differential item functioning (DIF) to investigate the fairness of the Iranian nationwide university entrance exam. To this end, actual data from an administration of the test were obtained and analyzed through both traditional logistic regression and latent class analysis (LCA) techniques. The initial DIF analysis through logistic regression revealed that 19 items (out of 70) showed either uniform or non-uniform DIF. Further examination of the sample through LCA showed that the sample was not homogeneous: class enumeration identified three latent classes. DIF analysis for the separate latent classes revealed serious differences in the number of DIF items identified in each class, ranging from zero items in latent class 3 to 43 items in latent class 2. The inclusion of covariates in the model also showed that latent class membership could be significantly predicted from high school GPA, field of study, and acceptance quota. It is argued that the fairness of the test might be called into question. The implications of the findings for the validity of the test are discussed in detail.
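The logistic-regression DIF procedure mentioned above is conventionally specified as item response ~ total score + group + score x group (Swaminathan & Rogers, 1990): a significant group term indicates uniform DIF, and a significant interaction indicates non-uniform DIF. The sketch below fits this model to simulated data with uniform DIF built in; it is an illustration, not the study’s analysis.

```python
# Sketch of logistic-regression DIF screening on simulated data.
# A uniform DIF effect (the 0.5 * group term) is deliberately built in.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 500
df = pd.DataFrame({
    "total": rng.normal(35, 10, n),      # total test score (matching criterion)
    "group": rng.integers(0, 2, n),      # 0 = reference, 1 = focal group
})
logit_p = -3 + 0.08 * df["total"] + 0.5 * df["group"]
df["item"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

model = smf.logit("item ~ total + group + total:group", data=df).fit(disp=0)
print(model.summary().tables[1])  # inspect the group and interaction terms
```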
Testing
Mahmood Dehqan; Seyyedeh Raheleh Asadian Sorkhi
Abstract
Teacher assessment literacy plays a pivotal role in teacher education programs; however, there seems to be a lack of either assessment literacy itself or its implementation. Using an online assessment course covering both theoretical and practical issues, this mixed-methods study examined 16 teachers’ (8 in-service and 8 pre-service) assessment literacy and the extent to which they implement this knowledge. The quantitative part explored the participants’ assessment literacy, while the qualitative phase examined the validation of the quantitative results as well as the implementation of assessment literacy in practice. Data were collected via valid and reliable questionnaires, one adapted from Mertler (2003) and two others developed by the researchers, along with a practical assessment project. The results indicated that although the in-service teachers were more assessment literate at entry owing to their experience, they showed a lower degree of assessment literacy at exit than the pre-service teachers. The qualitative analysis revealed the teachers’ lack of preference for using assessment literacy in their classroom practice. The study suggests including both theoretical and practical dimensions of assessment literacy in teacher education programs and proposes an in-depth investigation into the difficulties that hinder teachers from putting their theoretical assessment knowledge into practice.
Testing
Seyyed Mohammad Alavi; Mahboube Shahsavar; Mohammad Hossein Norouzi
Abstract
Computerized dynamic assessment (CDA), encouraged by Brown and colleagues’ graduated prompt approach, is grounded in Vygotsky’s sociocultural theory (SCT) of mind and its concept of the zone of proximal development (ZPD). It emerged to respond to the challenge of implementing DA in large classes and to meet the psychometric requirements of assessment. To this end, the present study attempted to design a computerized dynamic assessment tool to diagnose learners’ development of pragmatic competence, specifically their knowledge of the speech acts of apology and request. Sixty BSc students of engineering, aged 18-24, participated in the study; their proficiency ranged across the pre-intermediate, intermediate, and upper-intermediate levels. In the course of CDA, they were presented with 30 multiple-choice discourse completion tests of apology and request and were required to choose what they would say in each situation. For each unacceptable response, the participants received pre-established mediational hints, arranged from the most implicit to the most explicit. Finally, to diagnose learners’ development, their test performance, including their actual score, mediated score, and learning potential score (LPS), was instantly displayed. A paired-samples t-test showed development in the learners’ mediated scores, and a univariate analysis of variance showed no interaction between mediation and proficiency level. Teachers can use this supplementary dynamic assessment tool to diagnose learners’ development of pragmatic competence.
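The abstract does not give the formula behind the learning potential score. One common operationalization in CDA work is Kozulin and Garb’s (2002) LPS = (2 x mediated - actual) / maximum score; the sketch below applies that formula and runs the paired-samples t-test on invented scores, so treat the formula choice as an assumption rather than the study’s exact procedure.

```python
# Hypothetical illustration of LPS computation and the paired comparison
# of actual vs. mediated scores. All scores below are invented.
from scipy import stats

def lps(actual: float, mediated: float, max_score: float) -> float:
    """LPS = (2 * mediated - actual) / max_score (Kozulin & Garb, 2002)."""
    return (2 * mediated - actual) / max_score

actual_scores   = [14, 16, 12, 18, 15]   # invented, out of 30 items
mediated_scores = [22, 24, 20, 26, 23]
print([round(lps(a, m, 30), 2) for a, m in zip(actual_scores, mediated_scores)])

t, p = stats.ttest_rel(actual_scores, mediated_scores)
print(f"paired t = {t:.2f}, p = {p:.4f}")
```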
Testing
Hamdollah Ravand; Gholamreza Rohani; Tahereh Firoozi
Abstract
The generalizability aspect of construct validity, as proposed by Messick (1989), requires that a test measure the same trait across different samples from the same population. Differential item functioning (DIF) analysis is a key component of the fairness evaluation of educational tests. The university entrance exam for candidates seeking admission into master’s English programs (MEUEE) at Iranian state universities is a very high-stakes test whose fairness is a promising line of research. The current study explored gender and major DIF in the general English (GE) section of the MEUEE using multiple-indicators multiple-causes (MIMIC) structural equation modeling. The data of all the test takers (n = 21,642) who took the GE section of the MEUEE in 2012 were analyzed with Mplus. To determine whether an item should be flagged for DIF, both practical and statistical significance were considered. The results indicated that 12 items were flagged for DIF in terms of statistical significance; however, only 5 of them showed practical significance. The flagged items alert test developers and users to potential sources of construct-irrelevant variance in the test scores, which may call into question comparisons of test takers’ performance, especially when the test is used for selection purposes.
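A MIMIC DIF model of the kind described above is standardly written as follows; this is the generic specification, not the article’s Mplus syntax (which the abstract does not show). The latent GE ability is regressed on the grouping covariate (gender or major), and a nonzero direct effect of the covariate on an item, over and above the latent trait, flags that item for uniform DIF.

```latex
% Standard MIMIC DIF specification (generic form, not the study's syntax).
% eta = latent GE ability; x = grouping covariate; y_i^* = latent response
% propensity of item i; beta_i = direct (DIF) effect of x on item i.
\[
  \eta = \gamma x + \zeta, \qquad
  y_i^{*} = \lambda_i \eta + \beta_i x + \varepsilon_i, \qquad
  \text{item } i \text{ shows uniform DIF} \iff \beta_i \neq 0 .
\]
```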
Testing
Alireza Ahmadi
Abstract
The studies conducted so far on the effectiveness of resolution methods, including the discussion method, in resolving rating discrepancies have yielded mixed results. What has been left unnoticed in the literature is the potential of discussion to be used as a training tool rather than a resolution method. The present study addresses this gap by exploring data on the rating behaviors of 5 Iranian raters of English. Qualitative analysis indicated that the discussion method can serve the function of training raters: it helped the raters rate more easily, quickly, and confidently, improved their understanding and application of the rating criteria, and enabled them to justify their scoring decisions. Many-facet Rasch analysis also supported the beneficial effects of discussion in terms of improvements in rater severity, consistency of scoring, and use of the scale categories. The findings provide insight into the potential of discussion as a training tool, especially in EFL contexts where lack of access to expert raters can be an obstacle to holding training programs. The author argues for future studies to focus on how discussion may function depending on the rating scale used.
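The many-facet Rasch model underlying such an analysis (Linacre, 1989) is standardly written as below; the abstract does not print the model, so this is the textbook form rather than the author’s own specification.

```latex
% Many-facet Rasch model (Linacre, 1989), generic form.
% P_{njik} = probability that rater j awards examinee n category k
% on criterion/item i.
\[
  \log\!\left(\frac{P_{njik}}{P_{nji(k-1)}}\right) = B_n - D_i - C_j - F_k ,
\]
% where B_n is examinee ability, D_i criterion difficulty, C_j rater
% severity, and F_k the step difficulty of scale category k.
```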