Modeling Student Evaluations of Writing and Authors as a Function of Writing Errors

Writers are often judged by their audience, and these evaluations can encompass both the text and the authors. This study built upon prior research on writing evaluation and error perceptions to examine how interconnected or separable these judgments are. Using a within-subjects design, college students evaluated four essays demonstrating no errors, lower-level errors, higher-level errors, or both types. Evaluations included writing quality traits (e.g., conventions, ideas, organization, sentence fluency, and voice) and author characteristics (e.g., creativity, intelligence, generosity, and kindness). Exploratory factor analyses identified latent constructs within these ratings. One construct, Writing Quality and Skill, appeared to combine writing traits and authors' intellectual ability (e.g., intelligence and knowledgeability). The second construct, Author Personality, seemed to comprise interpersonal author traits (e.g., kindness and loyalty). The two constructs were significantly and positively correlated. These results suggest that students tended to form holistic impressions of writing quality and authors rather than distinct judgments about individual traits. The spillover onto perceptions of authors' personal characteristics may be representative of latent biases. Student raters were also more sensitive to lower-level errors than higher-level errors. Implications for biases and training related to peer assessment are discussed.


Introduction
Writing and writing evaluation are complex processes that require the development of substantial knowledge and meta-knowledge about language, text, genre, composition, and communication (e.g., Elton, 2010; Flower, Hayes, Carey, Schriver, & Stratman, 1986; Olinghouse, Graham, & Gillespie, 2014; Panadero & Jonsson, 2013; Reiff & Bawarshi, 2011; Wang & Engelhard, 2019). One specific application of this expertise pertains to the detection and assessment of writing errors. There are numerous prescriptions, genre conventions, and other 'rules of writing' to consider (Devitt, 2004; Hacker & Sommers, 2016; Hyland, 2007), such as rules for spelling, grammar, and punctuation. Similarly, writing genres might specify criteria for evidence and logical reasoning (e.g., argumentative writing) or characterization and plotting (e.g., narrative writing). Moreover, there are many ways to write well (Crossley, Roscoe, & McNamara, 2014), and variations in style and content interact with audience and background (Magnifico, 2010; McNamara, 2013). Writing evaluators must decide when and whether expectations have been violated (which one might refer to as 'writing errors'), and the complex and subjective nature of writing evaluation means that these decisions could be susceptible to bias or other misleading beliefs. Even experienced raters can be influenced by factors such as race, gender, and class; for example, texts written in African-American Vernacular English may be judged as lower in quality than texts written in Standard American English (Godley & Escher, 2012; Johnson & VanBrackle, 2012).

Research comparing student and instructor assessments has found that students were more positive than the instructors, but with increasing expertise (i.e., lower-ability undergraduates vs. higher-ability undergraduates vs. graduate students vs. postgraduate students) this difference decreased (Patchan et al., 2009; Cho, Schunn, & Charney, 2006; Topping et al., 2000). Expertise also affected the focus of the feedback provided. For example, Patchan et al. (2009) compared peer feedback generated by history undergraduates to feedback from their history instructor and a writing instructor. The history instructor primarily noted issues with the history content, whereas the writing instructor focused on solutions to high prose issues; the students usually fell somewhere between the two instructors. Patchan et al. (2011) also compared peer feedback generated by physics undergraduates versus their non-native English-speaking graduate student teaching assistants (TAs). The students provided longer comments and focused more often on high prose than the TAs, and they provided feedback about the physics content just as often as the TAs. Overall, students are able to assess writing and writing errors, but they tend to focus on superficial or complementary issues compared to teachers (e.g., Topping et al., 2000).

Prior Study
In a previous study, Johnson et al. (2017; and see Method) focused specifically on college students' evaluations of writing as a function of writing error patterns, and further considered how these evaluations extended to judgments about authors' personal characteristics. That study addressed two primary research questions: how do lower- and higher-level errors influence students' ratings of (1) writing quality and (2) author characteristics?
To answer the above questions, the researchers constructed a set of essays that exhibited 'no errors,' only 'lower-level errors' (e.g., spelling, grammar, and punctuation), only 'higher-level errors' (e.g., ideas, argument, and organization), or both. Participating students then rated four essays that each (a) exhibited a distinct error pattern and (b) appeared to be written by different authors. Ratings included eight writing traits (e.g., conventions, organization, sentence fluency, and voice) and eight author traits (e.g., intelligence, generosity, kindness, and knowledgeability). Researchers analyzed trait ratings as distinct judgments-means, standard deviations, and intercorrelations (see Johnson et al., 2017) were reported for each trait and each error pattern. The authors also reported average 'writing trait' and 'author trait' ratings that aggregated all traits within their respective categories. Johnson et al. (2017) observed that the presence of writing errors led college students to perceive both writing quality and authors more negatively. When essays exhibited errors, students gave significantly lower ratings on writing traits (e.g., conventions, organization, and sentence fluency) and also lower ratings on the eight author traits (e.g., generosity, kindness, and intelligence). These effects were observed for both lower-level mechanical errors (e.g., spelling and grammar) and higher-level conceptual and rhetorical errors (e.g., missing theses, contradictory arguments, and off-topic examples). Importantly, the effects were stronger for lower-level errors. Some college students did not notice the higher-level errors at all (i.e., gave equivalent ratings to essays with 'higher-level errors only' versus essays with 'no errors'). A key finding, however, was that students indeed made unwarranted judgments about authors based on writing errors: there was no reason to infer that a person was less generous, kind, or loyal due to typos or muddled arguments, and yet students appeared to make such inferences.
One limitation of the prior study is that analyses of either separate or aggregate ratings implied assumptions about how the judgments were (or were not) interconnected. It is possible that students generated distinct evaluations for each trait (i.e., eight writing traits and eight author traits). For instance, when rating 'sentence fluency' and 'organization,' students may have considered these text qualities independently. Likewise, students may have made separate judgments about authors' 'kindness' or 'generosity.' An alternative possibility is that students conceptually combined one or more traits-that is, students' assessments of sentence fluency and organization, or of kindness and generosity, may have been driven by a muddled or blended understanding of these constructs. More importantly, the same unanswered questions apply to whether students evaluated writing quality separately from author characteristics. Perhaps students made only a single holistic judgment that a text was 'good' or 'bad,' which then influenced their ratings of all individual writing and author traits.
A more technical way to frame these questions is in terms of the latent constructs employed by student raters (i.e., factor analysis). Do aspects of writing quality (e.g., conventions and organization) load on one or more latent factors? And, are those factors separate from author traits (e.g., intelligence and generosity)?
Alternatively, perhaps writing quality and author characteristics load on a single latent factor, implying that they are, in practice, a singular assessment construct. A related issue is how various error patterns influence this interplay between writing and author judgments. Are these evaluations more or less interwoven when a text is relatively free of errors, exhibits only lower-level errors (e.g., spelling), only higher-level errors (e.g., illogical arguments), or both kinds of errors? Prior research has found that different error patterns are not perceived equally by student raters (Johnson et al., 2017), and thus the presence of different errors might plausibly affect students' latent assessment constructs. With respect to peer assessment of writing, answers to these questions have implications for the extent to which students' evaluations of writing may incorporate interpersonal biases and how training might decouple or address this overlap.

Materials and Methods
The current work entails a new, extended analysis of previously collected data. Complete details about data collection (i.e., population, sampling, measures, and materials) are reported in Johnson et al. (2017). However, essential methodological details are reiterated here for clarity.

Participants
Undergraduate students (n = 70) from a large university in the southwestern United States were recruited from Introduction to Psychology courses and compensated via course credit. Participants self-reported a mean age of 20.7 years (SD = 4.7), and 34.3% identified as female. Participants identified as African-American (2.9%), Asian (15.7%), Caucasian (42.9%), Hispanic (8.6%), Middle Eastern (22.9%), or Other (7.1%, including multiethnic individuals). The sample was primarily freshmen (54.3%), but also included sophomores, juniors, and seniors. Participants reported a range of academic majors, including aviation, business, computing, engineering, life sciences, and other/undeclared.

Research Design and Essay Materials
The study employed a one-way, within-subjects design in which all participants read and rated a total of four essays that each demonstrated a different error pattern: No Errors, Low-Level Errors Only, High-Level Errors Only, or All Errors. The essays were constructed by the researchers (see Johnson et al., 2017), but participants were informed that each essay was authored by another student. Specifically, the researchers created essays ostensibly written by four different student authors who expressed unique positions, arguments, and examples. Participants were not given information about the supposed student authors' backgrounds (e.g., race or native language) or writing tools (e.g., access to spelling and grammar checking software).
To construct the essay stimuli, the researchers initially drafted four original argument essays in response to a prompt on 'patience' that asked, 'Is it better for people to act quickly and expect quick responses from others rather than to wait patiently for what they want?' These initial essays were revised until all or most mechanical and conceptual errors were removed-subsequently referred to as No Errors essays. For Low-Level Errors Only essays, every paragraph was modified to include errors in spelling and homophones, capitalization, sentence fragments or run-ons, commas or apostrophes, and verb-noun or tense agreement. However, these essays still contained clear thesis statements, topic sentences, and relevant examples. In contrast, High-Level Errors Only essays were mechanically correct but modified to exhibit missing thesis statements and topic sentences, missing evidence, off-topic examples, and contradictory evidence. Finally, All Errors essays included both error patterns. Altogether, each of the four error patterns was implemented for each of the four 'student authors,' resulting in 16 total essays. All essays were about 600-650 words in length. To confirm that the constructed essays demonstrated the intended experimental conditions, four expert raters categorized the error pattern of all stimulus essays. Raters exhibited 95.3% accuracy (i.e., experts' categorizations matched the intended patterns) and three of the four raters exhibited perfect accuracy, suggesting that the essay creation process was successful.
Participants were randomly assigned to read and rate four essays such that each error pattern and each supposed student author were encountered only once. The sequence of error patterns and authors, along with the error-author pairings, was systematically randomized across participants to control for order effects. Importantly, participants had no knowledge of the experimental manipulation or the intended error pattern of the essays.
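To make this counterbalancing concrete, the brief sketch below illustrates one way such an assignment could be generated. It is a minimal illustration under stated assumptions, not the exact scheme used in the original study; the condition labels, author placeholders, and use of simple seeded shuffling are all hypothetical.

```python
import random

# Hypothetical labels standing in for the four error patterns and the
# four supposed student authors described above.
ERROR_PATTERNS = ["No Errors", "Low-Level Errors Only",
                  "High-Level Errors Only", "All Errors"]
AUTHORS = ["Author A", "Author B", "Author C", "Author D"]

def assign_essays(participant_seed):
    """Pair each error pattern with a distinct author and shuffle the
    presentation order, so a participant sees every pattern and every
    author exactly once."""
    rng = random.Random(participant_seed)
    authors = AUTHORS[:]
    rng.shuffle(authors)                      # randomize error-author pairing
    sequence = list(zip(ERROR_PATTERNS, authors))
    rng.shuffle(sequence)                     # randomize presentation order
    return sequence

# Example: print the essay sequence for the first three participants
for pid in range(3):
    print(pid, assign_essays(pid))
```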

Background Survey
Participants reported their age, gender, race/ethnicity, school year, and academic major at the start of the study.

Writing Quality Ratings
Immediately after reading each essay, participants rated eight writing quality traits. Six traits were selected based on the Six Traits Writing Rubric (Spandel, 2000): Conventions, Ideas and Content, Organization, Sentence Fluency, Voice, and Word Choice. Two additional traits, Enjoyment and Persuasiveness, sought to elicit evaluations of how pleasurable and convincing the essays were, respectively. Participants were introduced to the traits along with brief, concrete descriptions framed as question prompts (see Table 1). However, formal rubric-referenced training was not provided because our aim was to investigate students' perceptions of writing, authors, and errors rather than adherence to a rubric (i.e., a detailed rubric might have heavily influenced those perceptions). Participants rated their agreement with a series of statements, one per trait (see Table 1), on a scale of 1 ('Very Strongly Disagree') to 10 ('Very Strongly Agree').

Author Characteristics Ratings
Participants judged eight student author traits: Creativity, Generosity, Hard-working, Intelligence, Kindness, Knowledgeability, Loyalty, and Thoughtfulness (see Table 1). Several traits were somewhat more intellectual (e.g., Creativity, Intelligence, Knowledgeability, and Thoughtfulness) and others were more interpersonal (e.g., Generosity, Hard-working, Kindness, and Loyalty). Participants again rated their agreement with a series of statements (see Table 1) on a scale from 1 ('Very Strongly Disagree') to 10 ('Very Strongly Agree').

Procedure
Participants completed the study in a single session (60-90 minutes) that included informed consent and all rating tasks. Ratings of each essay were made immediately after reading that essay. Participants reviewed only one essay at a time.

Exploratory Factor Analysis
To assess whether raters' latent judgments of writing quality and author characteristics were distinct or overlapping (RQ1, RQ2, and RQ3), we conducted exploratory factor analyses (EFAs) using Mplus v.8.0 software (Muthén & Muthén, 1998). Four separate EFAs were conducted, one for each of the four essay types: No Errors, Low-Level Errors, High-Level Errors, and All Errors (RQ4). Each EFA included all eight essay quality and all eight author characteristic variables (i.e., 16 variables per EFA). Maximum likelihood with robust standard errors (MLR) was selected as the estimation method due to the small sample size (n = 70) and because skewness and kurtosis statistics for several essay quality variables indicated statistically significant departures from univariate normality. MLR estimation is robust against violations of normality assumptions and is more appropriate for small samples than the default maximum likelihood estimation procedure (Byrne, 2013).
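Although these analyses were conducted in Mplus, a rough open-source analogue can be sketched with Python's factor_analyzer package. The snippet below fits candidate one- through six-factor solutions with maximum likelihood extraction and oblimin rotation for a single essay type; the input file name and column layout are hypothetical, and plain ML is used because the package does not offer Mplus-style MLR robust standard errors.

```python
# Approximate analogue of the Mplus EFAs using the open-source
# factor_analyzer package (pip install factor-analyzer).
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Hypothetical file: one row per participant, 16 columns holding the
# 8 writing-trait and 8 author-trait ratings for one essay type.
ratings = pd.read_csv("no_errors_ratings.csv")

# Fit candidate solutions with one to six factors (rotation is only
# meaningful when more than one factor is extracted).
for k in range(1, 7):
    efa = FactorAnalyzer(n_factors=k, method="ml",
                         rotation="oblimin" if k > 1 else None)
    efa.fit(ratings)
    _, _, cum_var = efa.get_factor_variance()
    print(f"{k} factor(s): cumulative variance explained = {cum_var[-1]:.2f}")

# Inspect the retained two-factor solution
efa2 = FactorAnalyzer(n_factors=2, method="ml", rotation="oblimin")
efa2.fit(ratings)
print(pd.DataFrame(efa2.loadings_, index=ratings.columns))  # pattern matrix
print(efa2.phi_)  # inter-factor correlation matrix (set for oblique rotations)
```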
Given the small sample size, three methods were implemented to evaluate the adequacy of the sample for exploratory factor analysis. For each EFA, we first considered the Kaiser-Meyer-Olkin (KMO) statistic, which ranges from 0.00 to 1.00. Within this range, values between 0.70 and 0.80 signify 'good' sampling adequacy, values between 0.80 and 0.90 are 'very good,' and values above 0.90 are 'excellent' (Field, 2013). Second, we considered the number of variables whose communalities were above 0.60. A variable's communality is the proportion (ranging from 0.00 to 1.00) of its variance that it shares with the other measures. MacCallum, Widaman, Zhang, and Hong (1999) stated that when all communalities are above 0.60, smaller sample sizes (n < 100) may be acceptable. Finally, we considered the number of factor loadings per factor that were ≥ 0.60. Guadagnoli and Velicer (1988) stated that a factor with four or more loadings ≥ 0.60 is reliable regardless of the sample size (see also Beavers et al., 2013). These sampling adequacy metrics are reported for each analysis.
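As a concrete illustration, the snippet below computes the KMO statistic with factor_analyzer's helper and the initial communalities as squared multiple correlations, then counts the values meeting the cutoffs described above; the input file name is hypothetical, and the loading criterion can be checked directly against the pattern matrix from the previous sketch.

```python
# Sampling adequacy checks: KMO and initial communalities (squared
# multiple correlations, SMCs), counted against the cutoffs above.
import numpy as np
import pandas as pd
from factor_analyzer.factor_analyzer import calculate_kmo

ratings = pd.read_csv("no_errors_ratings.csv")  # hypothetical file

# 1. Kaiser-Meyer-Olkin statistic (.70-.80 good, .80-.90 very good, > .90 excellent)
_, kmo_total = calculate_kmo(ratings)
print(f"KMO = {kmo_total:.2f}")

# 2. Initial communality of each variable = 1 - 1 / diagonal of inverse correlation matrix
R = np.corrcoef(ratings.to_numpy(), rowvar=False)
smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
print("Variables with initial communality >= .60:", int((smc >= 0.60).sum()))
```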
Prior to proceeding with EFA, it was also necessary to ensure the presence of sufficient covariation among the observed variables. Descriptive statistics are presented in Table 2, and Tables 3 and 4 present the correlation matrices. With a handful of exceptions, correlations were generally moderate to strong, indicating that EFA was appropriate. Given the moderate to large correlations among the measures within each essay type, an oblique rotation was selected; this approach seeks a simple structure while allowing the extracted factors to correlate. EFA models were estimated to test between one and six latent factors. An upper limit of six factors was selected because a larger number would result in factors with fewer than three variables, which would likely terminate the estimation procedures.

Table 1. Writing and Author Traits, Prompts, and Assessment Statements

Writing Traits

Conventions
Does the essay show correct use of spelling, capitalization, punctuation, and grammar?
The essay correctly followed writing conventions (spelling, punctuation, and grammar).

Enjoyable
Is the essay enjoyable or interesting to read?
The essay was enjoyable and interesting to read.

Ideas and Content
Does the essay include a clear main idea? Are ideas supported with relevant details?
The essay contained good ideas and content (main ideas and supporting details).

Organization
Is the essay logically organized? Does the essay include a clear introduction and conclusion?
The essay was organized well (structure, introduction, and conclusion).

Persuasive
Is the essay persuasive and convincing?
The essay was persuasive and convincing.

Sentence Fluency
Does the essay have a smooth flow? Does the essay show effective sentence variety?
The essay demonstrated effective sentence fluency (rhythm, flow, and variety).

Voice
Does the essay convey a clear personality? Does the essay demonstrate awareness of the audience?
The essay demonstrated a clear voice (personality and sense of audience).

Word Choice
Does the essay include carefully chosen wording? Does the essay include vivid images?
The essay used effective word choice (precise and vivid wording).

Author Traits

Creativity
Is the author a creative and innovative person?
The author is a creative person.

Effort
Is the author a hard-working person?
The author is a hard-working person.

Generosity
Is the author a generous and giving person?
The author is a generous person.

Intelligence
Is the author an intelligent and smart person?
The author is an intelligent person.

Kindness
Is the author a kind and caring person?
The author is a kind person.

Knowledge
Is the author a knowledgeable and well-read person?
The author is a knowledgeable person.

Loyalty
Is the author a loyal and supportive person?
The author is a loyal person.

Thoughtfulness
Is the author a thoughtful and reflective person?
The author is a thoughtful person.
When evaluating the results of the EFA analyses, four metrics were inspected to select the optimal factor solution and determine whether to retain or omit a given factor. First, Kaiser's criterion (1974; see also Ruscio & Roache, 2012) was used, which retains factors whose eigenvalues are ≥ 1.00. In addition, scree plots were examined to identify the point of inflection, and the factors above the point of inflection were retained. Second, parallel analysis (Horn, 1965) was used, retaining factors whose observed eigenvalues were larger than the corresponding eigenvalues generated from random data (see also Ruscio & Roache, 2012). Third, factors were retained if corresponding variables demonstrated appreciable loadings of ≥ 0.60; extracted factors were dropped if they had no corresponding variables with appreciable loadings. Fourth, factor solutions were selected based on their interpretability. Ultimately, EFA is a statistical method for arriving at a theoretical understanding of a phenomenon; therefore, an interpretable EFA solution is preferred when there are competing models and the other selection criteria are unclear.
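For readers less familiar with these retention rules, the sketch below shows how Kaiser's criterion and Horn's parallel analysis could be computed from a ratings matrix. It is a minimal sketch under stated assumptions: the simulated 70 × 16 matrix merely stands in for the real data, and comparing observed eigenvalues to the mean of the random-data eigenvalues is one common variant of parallel analysis.

```python
# Kaiser's criterion and Horn's (1965) parallel analysis, minimally sketched.
import numpy as np

def retention_checks(data, n_sims=1000, seed=0):
    """Return the number of factors retained by Kaiser's criterion and by
    parallel analysis for an n-by-p matrix of ratings."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    observed = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]
    random_eigs = np.empty((n_sims, p))
    for i in range(n_sims):
        sim = rng.standard_normal((n, p))  # random data of the same shape
        random_eigs[i] = np.linalg.eigvalsh(np.corrcoef(sim, rowvar=False))[::-1]
    kaiser = int(np.sum(observed >= 1.0))                        # eigenvalues >= 1
    parallel = int(np.sum(observed > random_eigs.mean(axis=0)))  # exceed random data
    return kaiser, parallel

# Illustrative call with simulated data standing in for the 70 x 16 rating matrix
demo = np.random.default_rng(1).standard_normal((70, 16))
print(retention_checks(demo))
```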
Finally, three model fit indices were inspected to evaluate models of latent judgments: the chi-square statistic, the root mean square error of approximation (RMSEA; Steiger & Lind, 1980), and the standardized root mean square residual (SRMR). First, a non-significant chi-square value indicates a good-fitting model; in the absence of a non-significant chi-square value, models with lower values are considered better fitting than models with higher values. Second, when available, RMSEA is a measure of model fit that accounts for the number of parameters in the model. RMSEA values less than 0.05 indicate good-fitting models, and values greater than 0.10 indicate poor-fitting models (Brown, 2006; Kenny, 2014). Third, SRMR evaluates differences between model-implied correlations and the correlations observed in the data (Brown, 2006). SRMR values less than 0.08 indicate good fit, and 0.00 indicates perfect fit.
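For clarity, the helper functions below sketch the standard textbook formulas behind the latter two indices. They are assumptions about the computation rather than the exact routines Mplus uses, and the numbers in the example call are purely illustrative.

```python
# Common formulas for RMSEA and SRMR, sketched for reference.
import numpy as np

def rmsea(chi_square, df, n):
    """Root mean square error of approximation from a chi-square test of
    model fit with df degrees of freedom and sample size n."""
    return float(np.sqrt(max(chi_square - df, 0.0) / (df * (n - 1))))

def srmr(observed_corr, implied_corr):
    """Standardized root mean square residual: root mean square of the
    differences between observed and model-implied correlations (lower
    triangle, including the diagonal)."""
    idx = np.tril_indices(observed_corr.shape[0])
    resid = observed_corr[idx] - implied_corr[idx]
    return float(np.sqrt(np.mean(resid ** 2)))

# Purely illustrative values, not results from the current study
print(rmsea(chi_square=150.0, df=89, n=70))
```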

Overview
Across all four error patterns, a two-factor solution was generally optimal and the observed latent factors were largely consistent (see RQ4). The first and most dominant factor tended to include all or most of the eight writing quality traits, along with, to a lesser extent, ratings of the student authors' intellectual ability (e.g., intelligence, knowledge, or creativity). Thus, in answer to RQ1, students generally made holistic judgments about writing quality rather than distinct judgments along one or more writing traits. However, in answer to RQ3, it appears that this construct also incorporated aspects of author evaluations. For this reason, this factor might be labeled Writing Quality and Skill: a holistic evaluation of the quality of the essay that also reflects authors' writing ability and mental resources. The second and less dominant factor consistently included interpersonal student author traits, perhaps indicative of authors' personality (e.g., generosity, kindness, or loyalty), and sometimes included traits related to effort or conscientiousness. In answer to RQ2, it appeared that students made holistic judgments about authors' interpersonal characteristics rather than nuanced evaluations of individual qualities. Overall, this second factor might be labeled Author Personality.
Writing Quality and Skill was consistently and positively correlated with Author Personality. Thus, although they may be separate judgments, they do influence each other. Poor quality writing reflects negatively on the author as a person; writers who seem less kind or generous may inspire readers to be much more critical. With regard to RQ4, consistency across error patterns suggested that the presence of different errors did not dramatically change how students rated writing or authors, although semantic and rhetorical errors may inspire broader judgments of the authors' personality. That is, student raters may view such errors as more 'revealing' about the author than are spelling or grammatical errors. This subtle pattern is discussed further in results for each error pattern.

Writing and Author Ratings of 'No Errors' Essays
For ratings of the No Errors essays, metrics of sampling adequacy indicated that it was appropriate to proceed with EFA. The KMO statistic was 0.83 ('very good'), and 13 of the 16 variables exhibited communalities greater than or equal to 0.60, only slightly below the criterion suggested by MacCallum et al. (1999).
Analyses considered models from one to six latent factors. All models converged except the six-factor model, which was omitted (see Table 5). After considering all criteria, a two-factor model was selected as the optimal and most parsimonious model. Kaiser's criterion and the scree plot indicated a three-factor model, but the parallel analysis favored a two-factor model. Inspection of the factor loadings also favored a two-factor model. All models yielded a significant chi-square statistic, which indicated possible model misspecification. RMSEA was below the 0.10 threshold only for the five-factor model, and SRMR was acceptable for the four-factor model (0.04) and the five-factor model (0.03).
The pattern matrix for the two-factor solution is presented in Table 6. Nine variables loaded on the first factor with values of 0.60 or higher, and three variables meeting this criterion loaded on the second factor (i.e., slightly below the threshold of four or more variables per factor; see Beavers et al., 2013; Guadagnoli & Velicer, 1988). The first and second factors explained 47% and 12% of the shared variance, respectively.
The first factor comprised all eight writing quality ratings and one author characteristic rating (i.e., Intelligence). A second author characteristic (i.e., Knowledgeability) was just below the threshold for inclusion (loading = 0.58). This factor suggests two findings. First, raters were generally making a holistic evaluation of writing rather than distinct judgments of conventions, organization, fluency, and so on, perhaps indicating a halo effect (e.g., Engelhard, 1994; Knoch, Read, & Randow, 2007; and see Gansle, VanDerHeyden, Noell, Resetar, & Williams, 2006). Second, writers' intellectual abilities (i.e., intelligence and perhaps knowledgeability) seemed also to be embedded in judgments of writing quality.
The second factor comprised three author characteristics (i.e., Generosity, Kindness, and Loyalty), all of which were interpersonal rather than intellectual. No writing quality ratings loaded on this factor. This pattern suggests that judgments of authors' personality were distinct from judgments of writing or intellectual ability when evaluating texts without lower-or higher-level errors. However, the two factors were moderately correlated, r = .46, p < .05, suggesting that although writing and personality judgments were separable, they did influence one another. Writing perceived as lower quality may have led to harsher judgments of authors' generosity, kindness, and loyalty. Conversely, perhaps perceptions of the author as unkind or disloyal (e.g., stemming from reactions to essay content) led raters to be more critical of the writing. Although a correlation does not permit a clear determination, the former interpretation seems more likely.

Writing and Author Ratings for 'Low-Level Errors Only' Essays
Sampling metrics indicated that it was appropriate to proceed with EFA. The KMO statistic was .93 ('excellent'), and initial communalities of all variables were greater than or equal to .60.
Models with one to six factors were considered. All models except the six-factor model converged; the six-factor model was omitted (see Table 5). After considering the model fit criteria, a two-factor model was selected as optimal. Specifically, Kaiser's criterion and the scree plot indicated a two-factor model, although the parallel analysis favored a one-factor model. Inspection of the factor loadings also favored a one- or two-factor model. All five models yielded statistically significant chi-square values, indicating some model misspecification. The RMSEA dropped below the 0.10 threshold for the four-factor model. The SRMR became acceptable starting with the two-factor model and decreased with each successive model. The other models had an insufficient number of appreciable loadings on one or more factors to warrant retaining those factors.
The pattern matrix for the two-factor solution is presented in Table 6. The first factor had nine loadings of 0.60 or higher, and the second factor had three loadings that met this criterion (i.e., slightly below the threshold of four or more variables per factor). The first and second factors explained 64% and 8% of the shared variance among observed variables, respectively. The first factor comprised seven writing quality ratings (i.e., all traits except Conventions) and two author characteristics (i.e., Intelligence and Knowledgeability). One additional author characteristic was slightly below threshold (i.e., Creativity, loading = 0.56). The second factor comprised three interpersonal author characteristics (i.e., Generosity, Kindness, and Loyalty).
Overall, these findings largely replicate patterns from the No Errors essays. When evaluating texts with low-level errors only, raters seemed to make a holistic judgment of writing quality that incorporated the authors' intellectual abilities, and seemed to make a separate judgment of authors' personality. The two factors were again moderately correlated, r = .64, p < .05, and the relationship was stronger than for the No Errors essays.

Writing and Author Ratings for 'High-Level Errors Only' Essays
Sampling metrics indicated that it was appropriate to proceed with EFA. The KMO statistic was .89 ('very good'), and initial communalities for 15 of the 16 variables were greater than or equal to .60.
Only the one- and two-factor models converged (see Table 5); models with three or more factors were thus omitted. After considering the model fit criteria, a two-factor model was selected as optimal. Kaiser's criterion, the scree plot, and the parallel analysis all favored a two-factor model. Both models yielded a significant chi-square statistic, indicating some model misspecification. The RMSEA for the one-factor model was 0.15, indicating a poor-fitting model. The RMSEA of the two-factor model yielded a minimally acceptable value of 0.10. The SRMR for the two-factor solution was 0.04, indicating adequate model fit.
The pattern matrix for the two-factor solution is given in Table 6. The first factor had seven loadings of 0.60 or higher, and the second factor had six loadings that met this criterion. The first and second factors explained 62% and 10% of the shared variance among observed variables, respectively. The first factor comprised seven writing traits (i.e., all except Conventions) and no author characteristics, although Intelligence was just below threshold (loading = 0.56). The second factor comprised six author traits (i.e., Generosity, Hard-working, Kindness, Knowledgeability, Loyalty, and Thoughtfulness), and Creativity was just below threshold (loading = 0.59).

Note (Table 6). N = 70. Factors were rotated using oblimin rotation. Appreciable factor loadings (≥ 0.60) are indicated in bold. * p < .05.
These patterns both corroborate and diverge from the prior findings. First, raters again seemed to make broad writing quality evaluations and author personality judgments that were distinct but related (r = .64, p < .05), similar in magnitude to the Low-Level Errors Only essays. Although below the loading threshold, Intelligence loaded somewhat on the writing quality factor but not the author personality factor, again suggesting that perceived intellectual abilities play a role in writing judgments. However, the author personality factor included many more characteristics in this model. In addition to interpersonal traits like being generous or kind, this factor also included work ethic and thoughtfulness along with possible intellectual traits.
The critical difference between this model and prior models was the presence of higher-level writing errors, such as disorganization, missing arguments, and illogical examples. Compared to essays exhibiting no errors or only lower-level errors (e.g., spelling and grammar), raters seemed to make more sweeping evaluations of the authors themselves.

Writing and Author Ratings for 'All Errors' Essays
Sampling metrics indicated that it was appropriate to proceed with EFA. The KMO statistic was .92 ('excellent'), and initial communalities of all variables were greater than or equal to .60.
Models for the one- through four-factor solutions converged, but the five-factor and six-factor solutions did not and were omitted (see Table 5). Based on review of the model fit metrics, the two-factor solution was retained as the optimal model. Kaiser's criterion and the scree plot recommended the two-factor solution, although the parallel analysis pointed to a one-factor solution. Inspection of factor loadings revealed that the three-factor solution had only one appreciable loading (i.e., a loading ≥ 0.60) on the third factor, and the four-factor solution had no appreciable loadings on the fourth factor. All models yielded a significant chi-square statistic, indicating some model misspecification. RMSEA values presented similar information: in all cases, the RMSEA exceeded 0.10, although the lower bound of the 95% confidence interval for the two-factor solution dropped below this threshold. The SRMR for the two-factor solution was 0.04, indicating adequate fit.
The pattern matrix for the two-factor solution is presented in Table 6. The first factor had eight loadings of 0.60 or higher, and the second factor had four loadings that met this criterion. The first and second factors explained 66% and 8%, respectively, of the shared variance among the observed variables. The first factor comprised all eight writing quality traits and no author characteristic ratings, although Knowledgeability was just below threshold (loading = 0.57). The second factor comprised four author characteristic ratings (i.e., Generosity, Kindness, Loyalty, and Thoughtfulness), with two others near the threshold (i.e., Creativity, loading = 0.56; and Intelligence, loading = 0.55).
The pattern for All Errors essays was most similar to the model for High-Level Errors Only essays, although the model again corroborated prior results. Raters seemed to make a holistic judgment of writing quality that included elements of intellectual ability, and distinct but strongly related judgments of authors' personality (r = .75, p < .05). Essays that exhibited both kinds of errors also demonstrated the strongest correlation between factors. As above, the presence of higher-level writing errors seemed to result in somewhat broader evaluations of the authors-not just interpersonal traits (e.g., loyalty and kindness) but also effort and intellectual traits.

Discussion
For better or worse, writers are judged by their audience, and these evaluations encompass both the text and the authors themselves (Cox et al., 2017; Figueredo & Varnhagen, 2005; Johnson et al., 2017; Vignovic & Thompson, 2010). Moreover, such assessments are consequential. Outside of school, employers may make hiring decisions based on writing skills or personal characteristics 'revealed' through one's writing (Hoover, 2013); teachers might make judgments about students' capabilities or conduct (e.g., Johnson & VanBrackle, 2012); and individuals may even make decisions about potential roommates (Boland & Queen, 2016). Although readers' perceptions of writing errors can be inaccurate, biased, and unfair, the impact of those perceptions cannot be disregarded.
The current paper follows on prior research (Johnson et al., 2017) to further investigate relationships between college students' judgments of writing and author as a function of perceived writing errors. Unanswered questions pertained to whether evaluations of writing (RQ1) and author (RQ2) represented holistic or nuanced judgments, the degree of overlap between writing and author assessment constructs (RQ3), and the influence of error patterns on observed latent constructs (RQ4).
Current results suggest that students tended to form holistic impressions of writing quality and authors rather than distinct judgments about individual traits. Analyses consistently generated models wherein most variables loaded on relatively few factors that explained over half of the variance (59-74%). The first factor included all or most writing traits, and the second factor comprised multiple author characteristics. Student raters appeared to make holistic rather than nuanced judgments. In practice, college students did not make distinct evaluations of conventions, persuasiveness, word choice, and so on. All of these variables were likely considered together to form a collective evaluation, or a subset of traits dominated the perception of all others.
Results also suggest that there is subtle overlap in judgments of writing and author. In particular, the dominant construct in all models, tentatively labeled Writing Quality and Skill, tended to incorporate all or most writing traits along with several 'intellectual' author traits. Assessments of whether an essay was well written were conflated with perceiving authors as smart or informed. A second construct, Author Personality, seemed to focus on writers' perceived 'interpersonal' traits (e.g., generosity, kindness, and loyalty) with little to no contribution from writing quality. In contrast to intellectual abilities, interpersonal qualities seemed to be judged separately from writing quality. In all cases, however, these two constructs were significantly and positively correlated (rs ranged from .46 to .75). Thus, although the two judgments are separable, they very likely influence each other. In accord with prior research, higher writing quality may lead students to perceive their peers as more giving, friendly, or trustworthy, and favorable beliefs or biases toward the author based on the content of their essay may lead to more forgiving review of the essay.
A subtle influence of error pattern was perhaps observed on latent judgments of writing and author. The presence of higher-level writing errors seemed to trigger more sweeping (or less focused) judgments of author personality and to attenuate the connection between writing quality and intellectual traits. This effect was most noticeable for essays that exhibited only higher-level errors of disorganization, missing arguments, and illogical evidence. In this case, the Author Personality construct comprised Generosity, Hard-working, Kindness, Knowledgeability, Loyalty, and Thoughtfulness, whereas for other error patterns the contributing variables focused on interpersonal traits. One possibility suggested by the data is that higher-level errors were not penalized to the same degree as lower-level errors. When rating such essays, higher-level errors may have been perceptible but less salient, which also reduced the salience of personality judgments. Consequently, student raters provided more global or vague evaluations of the author.
There were several limitations to the current study that should be addressed in future research. First, the sample size was relatively small for conventional EFA. Although the within-subjects design was strong for assessing the effects of different error patterns on student raters' perceptions (i.e., the purpose of the original study; Johnson et al., 2017), a sample of 70 participants was somewhat low for conducting EFAs.
Multiple checks of data adequacy for EFA were implemented, including the KMO statistic, number of variables with communalities greater than .60, and number of factor loadings per factor. Importantly, for all analyses, these checks indicated that the sample was acceptable. Nonetheless, it would be worthwhile for future modeling to build on current findings using larger samples. Future studies also represent an opportunity to recruit more diverse samples or introduce other manipulations (see below) to further explore perceptions of errors, writing quality, and authors.
Other limitations pertained to the essay stimuli. First, for consistency, all essays were argument essays constructed in response to the same prompt about 'patience.' This design did not afford testing for prompt-based effects, such as whether certain topics draw students' attention to technical aspects of writing quality or to characteristics of the authors. Similarly, this design does not permit exploration of genre effects. For example, an argument essay about the value of patience is likely to seem more personal in nature than an expository text about a scientific phenomenon. Thus, judgments about authors' personality may have been more salient in this study. In future work, it will be useful to manipulate the genre of the essay stimuli (e.g., argument, expository, or narrative) along with the presence or absence of self-references (e.g., first-person pronouns and anecdotes).
It is worth noting, however, that prior research on perceptions of writing and authors has been conducted with a variety of writing types, ranging from formal classroom assignments to informal emails.

Implications for Peer Assessment
Given that evaluations of writing, errors, and authors can be conflated, one set of implications for peer assessment of writing pertains to how these effects might be mitigated or under what conditions they are exacerbated. Students are often skeptical of peer assessment and express doubts that their peers are capable of performing reliable and valid assessments, particularly when course grades are at stake (Gielen, Peeters, Dochy, Onghena, & Struyven, 2010; Kaufman & Schunn, 2011; van Zundert, Sluijsmans, & van Merriënboer, 2010). Students may be worried that their peers are forming personal or intellectual judgments and then biasing their reviews and scores based on these judgments. The results of the current study suggest that this is a plausible concern. If students recognize that they are making such judgments about other students, a reasonable interpersonal inference is that their peers are doing the same (see Panadero, 2016; van Gennip, Segers, & Tillema, 2009).
Masked review policies are often instituted to avoid possible bias or unfairness in peer assessment (e.g., Kaufman & Schunn, 2011; Panadero & Alqassab, 2019) and scholarly publishing (e.g., Lee, Sugimoto, Zhang, & Cronin, 2013). In principle, by obscuring the identities or backgrounds of their peers (e.g., name, race, gender, or nationality), student assessors cannot use this information to offer biased assessments or interpretations. However, the essay rating task employed in this study was effectively blind (student raters were given no information about the supposed authors), yet text and author judgments were still overlapping or strongly correlated. Thus, 'blind review' did not solve the problem of making personal judgments about authors based on perceived writing errors.
A more direct approach may be to provide additional training to students that counteracts unwarranted inferences about their peers (e.g., Goodwin, 2016; May, 2008; Soltero-González, Escamilla, & Hopewell, 2012) and improves writing assessment literacy (see Crusan, Plakans, & Gebril, 2016; Weigle, 2007). Traditionally, expert raters are trained and assessed based on inter-rater agreement (e.g., Huot, 1990; Jonsson & Svingby, 2007), yet such agreement does not guarantee a lack of bias. Instead of disregarding inferences about author characteristics and errors, high agreement could simply indicate that raters are making similar interpretations (e.g., conflating writing skill and intellectual ability). Thus, training that explicitly tackles correspondence bias or other social-perceptual biases (e.g., May, 2008) may be particularly beneficial for student raters who lack writing knowledge or proficiency. For instance, training focused on perspective-taking (i.e., considering the perspectives of others in terms of point of view, location, or time) has been shown to reduce the occurrence of the fundamental attribution error among adults (e.g., Hooper, Erdogan, Keen, Lawton, & McHugh, 2015). One avenue for future research may thus be to incorporate perspective-taking exercises into peer assessment training. Students can be taught to be more mindful of how written products may not reflect the true circumstances or identity of the writer (e.g., frequent typos may reflect writing under high time pressure rather than 'laziness' or 'lack of intelligence').
In addition, research on rubrics has shown that they can improve the validity and reliability of peer (and self) assessment (Jonsson & Svingby, 2007; Panadero, Romero, & Strijbos, 2013; Panadero & Jonsson, 2013). In a recent meta-analysis, students who were trained to provide ratings demonstrated greater learning gains than those who completed peer assessment without training (Li et al., 2019). Notably, the current study did not employ detailed assessment rubrics or rubric-referenced training. Participants were provided with brief descriptions of eight writing traits and eight author traits (see Table 1), but were not given detailed criteria or benchmark examples. This method was implemented because the aim was to assess perceptions of errors rather than adherence to a rubric or checklist. However, although the traits and terms were fairly straightforward, participants likely possessed differential understanding of the concepts (e.g., epistemological beliefs about 'intelligence' and 'knowledge' or personal experiences with 'generosity' and 'kindness'). In future research, a plausible hypothesis is that rubric-based training would result in more distinct trait judgments: the underlying factor structure might exhibit a larger number of latent assessment constructs rather than a few holistic constructs. It is unclear whether such training would reduce, exacerbate, or have no effect on the occurrence of personal author judgments or the connections between writing and author evaluations. To further explore these outcomes, rubric-based approaches might be enhanced via concrete strategies for avoiding personal judgments about authors when assessing writing or writing errors. Rubrics and exemplars could not only clarify the meaning of 'conventions,' 'sentence fluency,' 'loyalty,' or 'creativity,' but also establish criteria for when judgments about such traits are (or are not) warranted.

Conclusion
Writing skills are critical for success in academic, professional, and social settings. Although a great deal of attention is paid to teaching writing and evaluating writing products in reliable and valid ways, current research suggests that focus should also be directed to underlying relationships among perceptions of writing and writers. Moderate to strong links were observed between ratings of 'writing quality and skill' and 'author personality,' and these relationships were strengthened in the presence of perceived writing errors. It makes sense that writing errors could or should have a valid impact on writing quality judgments. However, the spillover onto perceptions of authors' personal characteristics may be representative of latent biases, perhaps stemming from differences in education, identity, culture, and so on. As the stakes for writing performance increase, it is important for assessors and policymakers to take steps to recognize and mitigate these effects.