Additionally, while there are several studies analyzing traditional letters of recommendation for language variation between genders, there is a gap in the current literature analyzing standardized letters of recommendation. Previously, our research team published a study in Academic Emergency Medicine Education and Training that showed minimal differences in language use between genders in 237 SLOEs from applicants invited to interview at a single academic EM residency for the 2015-2016 application cycle. The small dataset, and the potential for a homogeneous sample, prompted the current investigation with the goal of confirming or refuting the original results in a larger dataset. The choice to include all applicants was made to potentially increase the variability in the language used within the SLOE. The aim of this study was to compare differences in language within specific word categories used to describe men and women applicants in the SLOE narrative for all applicants to a single academic EM residency program for the 2016-2017 application cycle. We secondarily sought to determine whether there was an association between word category differences and invitation to interview, regardless of gender, in order to better contextualize the possible importance of wording differences.

We used descriptive statistics to report the applicants’ characteristics and assessed for differences in baseline characteristics by gender using t-tests and chi-squared tests, as appropriate. Median word counts for the identified 16 categories of interest were reported. For the primary outcome of interest, we assessed differences by gender in word counts after adjusting for letter length using Mann-Whitney U tests. In secondary analysis, the analyses were repeated for differences in word categories by invitation to interview.
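The length-adjusted Mann-Whitney comparison described above can be sketched in a few lines. This is a minimal, stdlib-Python illustration rather than the Stata analysis actually used in the study; the per-letter counts and letter lengths below are hypothetical, and the normal approximation shown omits the tie correction.

```python
from statistics import NormalDist

def rank_with_ties(values):
    """Assign 1-based ranks, averaging ranks across tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation
    (no tie correction). Returns (U statistic for x, p-value)."""
    r = rank_with_ties(list(x) + list(y))
    n1, n2 = len(x), len(y)
    u1 = sum(r[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u1 - mean_u) / sd_u
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, p

# Hypothetical data: (category word count, total letter word count) per SLOE.
women = [(4, 210), (2, 180), (5, 240), (3, 200)]
men = [(1, 190), (3, 230), (2, 210), (1, 170)]

# Adjust for letter length by converting counts to rates per 100 words,
# then compare the rate distributions by gender.
women_rates = [100 * c / n for c, n in women]
men_rates = [100 * c / n for c, n in men]
u, p = mann_whitney_u(women_rates, men_rates)
```

With a real sample the exact permutation distribution or tie-corrected variance would be preferable; the normal approximation here is only adequate for larger groups.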
We used multivariable logistic regression to identify word categories associated with receiving an invitation to interview. Covariates in this model were selected via a predetermined inclusion threshold of α = 0.10. We performed all analyses using Stata 13.1. Additionally, for any of the seven user-defined word categories in which a difference was noted, a further analysis was conducted evaluating the use of each individual word in the dictionary to assess whether the difference for the category was driven by the use of a single word or by the use of multiple descriptors within the category. For this analysis, the proportion of SLOEs containing each word was compared by gender using Fisher’s exact test. This analysis was not conducted for differences in the LIWC-defined categories due to the size of those word dictionaries.

This analysis found small but quantifiable differences in word frequency between genders in the language used in the SLOE. In this study, differences between genders were present in two categories: social words and ability words, with women having higher word frequency in both. Our prior investigation found differences of similar magnitude in affiliation words and ability words, with letters for women applicants again having higher word frequency in both. In both studies the differences in word frequency were statistically significant, but it is difficult to draw conclusions about the practical significance of these small wording differences for application or educational outcomes. What is perhaps more notable than the presence of differences in two categories is the lack of difference in the remaining 14 categories.
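The per-word follow-up with Fisher’s exact test described in the methods can also be illustrated with a short, self-contained sketch. Again this is stdlib Python rather than the Stata used in the study, and the 2x2 counts below are entirely hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def pmf(k):
        # Probability that the top-left cell equals k, given fixed margins.
        return comb(r1, k) * comb(r2, c1 - k) / comb(n, c1)

    p_obs = pmf(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Small tolerance so tables exactly as likely as the observed one count.
    return sum(pmf(k) for k in range(lo, hi + 1) if pmf(k) <= p_obs * (1 + 1e-9))

# Hypothetical example: a given descriptor appears in 12 of 300 letters
# for women and 4 of 400 letters for men.
p = fisher_exact_two_sided(12, 288, 4, 396)
```

Repeating such a test for every word in a category is exactly why it was restricted to the seven small user-defined dictionaries rather than the large LIWC ones.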
When looking specifically at the categories that had gender differences, our finding that ability words were used more frequently to describe women applicants than men applicants is in contrast to previous studies, while our other finding, that women are more frequently described with social words than men, is in alignment with previous studies. In the medical literature, letters of recommendation for men applying for faculty positions contain more ability attributes, such as standout adjectives and research descriptors, than letters for women, and women in medical school applying for residency positions are more frequently described with non-ability attributes such as being caring, compassionate, empathetic, bright, and organized. Looking specifically at ability words, this word category showed statistically significant differences in both this investigation and our prior study, with ability words occurring more frequently for women than men.
Ability words include descriptors such as talented, skilled, brilliant, proficient, adept, intelligent, and competent. This consistency of findings between the two samples suggests that letter writers employ multiple descriptors within the ability category to convey the proficiency of women applicants. However, the reason for this difference is unclear. Notably, the word “bright” is one of the ability words for which no gender difference was found, counter to findings from prior research wherein women applicants were more often described as bright. While the descriptor “bright” is often considered a compliment, it has also been suggested that its use “subtly undermines the recipient of the praise in ways that pertain to youth and, often, gender,” stemming from its association with the phrase “bright young thing.”

The finding that women were more frequently described with social words aligns with previous studies of letters of recommendation. Studies of letters of recommendation for psychology and chemistry faculty positions have shown that women are often described as communal, while men are described as agentic and receive more standout adjectives. Other studies have found women to be described as more communicative.

We employed a secondary analysis with respect to the invitation to interview to determine whether small differences in word categories were associated with invitation to interview. The adjusted analysis showed an association between more standout words and invitation to interview; however, this analysis did not account for other factors that may influence invitations to interview. Although these findings represent an association and not causation, they help to contextualize the potential importance of small differences in word use, although this is not conclusive. Notably, neither social words nor ability words influenced the choice to interview, and there was an equitable frequency of standout words between genders.
Despite the small word differences in the categories of social and ability words, we did not find a difference in the 14 other word categories queried.
There are several possible explanations for this lack of a finding. It is possible that the sample was underpowered to detect small wording differences in the 14 word categories. The short word format of the SLOE and its specific, detailed instructions as noted above may reduce bias. Another explanation is the increasing use of group authorship, which may introduce less bias than individual authorship. In 2012, a sampling of three EM residencies calculated that 34.9% of SLORs were created by groups. In 2014, 60% of EM program directors participated in group SLORs, 85.3% of departments provided a group SLOR, and 84.7% of PDs preferred a group SLOR. Although the sample size and lack of a standard comparator limit our ability to determine why we did not find a difference for the majority of word categories, we hypothesize that it is related to the format and hope to further support that hypothesis through future work examining paired SLOE and full-length letters for candidates.

A recently published study by Friedman and colleagues in the otolaryngology literature is, to our knowledge, the only study in addition to our own that evaluates a standardized letter for gender bias. In this 2017 study, the SLOR and the more traditional NLOR in otolaryngology residency applications were compared by gender, concluding that the SLOR format reduced bias compared to the traditional NLOR format. Although some differences persisted in both letter formats, the SLOR format resulted in less frequent mention of women’s appearance and more frequent descriptions of women as “bright.” Although their analysis strategy differed from the one we used in this study, their findings parallel ours in that there are minimal differences by gender in a restricted letter format, and they highlight the need for further study of how the question stem and word limitations may be intentionally built to minimize bias.
Lastly, of note, our study focused specifically on differences in language use in the SLOE. This study does not evaluate the presence or absence of gender bias in the quantitative aspects of the SLOE, nor does our multivariable model include other factors that would influence the invitation to interview, such as rotation grades, test scores, school rank, or AOA status. Such analyses were beyond the scope of our study, which was focused on the SLOE narrative itself. Other studies have evaluated these factors but have not evaluated the narrative portion of the SLOE. Additionally, there remain many other forms of evaluation in medical training, numerical and narrative, beyond the SLOE, that have been analyzed for gender bias. Recent studies have suggested that bias persists in other forms of evaluation. Specifically, Dayal and colleagues’ recent publication notes lower scores for women residents in EM Milestones ratings compared to male peers as they progress through residency. Evaluations of narrative comments from shift evaluations are another area of interest, of which we are aware of two current investigations underway in EM programs. Additionally, a study by Heath and colleagues of evaluations of medical faculty by physician trainees also showed gender disparities. As this body of literature continues to grow and interventions are developed to minimize bias in all narrative performance evaluations, we believe it will be important to think carefully about the question stems and response length allowed. Unfortunately, limiting space may also limit the room for positive evaluation and strings of praising adjectives.
However, while implicit bias exists, employing limits in response format may rein in the manifestation of implicit bias by focusing the writer.

This was a single-center study; only SLOE narratives from applicants who applied to interview at a single academic EM residency program were included in the analysis, and applicants from non-LCME schools were excluded, limiting generalizability. The man-to-woman applicant ratio in this study reflects the national trend for the 2017 match, which may contribute to generalizability. ERAS does not allow an individual program to access SLOEs for applicants who have not selected that program; therefore, a full national sample of all applicants in a single year to ERAS was not feasible. Our analysis used the LIWC linguistic software and focused on individual words. Other approaches, such as qualitative content analysis, focusing on phrases, or searching for specific words (as was done by Friedman and colleagues in the study discussed above), may have yielded different findings. Additionally, the LIWC contains pre-established word lists. While these lists have been used in the medical literature, it is possible that a set of words more applicable to EM exists. Our analysis used word frequency as a measurement of biased language and did not evaluate the context of the words in the letters, limiting the study. Words in different contexts can have different meanings. For instance, the word “aggressive” can carry a positive or a negative connotation depending on context when describing an applicant as “aggressive in picking up patients” vs “aggressive with consultants.” A qualitative analysis of the SLOEs would better delineate the context of word phrases and provide a more in-depth analysis. Although it is a limitation that we did not evaluate word context, word frequency software applied to a large sample gives a generalizability that a small qualitative analysis may not be able to achieve.
In the rare instances of context misinterpretation for the positive and negative emotion categories, this may be of little overall consequence, as there is a large margin between the median positive and negative word counts within these categories. Additionally, subtle differences between phrases such as “we strongly recommend this student” vs “we will be recruiting this student” would not be picked up by the software. This was an exploratory study and as such was not powered to a specific outcome; however, we estimated that with our sample size of 822 we would have 80% power to detect a difference of 0.2 mean words within a single word category with a 5% type I error. Additionally, it is possible that the sample was underpowered to detect small wording differences among the 16 word categories, which could represent a type II error. The analysis for differences in 16 categories also raises the multiple comparisons problem.
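As a rough check on the power statement above, the minimum detectable difference for a two-sample comparison can be back-calculated with a normal approximation. This stdlib-Python sketch assumes an even gender split of the 822 letters and a within-group standard deviation of about one word, neither of which is stated in the text; it also shows the Bonferroni-corrected per-test threshold that the multiple comparisons concern implies.

```python
from statistics import NormalDist

def min_detectable_difference(n1, n2, sd, alpha=0.05, power=0.80):
    """Minimum detectable difference in means for a two-sample z-test
    with the given group sizes, common SD, type I error, and power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided type I error
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) * sd * (1 / n1 + 1 / n2) ** 0.5

# Assumed even split of 822 letters and an assumed within-group SD near
# one word; under those assumptions the detectable difference lands close
# to the 0.2 words reported.
d = min_detectable_difference(411, 411, sd=1.0)

# Bonferroni correction for the 16 category comparisons: each test would
# need to clear a much stricter per-test significance threshold.
alpha_per_test = 0.05 / 16  # 0.003125
```

The Mann-Whitney tests actually used are somewhat less efficient than the z-test assumed here, so this back-calculation is only an order-of-magnitude check, not a reproduction of the study's power estimate.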