Putting values in practice: checking citation equity in my own dissertation

a pile of artfully curled up papers

My file drawer is simultaneously less and more messy than this. Photo via Unsplash.

In recent years, there’s been more and more discussion on how there are large disparities in who gets cited and who doesn’t in large swathes of scientific literature. While some of the existing studies trying to quantify the exact scope of the problem have some serious methodological issues (which I’ll get into below), it’s clear that, on average, women and other gender minorities are cited less frequently, and scholars in the global south (yes, not a perfect term, but better than some of the other terms available) are less likely to be cited than their peers in western Europe, the US, Canada, Australia, and other high-income countries.

There are serious ramifications to this. Citations are key for researchers to establish their place in their field and to build their careers, and undercitation due to gender, racial or ethnic identity, language, and the intersections of these is one of the factors potentially pushing these people out of research. As people from majoritarian backgrounds are more likely to cite other majoritarians, those of us in these relative positions of privilege need to actively examine who we cite and seek to diversify the references we choose to use.

If you know me even in passing, you know that I care deeply about justice and equity in STEM. I try to do what I can to center the work of researchers from historically excluded groups. But it’s one thing to talk the talk: when it comes to my own work, am I living up to my own values? I’d like to think so, but that’s not enough. Let’s take a long, hard look at my own work in the form of my PhD dissertation, and see how I’m doing.

I cited a whopping 1,620 authors from 543 papers in my dissertation. I’m not sure how this really compares to similar dissertations in my field, but I know it’s likely on the high end: one of the pieces of feedback I get frequently is that I cite, like, so many papers, calm down. Part of this is the result of an industrial-strength case of imposter syndrome (look, please believe what I’m saying, here’s like twelve other people saying similar things!), but part of it is a conscious effort to cite researchers who are less likely to be recognized for their work, rather than defaulting to more well-known papers by people from more majoritarian backgrounds. The issue with that is that reviewers then ask “well, why didn’t you cite this more obvious paper everyone cites?”, so soon I have two or three (or, err, eight) citations where some people would stop at one. Is this a good or bad thing? Honestly, I don’t know. It can make the text harder to read, but recognizing the work of others seems more important. In any case, it’s what we’re working with here.

For each individual author, I tallied how many papers they were on as 1) first author, 2) co-author, and 3) total. I checked the list for duplicates (such as publishing under initials sometimes but full name other times, name changes, et cetera). I also recorded the country of affiliation listed on the paper at the time of publication, using the most frequent one for authors who changed affiliation across papers. This obviously isn’t perfect, but the goal was to look at which countries were over- and underrepresented.
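For anyone who wants to try the same kind of tally, here is a minimal Python sketch. The post doesn't say what tools were actually used, so everything here is an assumption: the function name `tally_authors`, the input layout (one byline per paper, as `(name, country)` pairs in author order), and the sample data are all hypothetical, and names are assumed to have been de-duplicated by hand already.

```python
from collections import Counter, defaultdict

# Hypothetical input: one entry per cited paper, authors in byline order,
# each as (name, country of affiliation listed on that paper).
papers = [
    [("A. Author", "US"), ("B. Writer", "Panama")],
    [("B. Writer", "Panama"), ("A. Author", "US"), ("C. Scholar", "Brazil")],
    [("A. Author", "Canada")],
    [("A. Author", "US"), ("C. Scholar", "Brazil")],
]

def tally_authors(papers):
    """Count first-author, co-author, and total papers for each author,
    and pick each author's most frequent affiliation country."""
    first = Counter()
    co = Counter()
    countries = defaultdict(Counter)
    for byline in papers:
        for position, (name, country) in enumerate(byline):
            if position == 0:
                first[name] += 1
            else:
                co[name] += 1
            countries[name][country] += 1
    summary = {}
    for name, seen in countries.items():
        summary[name] = {
            "first": first[name],
            "co": co[name],
            "total": first[name] + co[name],
            # Most frequent country; ties are resolved arbitrarily.
            "country": seen.most_common(1)[0][0],
        }
    return summary

summary = tally_authors(papers)
# List authors from most to least cited.
for name, row in sorted(summary.items(), key=lambda kv: -kv[1]["total"]):
    print(name, row)
```

The de-duplication step (initials versus full names, name changes) is deliberately left out, since that part needs a human checking each case.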

My dissertation focuses on the Neotropics, and the legacy of colonialism means that scientific infrastructure is often under-supported in these countries. Even when researchers from those countries are studying the organisms and ecosystems of their homelands, they may have to do so from universities abroad. Despite this, many institutions in these countries are doing amazing work, but it isn’t read as widely due to the high costs of publishing in high-impact journals, language barriers, or (often racist) assumptions that the science being done in these places isn’t as good. Failing to cite this work only perpetuates that cycle, as a recent preprint explained. Because of this, I was particularly focused on whether I’d given the work of Latinx authors, both those working in Latin America and those based abroad, the credit it deserves, or whether I’d unwittingly just done the same shitty things previous white ornithologists have done.

The next step was looking at the gender and racial disparities in who I cited. This is harder to figure out. Many papers that look at this do so by just inferring it based on names, but this is a seriously flawed approach. In the case of gender, not all names are obviously gendered, as all my friends named Alex, Jordan, and Robin can attest. Names also may be gendered differently in different cultures, or not scan as gendered to those outside of a specific culture. Finally, people who are transgender or gender-nonconforming (TGNC) may not be recognized as such by this approach. I publish under my full name, Jessica F. McLaughlin, which most people would categorize as female without a second thought, but that completely misses that I’m actually nonbinary. Inferring race or ethnicity by names can be similarly fraught, since many names are in wide use by people in many groups, and may be changed through marriage, adoption, and forced assimilation.

These two factors meant that there was no “easy” way to figure out these variables, and I instead had to manually search for each individual author. Yes, this took about as much coffee as you would assume. For some people it was easy: a scientist with a trans pride flag on their social media and a bio that states their identity is pretty straightforward. But people who are deceased or less active online are more difficult. I wish I could say I had some sort of foolproof trick, but all I can say here is I did my best, and any mistakes are fully my own. For gender, if there was no clear evidence that someone identified as other than what their gender expression in photos and online presence indicated, I recorded it as it appeared to me, with the knowledge that this likely missed people who are closeted or otherwise not open about a TGNC identity.

Racial and ethnic identity was a thornier issue, and I struggled with how to categorize it. Many people who pass as white actually identify otherwise. People can also hold multiple identities simultaneously, such as being Indigenous, Black, and Latinx. Humans are complicated, after all. Again, self-descriptions were the most surefire way, and were taken as the final word when they were available. In other cases, I again just did the best I could. This means that unfortunately my data likely undercounts people with multiple racial and ethnic backgrounds and infers them to be a part of majoritarian groups.

And what should those categories be anyway? Some commonly used categories lump together a huge number of people who face very different challenges. Think of how “Asian” is often used to mean anyone from China, India, Iran, or Lebanon, with Pacific Islanders often lumped in for good measure. Similarly, people from the Middle East and North Africa are often classified as white, completely erasing the particular prejudice they face. On the other hand, splitting into more categories makes it more likely that I’ll screw up. In the end, I settled on somewhat broad categories based on the many, many discussions I’d read and listened to. I separated south Asia (India, Pakistan, Sri Lanka, etc.), east Asia (China, Japan, SE Asia), and Pacific Islanders into their own categories to try to capture some of the complexity that gets erased by aggregating those very distinct regions together. I considered Middle Eastern, North African, and Central Asian authors together, despite the oversimplification inherent in that, because the additional discrimination of Islamophobia (whether or not the person in question is actually Muslim, bigots not usually letting facts get in the way) is likely to affect them.

The final challenging category was one particularly pertinent to this case— Latinx authors. Usually this is handled as a separate question from race— think of the additional checkbox usually seen on forms— but I wasn’t sure if that was the best approach here. While there are huge issues within Latinx communities of colorism and racism, I’m a random white person, and trying to figure out exactly how someone identified within that seemed way outside my lane. In general, if someone identified as both Latinx and some other identity— Black or Indigenous being the most common— I logged them as both, but if I couldn’t tell or had no additional information, I used just Latinx. While again this is imperfect, one of my key questions was whether I was citing authors from Latin America— the places my research focuses on— and this seemed like the best way to approach that. Is it the right way? Likely not, but I hope it’s the least harmful way.
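Logging someone under more than one identity has a bookkeeping consequence: category counts no longer add up to the number of authors. A small Python sketch of that kind of multi-label count follows. The record layout, the function name `count_identities`, and the sample data are hypothetical illustrations, not the actual spreadsheet behind this post.

```python
from collections import Counter

# Hypothetical records: each author maps to the set of identities logged
# for them. Authors with more than one identity appear in several categories.
author_identities = {
    "Author 1": {"Latinx", "Indigenous"},
    "Author 2": {"Latinx"},
    "Author 3": {"White"},
    "Author 4": {"Latinx", "Black"},
    "Author 5": {"Unknown"},
}

def count_identities(records):
    """Count how many authors were logged under each identity category."""
    counts = Counter()
    for identities in records.values():
        counts.update(identities)
    return counts

counts = count_identities(author_identities)
n_authors = len(author_identities)
n_labels = sum(counts.values())

# The label total can exceed the number of authors, so any percentages
# should be computed per category against n_authors, not against n_labels.
print(counts["Latinx"], n_authors, n_labels)
```

Using sets for each author keeps a person from being counted twice in the same category, while still letting them count once in each category they hold.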

So what did I find?

The breakdown of the ten authors who I cited at least ten times: gender, race/ethnicity, country of affiliation at time of publication, number of first authored papers, number of non-first-author papers, and total number.

Let’s start with who I cited the most. There were 10 authors I cited ten or more times. Unsurprisingly, my MSc advisor and PhD advisor were both in this list, at number 1 (18 citations) and tied for third (12) respectively. Two other patterns immediately stand out though: of these ten, 6 are affiliated primarily with a US institution, 2 UK, 1 Canada… and only 1 outside of these three countries, Argentina. Similarly, 9 of these ten are (best as I could tell) white, and all ten are men.

Okay, so not off to what I’d call a great start, per se. All of these guys have done really important work— I mean, obviously, given that I’ve worked directly with two of them and cite the hell out of the others— but how far do we have to go to hit a non-dude?

Same as above for columns, expanded out. Row 28 is me.

Turns out we have to scroll down to the ninth-place spot, where we finally hit… me, citing myself, and accounting for exactly one-third of the nonbinary authors on the list. We also finally hit our first woman at this point, as both of us have 6 citations. For better or worse, the authors my own work is most in conversation with, the researchers whose work is in many ways closest to mine, are primarily white dudes.

But let’s step back now and look at the list as a whole. The most cited researchers are going to tend to be more established people with long lists of publications— the part of the field where representation is particularly lacking. What about the other 1600 or so people I cite?

Country of affiliation at time of publication for all authors, organized by continent/region with the colors of the slices corresponding with the map colors. Darker colors indicate more authors affiliated with institutions based in that country.

Affiliation country:
I cited authors from 58 countries. The largest single group was, likely unsurprisingly, from the US, with 645 authors. The UK, Canada, Sweden, Germany, and China all had more than 50 authors. Almost half of the authors were from North America, Central America, and the Caribbean, driven largely by the preponderance of US authors. Europe unsurprisingly follows, although authors were somewhat more evenly distributed across countries— the UK was an outlier at 157 authors, but many of the countries included had more in the 20-60 author range. Oceania’s presence is almost entirely driven by Australian institutions, and Asia’s is dominated by China and Japan (the latter of which is only as high as it is due to some of the software I use, and is driven by just a few papers with many authors).

I was particularly trying to cite authors from Latin America, since that’s where the systems I study are located and these authors are less frequently cited even when publishing in the same journals as those in the Global North. I cited 153 authors based in South America, with an additional 73 in Central America and the Caribbean. This seems like a lot at first glance, but together that’s only about 14% of the 1,620 cited authors, and South America alone is under 10%. I’m not sure how it compares with the averages in my specific subfield (evolutionary biology of Neotropical birds), but I was honestly expecting it to be higher given that I consciously made an effort to seek out papers by these authors. This seems like a pretty striking example of just how dominant the Global North is in the literature of many fields.
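As a quick check on those shares, the arithmetic can be worked through in a few lines of Python. The counts (153, 73, and the 1,620 total authors) come from this post; the variable and function names are just for illustration.

```python
TOTAL_AUTHORS = 1620  # distinct authors cited, as reported in the post

# Author counts by region of affiliation, as reported in the post.
region_counts = {
    "South America": 153,
    "Central America and the Caribbean": 73,
}

def share(count, total=TOTAL_AUTHORS):
    """Percentage of all cited authors, rounded to one decimal place."""
    return round(100 * count / total, 1)

shares = {region: share(n) for region, n in region_counts.items()}
combined = share(sum(region_counts.values()))

for region, pct in shares.items():
    print(f"{region}: {pct}%")
print(f"Combined: {combined}%")
```

South America alone comes to roughly 9.4% of cited authors, and the two regions together to about 14%.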

I’m very troubled by how few authors I cite from Africa and Asia— while it would be easy to write that off as because I’m working on Neotropical systems, and there’s not going to be as much literature on that from elsewhere, that’s letting me off the hook too easily. I only cite 3 authors from Africa even though it has the largest amount of tropical landmass in the world? That’s bias in action. It would also be easy to say that perhaps the African authors I cite are simply based elsewhere, especially Europe, but the data I collected on racial and ethnic diversity doesn’t back that up (see below). Similarly, there’s excellent work on tropical birds in the tropical regions of southern Asia, often from local authors, yet I only cite 3 authors based in India and Malaysia, 2 from Indonesia, and 1 each from the Philippines and Taiwan, and cite no authors based in Sri Lanka, Nepal, or any of continental South East Asia. What insights into tropical evolution am I missing by not including this literature? This seems like a particularly striking example of how even though to me personally equity is in itself the main driver for why we need to change our citation practices, the science we’re doing is also shaped by these choices, consciously or unconsciously.

Another easy explanation to fall back on is the language of publication: that the underrepresented regions are publishing in languages besides English. But, again, that’s not, in my mind, something that lets us off the hook. First of all, only needing to speak one language and doing science solely in one’s mother language is a privileged state of affairs, and a pretty US-centric viewpoint. I read Spanish well enough to get through a scientific paper, and that plus my years of Latin in high school means that I can make it through papers in French and Portuguese as long as I can Google things. Even my German is decent enough for it, since scientific vocabulary in all of these is so largely Latin in origin that I can manage (just don’t try to actually have a conversation with me in any of them, colloquial speech being a whole different ballgame). Failing those, we live in an era of relatively easy translation, through both automatic translation and the easy global connections enabled by social media. Despite this, I cite only 2 papers in German and one in Portuguese, which is frankly embarrassing, especially considering that I don’t cite anything written in the primary language spoken where my study systems are. Language is definitely a contributing factor, but it’s not an excusing one. It’s a known driver of citation bias, and one I should have done better on.

Chart of author gender, with percentage of total and number of each. “M” and “F” are self-explanatory. “NB” indicates nonbinary or other gender identity. “U” is unknown— usually couldn’t find records of the author, or found conflicting records that may or may not be the same person. “NA” indicates authorships by institutions or consortia.

Gender:

I was able to find or infer to some level of certainty the gender of all but 39 authors. An additional four were groups such as research institutes or consortia, for which this was not applicable. I had planned to categorize binary trans authors as their appropriate gender but with an additional indicator— not because a trans woman author isn’t as much a woman as a cis woman, but because I wanted to track representation of TGNC researchers in particular. However, I couldn’t find that I had cited any such researchers. It’s possible some of the uncategorized researchers were people who transitioned at some point after publication, leading to me not being able to track them, but regardless, this seems to reflect the overall exclusion of TGNC researchers in STEM. Ultimately, as far as I am aware, I cited a grand total of 3 TGNC authors, all of whom are nonbinary and one of whom is myself.

But even in terms of presumably cis authors, I’m not thrilled with how I did. Over 1,100 of the authors I cite are men, 73.3% of the total. To put that in perspective, 53% of the biology doctorates earned in the US in 2021 were by women. Obviously, that’s just one career stage, whereas the authors I cite include everyone from undergraduates through professors emeriti, but citation inequity has been brought up as one of the many reasons that attrition rates for non-men are higher through the course of an academic career. I know and work with so many amazing women specifically working in avian evolutionary biology, and yet they’re not even a quarter of the people I cite.

Breakdown of the race/ethnicity of cited authors. Note that as some authors hold multiple identities, the sum of the data on this one is greater than the 1,620 people cited. Groupings as described above, with NA further representing organizational citations and unknown for those I couldn’t find info on or wasn’t sure on.

Race/ethnicity:

When it comes to race and ethnicity, my results are still disappointingly in line with broader trends. The majority of the authors I cite are white. I do seem to have cited a relatively large proportion of Latinx authors compared with other recent evolutionary biology papers, but it still seems low to me considering, once again, literally all of my dissertation work was on Panama and the Andes.

The most startling result to me was the lack of Black and Indigenous authors. As far as I was able to tell, only 9 of the authors I cite are Black, and 2 are Indigenous. It would be easy to say that maybe these folks are overrepresented among the authors I was unsuccessful in finding information on, or were more likely to be miscategorized, especially those who are also Latinx. Maybe that’s true, but frankly, that lets me off the hook way too easily. If we’re truly invested in working towards greater equity and in centering researchers from historically excluded groups, they can’t be rounding errors. I dropped the ball on this one, and I didn’t even realize how poorly I’d done until I started collecting the data.


There are some potentially big caveats with everything I’ve outlined above. First, I couldn’t find information on everyone. This particularly affects authors who were only on a single paper, often undergrads or people who left research (who are more likely to be part of underrepresented groups). This was particularly compounded for authors for whom the available information was not in English or Spanish. While I could sometimes piece things together and confirm I had the right person in these languages, it was far more difficult to do this for languages I don’t speak. Second, even for those I did find, I almost certainly made errors in recording, especially for TGNC people who present as a cis binary gender and for people with multiple ethnic and/or racial identities.

There are also inherent issues with how much variation there is within single racial and ethnic categories, and how bias and discrimination within those broad categories are missed by the approach I took. This was especially the case for Latinx authors, as I discussed above, but even within other categories, there’s complexity that this sort of approach misses. Chinese and Vietnamese authors, for example, would both be included in the same grouping, but there are likely huge differences in experiences and access to scientific resources and opportunities, and authors of those same backgrounds but from diaspora communities in the US or Europe would again have very different experiences. Even within an overall very privileged group like white researchers, those based in places such as Eastern Europe aren’t going to be able to access the same resources as those in the UK or Sweden. I debated how fine-grained this analysis should be to account for those, but in the end I settled for the broader categories, since frankly it became clear that even at that scale I had some serious issues to reckon with.

Finally, there’s how my own positionality impacts this endeavor. I’m a white researcher from the US. Am I even the right person to be making these sorts of assessments? Honestly, probably not, at least in the academic context of “this is a thing I am going to research”. At the very least, if I were doing this as a full-fledged study, the responsible thing to do would be to collaborate with others who do have the relevant lived experience and expertise. (And if this ever turns into anything other than me posting about it on the internet, rest assured I will do so.)

But doing something as a research project is different from trying to develop a practice of self-accountability. I may be posting this publicly, but it’s fundamentally me evaluating a small slice of my efforts at increased equity, and identifying the glaringly obvious areas where I’ve missed the mark. That sort of self-evaluation seems very much incumbent on authors with one or more majoritarian identities to do and take seriously, instead of pushing it off onto marginalized scholars who have their own work to do and aren’t just free diversity consultants. It’s my responsibility to put in the work to find out where I’ve fallen short, and I encourage others to do the same.

I’m honestly not overly impressed with how well I did in my personal citation practice. It’s hard to compare against existing studies, since they typically are looking across many papers for trends instead of breaking down the citations in a single body of work, but even without comparing to the averages in my field it’s easy to spot troubling issues. Despite trying to make a conscious effort to cite authors from groups historically excluded from science, I’m still citing disproportionately white men from the Global North. The lack of Black and Indigenous authors is a particularly big issue that I need to more actively address going forward, and I need to be more proactive in including non-English literature. It was admittedly uncomfortable to realize that I hadn’t done as well as I would have hoped, but from this exercise, hopefully I can develop a more equitable and just approach to citation in the future.

References

Bertolero et al. 2020. Racial and ethnic imbalance in neuroscience reference lists and intersections with gender. bioRxiv.

Cite Black Women Collective

Di Bitetti and Ferreras 2017. Publish (in English) or perish: the effect on citation rate of using languages other than English in scientific publications. Ambio 46:121-127.

Diversify EEB

Meneghini et al. 2008. Articles by Latin American authors in prestigious journals have fewer citations. PLoS ONE.

Soares et al. 2022. Neotropical ornithology: reckoning with historical assumptions, removing systemic barriers, and reimagining the future. EcoEvoRxiv.

Teich et al. 2022. Citation inequity and gendered citation practices in contemporary physics. Nature Physics 18:1161-1170.

Zurn et al. 2022. Supporting academic equity in physics through citation equity. Communications Physics 5:240.

Jess McLaughlin