With peak submission season for machine learning conferences just behind us, many in our community have peer review on the mind. One especially hot topic is the arXiv preprint service. Computer scientists often post papers to arXiv in advance of formal publication to share their ideas and hasten their impact.
Despite the arXiv’s popularity, many authors are peeved, pricked, piqued, and provoked by requests from reviewers that they cite papers which have appeared only as arXiv preprints.
“Do I really have to cite arXiv papers?” they whine.
“Come on, they’re not even published!” they exclaim.
The conversation is especially testy owing to the increased use (read: misuse) of the arXiv by naifs. The preprint server, like the conferences proper, is awash in low-quality papers submitted by band-wagoners. Now that the tooling for deep learning has become so strong, it’s especially easy to clone a repo, run it on a new dataset, tweak a few hyper-parameters, and start writing up a draft.
Of particular worry is the practice of flag-planting. That’s when researchers, anticipating that an area will get hot, hastily throw unfinished work on the arXiv to stake their territory, whether to avoid getting scooped or to be the first scoopers: we were the first to work on X. All that follow must cite us. In a sublimely cantankerous rant on Medium, NLP/ML researcher Yoav Goldberg blasted the rising use of the (mal)practice.
In particular, he excoriated a paper from the prominent MILA research group which purported to have adapted the methods of generative adversarial networks to language. His gripe was that the generated language was laughable and actually far worse than that of any current technique. The authors, he surmised (and many I’ve spoken with agree), were staking their territory so that whoever first succeeds would need to cite them as the originators of the idea.
Amid this tumult, some have questioned the very enterprise of citing preprinted articles. So, if the arXiv may be subject to abuse, do I have to cite papers that have only appeared on the arXiv?
Yes, of course. Any time that our work follows, copies, or borrows ideas from other people, and when we can reasonably be expected to be aware of this, we ought to cite the related work.
A large number of seminal works have never been published. The greatest mathematics paper of our lifetimes remains unpublished. Not every paper on the arXiv warrants a bibliographic entry, but many do. The idea that unpublished status would categorically excuse us from the responsibility to cite is a bit preposterous. It puts far too much faith in the deeply flawed fraternity of conference organizers and the overworked cohort of peer reviewers, roughly 30% of whom typically fail to even comprehend the basic outline of the paper.
If similar work comes to our attention during a proper literature review, we ought to cite it. If we knowingly build on someone else’s work, we should cite it. If someone shares a non-obvious idea with us that develops into a paper, we should find some way to credit them. If someone writes a theory down on a napkin shortly before dying and it turns out to open a new subfield of machine learning to scientific inquiry, we should convert the napkin to a PDF, upload it to arXiv, and then cite it.
We should not have to cite nonsense. Many reviewers are abusing the system and asking for ridiculous comparisons to recently posted preprints. Bald-faced flag-planting should not be rewarded. And we should not be faulted by reviewers for failing to compare against two-week-old algorithms that may or may not work. But the very idea that arXiv papers would in general not need to be cited puts far too much faith in the fraught process of scientific publication and far too little importance on ideas themselves.