(very John Oliver voice) Good evening. Tonight’s top story: Web scraping. That’s when someone uses an automated tool to download, typically, large amounts of information from the Web and save it on their own computer for their own purposes. Now, there are situations where this is undoubtedly for the general good, like saving climate data so that it can’t be made to vanish with an act of government whimsy. But when personal information enters the picture, the ethical considerations can change, and there can be times when “it was available to the public!” becomes little more reassuring than, “Yes, I am following you while I happen to be carrying this camera, but you were walking outside, so you have to be OK with being seen, oh, and is this your regular bus stop?”
Just a few days ago, the New York Times ran a provocative and alarming piece you have quite possibly already seen, titled “The Secretive Company That Might End Privacy as We Know It” — oh, you’ve read it? Thanks for telling us, but don’t worry, Mark Zuckerberg already knows. Basically, the M.O. of Clearview AI was to scrape pictures from, as the Times says, “Facebook, YouTube, Venmo and millions of other websites,” and then use the resulting massive database to fuel the wet dreams of petty aspiring autocrats everywhere.
Now, as with so many of the depressing and horrifying developments of modern life, there is a level on which this feels absolutely unsurprising. Just another way in which the utopian promise of the Internet [stock art of Tron appears over shoulder] was betrayed and perverted into something irredeemably toxic by some guy out to make a quick buck. Back in 1990, when (now Sir) Tim Berners-Lee created the first ever website at CERN, the physics laboratory that now hosts the Large Hadron Collider, surely this disaster was not what he had in mind. Even today, surely we can count on scientists to rise above the venal impulses of the money-grubbers, hold themselves to the highest ethical standards in the pursuit of truth, act with discretion toward the communities that their research affects, and I’m just fucking with you. Scientists are people. Sometimes greedy, often fallible. And the process of correcting an error, even one due to simple carelessness, can be remarkably painful for all concerned.
I have been involved in writing an open letter in response to what I myself like to call “The Adventure of the Scandalous Cauliflower.” That open letter is available here in PDF and basic HTML. I was not the first person to call attention to this matter, nor perhaps even the loudest, but I like formatting academic documents, so the organizing somewhat fell to me by default. As typically happens in cases where an open letter gets written, everyone involved has their own opinions that may stretch beyond its margins, and I’m sure that I have my own takes (or at least choices of emphasis) that would not be co-signed by all of the letter’s signatories. This blog post is, beyond providing a pointer to the open letter, my attempt to underline that my idiosyncrasies should not be attributed to anybody else unless they have expressly indicated that they share those particular takes of mine.
The very short version is that a group of researchers at the University of Milan hoovered up a large quantity of social-media data without informing any of the communities they were studying, violated the Terms of Service of one community in that set, one explicitly devoted to scholars and academics, and thanks to a truly impressive feat of analyzing without thinking, concluded that the topic of cauliflower is a serious transgression of their subjects’ social norms.
This is how the letter begins:
We are writing to raise grave concerns regarding the ethics and methodology of “Mastodon Content Warnings: Inappropriate Contents in a Microblogging Platform,” by Matteo Zignani et al. of the University of Milan. The issues with this paper are sufficiently severe that the paper’s dataset has been removed from Harvard’s Dataverse repository. This open letter will explain the background of this removal and urge further action on the part of the paper’s authors, the University of Milan, and the Association for the Advancement of Artificial Intelligence (AAAI), who have published the paper in their conference proceedings. As we detail below, the data analyzed in this paper was not collected ethically, the authors failed to take even simple steps to anonymize the data released with the paper, and fundamental errors of methodology make its results irrelevant.
Mastodon is a decentralized, community-operated microblogging platform created in early 2016 by Eugen Rochko. It is based on open protocols that allow people to communicate across different servers. Anyone who wishes to create a Mastodon server, or instance, can do so by downloading and installing the Mastodon software. Users who register accounts at an instance can then share social-media posts with other users on that instance as well as with users on other instances. The interconnection of different servers is known as federation.
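To make the federation business concrete: when one server needs to find a user on another, it uses WebFinger (RFC 7033), the discovery step underneath the ActivityPub ecosystem that Mastodon belongs to. A minimal sketch, in Python, of how a handle turns into a lookup URL — the handle `alice@example.social` is invented for illustration:

```python
# Sketch of fediverse account discovery: a handle like
# '@alice@example.social' maps to a WebFinger query on that
# user's home server. Endpoint path is standard (RFC 7033);
# the example handle and domain are made up.

from urllib.parse import quote

def webfinger_url(handle: str) -> str:
    """Build the WebFinger lookup URL that another instance
    would query to locate the account behind a handle."""
    user, _, domain = handle.lstrip("@").partition("@")
    if not user or not domain:
        raise ValueError(f"not a valid fediverse handle: {handle!r}")
    # The 'acct:' resource identifier is percent-encoded in full.
    resource = quote(f"acct:{user}@{domain}", safe="")
    return f"https://{domain}/.well-known/webfinger?resource={resource}"

print(webfinger_url("@alice@example.social"))
```

This is also why the scraping in question was so easy: anything a server federates out is, by design, fetchable by any other server that asks. Ease of access, of course, is not the same thing as permission.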
Other things that could have ended up in the letter if I had been left to my own devices and if it weren’t already going to be fairly long:
- history of Mastodon’s “Content Warning” feature
- variability in CW practices and discontent caused thereby
- more details on the underlying ActivityPub protocol and its peculiarities
- the ongoing development of fediverse software that isn’t Mastodon
- the paper’s generally rushed, “we need a thing for this conference” feel
- why it’s a good thing that professional codes of conduct can be a less blunt instrument than the law
- the ambiguities fostered when the same software can be used as a means of publication and of social interaction
In short: Lots of indefinitely deep rabbit holes, and opportunities to say “a balance must be struck between the need for X and for Y” — though now that I’ve typed that phrase, I have to wonder how much that mode of rhetoric fuels compulsive centrism. Moreover, these are topics where it would be harder to pull together a core of agreement. I mean, would 45 other fediversians sign on to anything I wrote on my own about all that?
One way in which I am perhaps peculiar among the signatories is that, though I started this blog post talking about web scraping, in order to be topical and all, that’s not really the background from which I approached the topic. Indeed, this incident seems significantly disanalogous to many if not most of the times I can recall that web scraping has become a hot-button issue for one reason or another. We’re in the realm of (potential) research misconduct, of science being done badly, and how lapses in ethics cannot always be pried apart from flaws of methodology. I think that the oversimplified “responsible conduct in science” lessons that we get spoon-fed in school tend to create the impression that ethical issues arise when the results of a study are reliable, but the study was conducted in an objectionable way. However, that separation is too clean. Why should a study conducted with a lack of care be taken as reliable?
So, I could instead have started this blog post by invoking my very John Oliver voice and intoning, “Our top story tonight: Ethics. Otherwise known as the reason why just because you could, that doesn’t mean you should.”
But even that doesn’t quite get at the core of the matter.
If, for example, you are doing a study of 363 different online communities, each running its own server, and you don’t have the resources to examine the Terms of Service for each of those 363 installations and see if what you’re doing is in accord with all of them, then how can you say that you have the resources to evaluate the data you gathered on them?
As strange as this may sound, I don’t actually like it when I come across as mean. Every day, I wish I could use the Internet as a device for being kind to people I’ve never met.
But if science is to be a thing we value, we must hold it to account. That can mean booting scientists from the National Academy for sexual harassment, or taking an uncomfortably hard look at what telescopes we think we “need” to build, or even a critical analysis of a scandalous cauliflower.