Category Archives: Transparency

The Adventure of the Scandalous Cauliflower

(very John Oliver voice) Good evening. Tonight’s top story: Web scraping. That’s when someone uses an automated tool to download, typically, large amounts of information from the Web and save it on their own computer for their own purposes. Now, there are situations where this is undoubtedly for the general good, like saving climate data so that it can’t be made to vanish with an act of government whimsy. But when personal information enters the picture, the ethical considerations can change, and there can be times when “it was available to the public!” becomes little more reassuring than, “Yes, I am following you while I happen to be carrying this camera, but you were walking outside, so you have to be OK with being seen, oh, and is this your regular bus stop?”.

Just a few days ago, the New York Times ran a provocative and alarming piece you have quite possibly already seen, titled “The Secretive Company That Might End Privacy as We Know It” — oh, you’ve read it? Thanks for telling us, but don’t worry, Mark Zuckerberg already knows. Basically, the M.O. of Clearview AI was to scrape pictures from, as the Times says, “Facebook, YouTube, Venmo and millions of other websites,” and then use the resulting massive database to fuel the wet dreams of petty aspiring autocrats everywhere.

Now, as with so many of the depressing and horrifying developments of modern life, there is a level on which this feels absolutely unsurprising. Just another way in which the utopian promise of the Internet [stock art of Tron appears over shoulder] was betrayed and perverted into something irredeemably toxic by some guy out to make a quick buck. Back in 1990, when (now Sir) Tim Berners-Lee created the first ever website at CERN, the physics laboratory that now hosts the Large Hadron Collider, surely this disaster was not what he had in mind. Even today, surely we can count on scientists to rise above the venal impulses of the money-grubbers, hold themselves to the highest ethical standards in the pursuit of truth, act with discretion to the communities that their research affects and I’m just fucking with you. Scientists are people. Sometimes greedy, often fallible. And the process of correcting an error, even one due to simple carelessness, can be remarkably painful for all concerned.

I have been involved in writing an open letter in response to what I myself like to call “The Adventure of the Scandalous Cauliflower.” That open letter is available here in PDF and basic HTML. I was not the first person to call attention to this matter, nor perhaps even the loudest, but I like formatting academic documents, so the organizing somewhat fell to me by default. As typically happens in cases where an open letter gets written, everyone involved has their own opinions that may stretch beyond its margins, and I’m sure that I have my own takes (or at least choices of emphasis) that would not be co-signed by all of the letter’s signatories. This blog post is, beyond providing a pointer to the open letter, my attempt to underline that my idiosyncrasies should not be attributed to anybody else unless they have expressly indicated that they share those particular takes of mine.

The very short version is that a group of researchers at the University of Milan hoovered up a large quantity of social-media data without informing any of the communities they were studying, violated the Terms of Service of a community in that set explicitly devoted to scholars and academics, and thanks to a truly impressive feat of analyzing without thinking, concluded that the topic of cauliflower is a serious transgression of their subjects’ social norms.

This is how the letter begins:

We are writing to raise grave concerns regarding the ethics and methodology of “Mastodon Content Warnings: Inappropriate Contents in a Microblogging Platform,” by Matteo Zignani et al. of the University of Milan. The issues with this paper are sufficiently severe that the paper’s dataset has been removed from Harvard’s Dataverse repository. This open letter will explain the background of this removal and urge further action on the part of the paper’s authors, the University of Milan, and the Association for the Advancement of Artificial Intelligence (AAAI), who have published the paper in their conference proceedings. As we detail below, the data analysed in this paper was not collected ethically, failing to take even simple steps to anonymize the data released with the paper, and fundamental errors of methodology make its results irrelevant.

Mastodon is a decentralized, community-operated microblogging platform created in early 2016 by Eugen Rochko and is based on open protocols that allow people to communicate across different servers. Anyone who wishes to create a Mastodon server, or instance, can do so by downloading and installing the Mastodon software. Users who register accounts at an instance can then share social-media posts with other users on that instance as well as with other instances. The interconnection of different servers is known as federation.

Other things that could have ended up in the letter if I had been left to my own devices and if it weren’t already going to be fairly long:

  • history of Mastodon’s “Content Warning” feature
  • variability in CW practices and discontent caused thereby
  • more details on the underlying ActivityPub protocol and its peculiarities
  • the ongoing development of fediverse software that isn’t Mastodon
  • the paper’s generally rushed, “we need a thing for this conference” feel
  • why it’s a good thing that professional codes of conduct can be a less blunt instrument than the law
  • the ambiguities fostered when the same software can be used as a means of publication and of social interaction

In short: Lots of indefinitely deep rabbit holes, and opportunities to say “a balance must be struck between the need for X and for Y” — though now that I’ve typed that phrase, I have to wonder how much that mode of rhetoric fuels compulsive centrism. Moreover, these are topics where it would be harder to pull together a core of agreement. I mean, would 45 other fediversians sign on to anything I wrote myself on my own about all that?

One way in which I am perhaps peculiar among the signatories is that, though I started this blog post talking about web scraping, in order to be topical and all, that’s not really the background from which I approached the topic. Indeed, this incident seems significantly dis-analogous to many if not most of the times I can recall that web scraping has become a hot-button issue for one reason or another. We’re in the realm of (potential) research misconduct, of science being done badly, and how lapses in ethics cannot always be pried apart from flaws of methodology. I think that the oversimplified “responsible conduct in science” lessons that we get spoon-fed in school tend to create the impression that ethical issues arise when the results of a study are reliable but the study was conducted in an objectionable way. However, that separation is too clean. Why should a study conducted with a lack of care be taken as reliable?

So, I could instead have started this blog post by invoking my very John Oliver voice and intoning, “Our top story tonight: Ethics. Otherwise known as the reason why just because you could, that doesn’t mean you should.”

But even that doesn’t quite get at the core of the matter.

If, for example, you are doing a study of 363 different online communities, each running its own server, and you don’t have the resources to examine the Terms of Service for each of those 363 installations and see if what you’re doing is in accord with all of them, then how can you say that you have the resources to evaluate the data you gathered on them?

As strange as this may sound, I don’t actually like it when I come across as mean. Every day, I wish I could use the Internet as a device for being kind to people I’ve never met.

But if science is to be a thing we value, we must hold it to account. That can mean booting scientists from the National Academy for sexual harassment, or taking an uncomfortably hard look at what telescopes we think we “need” to build, or even a critical analysis of a scandalous cauliflower.

Predator/prey or Perish

Looking at academic publishing from the perspective of Fully Automated Luxury Gay Space Communism is an interesting experience.

Consider, for example, the term predatory publisher for shady outfits that will accept anything for the right fee and put it on a website that calls itself a “journal”. Scummy behavior, right? But is it really “predatory”? What fraction, exactly, of their customers are being conned, and how many are walking into the deal with their eyes wide open? A used-car salesman might be a sleaze, but if you’re going to his dealership to pay cash for a getaway car, the relationship is more of a symbiosis.

I’m sure it’s convenient for the legacy institutions to present the situation as saintly scholars being exploited by deceptive newcomers. [cough]

Suppose the Web came to be, but there never were any respectable Open Access journals. No “Open Letter to Scientific Publishers” in 2001, so no Public Library of Science; no Budapest Initiative in 2002 or Berlin Declaration in 2003. Would the morass of “predatory” OA really look all that different? Perhaps not. Websites are cheap, calling yourself a journal is easy, and as we just noted, there’s a ready market.

But without the cover of PLOS and the like, would “predatory” OA have a veneer of respectability to offer its customers? Well, consider that paying to attend conferences is a thing that academia finds universally respectable. So, a “predator” could do what outfits like WASET do now: offer “conferences” with no standards, no dedicated space, perhaps not even a physical event. And if you’ve got a paper, great! For only a modest additional fee, it can go in the conference proceedings, which will conveniently be available online.

Quality is always the hard course of action. Legitimate OA journals were optional; only the pay-to-play racket was inevitable.

The Mills of Institutional Review Boards

Back in 2017, nominal intellectuals Peter Boghossian and James Lindsay proved that they were willing to invest a lot of energy in lying. Not content to rest on those laurels, they joined with Helen Pluckrose and soon redoubled their efforts. I didn’t have much to say about their “Grievance Studies” brouhaha when that story broke last fall, apart from the observation that the Bogdanov brothers had a better success rate — six out of six bullshit papers accepted! — and I’m still waiting for Steven Pinker to call the whole of physics a heap of postmodern rubbish.

(He won’t; he’s too busy appropriating its respectability.)

Today, I learned that Boghossian’s institutional review board found that his actions in the “Grievance Studies” hoax constitute research misconduct.

EDIT TO ADD: Further commentary on the ethics of academic hoaxes.

What Would I Buy With $3 Million for Math[s]?

Leading off the topic of my previous post, I think it’s a good time to ask what we can do with resources that are already allocated. How can we fine-tune the application of resources already set aside for a certain purpose, and so achieve the best outcome in the current Situation?

This post will be a gentle fantasy, because sometimes, in the Situation, we need that, or because that’s all I can do today.

Last month, Evelyn Lamb asked, how should we revamp the Breakthrough Prize for mathematics? This is an award with $3 million attached, supported by tech billionaires. A common sentiment about such awards, a feeling that I happen to share, is that they go to people who have indeed accomplished good things, but that on the whole they aren’t a good way to spend money. Picking one person out of a pool of roughly comparable candidates and elevating them above their peers doesn’t really advance the cause of mathematics, particularly when the winner already has a stable position. Lamb comments,

$3 million a year could generously fund 30 postdoc years (or provide 10 3-year postdocs). I still think that wouldn’t be a terrible idea, especially as jobs in math are hard to come by for fresh PhD graduates. But […] more postdoc funding could just postpone the inevitable. Tenure track jobs are hard to come by in mathematics, and without more of them, the job crunch will still exist. Helping to create permanent tenured or tenure-track positions in math would ease up on the job crisis in math and, ideally, make more space for the many deserving people who want to do math in academia. […] from going to the websites of a few major public universities, it looks like it’s around $2.5 million to permanently endow a chair at that kind of institution.

I like the sound of this, but let’s not forget: If we have $3 million per year, then we don’t have to do the same thing every year! My own first thought was that if you can fund 10 postdocs for three years apiece, you can easily pay for 10 new open-source math textbooks. In rough figures, let us say that it takes about a year to write a textbook on material you know well. Then, the book has to be field-tested for at least a semester. To find errors in technical prose, you need to find people who don’t already know what it’s supposed to say, and have them work through the whole thing.
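For anyone who wants to poke at the numbers, here is a back-of-the-envelope sketch in Python. It uses only the figures quoted above ($3 million a year, 30 postdoc-years, roughly $2.5 million per endowed chair). The even ten-way split for textbooks is my own simplifying assumption, and the function names are made up for illustration.

```python
# Rough budget arithmetic using the figures quoted in the post.
ANNUAL_BUDGET = 3_000_000     # dollars per year attached to the prize
POSTDOC_YEARS = 30            # Lamb's figure: postdoc-years fundable per year
CHAIR_ENDOWMENT = 2_500_000   # Lamb's rough estimate for one endowed chair
TEXTBOOKS = 10                # the post's suggestion; even split is an assumption


def cost_per_postdoc_year(budget=ANNUAL_BUDGET, years=POSTDOC_YEARS):
    """Implied cost of one postdoc for one year."""
    return budget / years


def three_year_postdocs(budget=ANNUAL_BUDGET, term_years=3):
    """How many full three-year postdocs one year's budget covers."""
    return int(budget // (cost_per_postdoc_year() * term_years))


def chairs_endowed(budget=ANNUAL_BUDGET, endowment=CHAIR_ENDOWMENT):
    """Number of chairs one year's budget endows, plus the leftover dollars."""
    return divmod(budget, endowment)


def budget_per_textbook(budget=ANNUAL_BUDGET, books=TEXTBOOKS):
    """Per-book budget if one year's money is split evenly (an assumption)."""
    return budget / books


if __name__ == "__main__":
    print(cost_per_postdoc_year())   # implied dollars per postdoc-year
    print(three_year_postdocs())     # three-year postdocs per year of funding
    print(chairs_endowed())          # (chairs, leftover dollars)
    print(budget_per_textbook())     # dollars per textbook under an even split
```

Under these assumptions, each textbook gets about as much as one three-year postdoc, which is what makes the swap plausible in the first place.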

If we look at, say, what MIT expects of undergrad math majors, we can work up a list of courses:

Moderation In All Things

After attending the annual ScienceOnline meetings in North Carolina for many years, this time around, I won’t be going. The primary reason has nothing to do with the upsets in that community of late (oh, yes, I have thoughts, but they’re not for the sharing today). Oh, sure, not seeing the people I’d hoped to see because ongoing problems drove them away—that’s a fine secondary reason. Before and above all that, though, is the fact that I’m mid-PhD. I realized I could no longer justify the time, the stress and, indeed, the carbon footprint of traveling to attend #scio14.

What can one do? I revile air travel more every year. I don’t have time/energy to prepare for the conference beforehand, or to follow up on anything discussed there after. My proposal for the session I was to moderate was, to summarize only slightly, “hey let’s build this website”. Must I travel for that??

The Transparent Academy

You know what I’d like to see? I’d like to have all the course materials necessary for a good, solid undergraduate physics degree available online, free to access and licensed in a way which permits reuse and remixing. I’d like it all in one place, curated, with paths through it mapped out to define a curriculum. When I say all the course materials, I mean that this webzone should have online textbooks; copies of, or at least pointers to, relevant primary literature; video lectures; simulation codes; sample datasets on which to practice analysis; homework and exam problems with worked-out solutions; interactive quizzes, so we can be trendy; and ways to order affordable experimental equipment where that is possible, e.g., yes on diffraction gratings, but probably no on radioactive sources. I’m talking about physics, because that’s what I nominally know about, but I’d like this to encompass the topics which I got sent to other departments to learn about, like the Mathematics Department’s courses in single- and multivariable calculus, differential equations, linear algebra, group theory, etc.

One way to think about it is this: suppose you had to teach a physics class to first- or second-year undergraduates. Could you get all the textual materials you need from Open-Access sources on the Web? Would you know where to look?

What with Wikipedia, OpenCourseWare, review articles on the arXiv, science blogs, the Khaaaaaan! Academy and so forth, we probably already have a fair portion of this in various places. But the operative word there is various. I, at least, would like it gathered together so we can know what’s yet to be done. With a project like, say, Wikipedia, stuff gets filled in based on what people feel like writing about in their free time. So articles grow by the cumulative addition of small bits, and “boring” content — parts of the curriculum which need to be covered, but are seldom if ever “topical” — doesn’t get much attention.

I honestly don’t know how close we are to this ideal. And, I don’t know what would be the best infrastructure for bringing it about and maintaining it. Idle fantasies and pipe dreams!

I’d like to have this kind of resource, not just for the obvious practical reasons, but also because it would soothe my conscience. I’d like to be able to tell people, “Yes, physics and mathematics are difficult, technical subjects. The stuff we say often sounds like mystical arcana. But, if you want to know what we know, all we ask is time and thinking — we’ve removed every obstacle to your understanding which we possibly can.”

I don’t think this would really impact the physics cranks and crackpots that much, but that’s not the problem I’m aiming to (dreaming that we will) solve. Disdain for mathematics is one warning sign of a fractured ceramic, yes: I’ve lost count of the number of times I’ve seen websites claiming to debunk Einstein “using only high-school algebra!” We could make learning the mathematical meat of physics easier, but that won’t significantly affect the people whose crankishness is due to personality and temperament. Free calculus lessons, no matter how engaging, won’t help those who’ve dedicated themselves to fighting under the banner of Douche Physik.

Alchemists work for the people. —Edward Elric

Geek Fuel

A couple noteworthy items:

Tim Farley, who spoke at TAM 6 about Internet tools which skeptics can use, has put the transcript for his talk online. There’s a great deal of stuff in there to geek over; we can at least have a lot of fun while we’re failing to save the world. Farley makes me feel shame at the weakness of my RSS-fu.

Via the diligent Peter Suber comes word of Open Education News, an aggregator site for open-education developments (new teaching tools, textbooks which are free as in speech or as in beer, etc.). It’s interesting to see how much younger OE looks and feels than OA, with none of the crotchety arguments about definitions and fireballs hurled at traditional enemies.

Medical Journalism is Ill

Gary Schwitzer asks,

Is the news media doing a good job of reporting on new treatments, tests, products, and procedures? Ray Moynihan and colleagues analyzed how often news stories quantified the costs, benefits, and harms of the interventions being discussed, and how often they reported potential conflicts of interest in story sources [1]. Of the 207 newspaper and television stories that they studied, 83 did not report the benefits of medications quantitatively, and of the 124 stories that did quantify the benefits of medications, only 18 presented both relative and absolute benefits. Of all the stories, 53% had no information about potential harms of the treatment, and 70% made no mention of treatment costs. Of 170 stories that cited an expert or a scientific study, 85 (50%) cited at least one with a financial tie to the manufacturer of the drug, a tie that was disclosed in only 33 of the 85 stories.
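The counts in that passage are easier to compare as percentages. Here is a minimal Python sketch that converts them, using only the numbers quoted above; the variable names and the `pct` helper are mine, not anything from the study.

```python
# Counts quoted from Schwitzer's summary of Moynihan et al.
total_stories = 207
no_quantified_benefit = 83        # did not report benefits quantitatively
quantified_benefit = 124          # did quantify benefits
relative_and_absolute = 18        # presented both relative and absolute benefits
cited_expert_or_study = 170       # stories citing an expert or a study
cited_with_financial_tie = 85     # at least one source tied to the manufacturer
tie_disclosed = 33                # stories that disclosed that tie


def pct(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


if __name__ == "__main__":
    # The two benefit categories should account for every story.
    assert no_quantified_benefit + quantified_benefit == total_stories
    print(pct(no_quantified_benefit, total_stories))            # no numbers at all
    print(pct(relative_and_absolute, quantified_benefit))       # both framings, of those quantifying
    print(pct(cited_with_financial_tie, cited_expert_or_study)) # sources with a financial tie
    print(pct(tie_disclosed, cited_with_financial_tie))         # ties actually disclosed
```

Put that way, the disclosure figure is the one that stands out: well under half of the stories with a financially tied source said so.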

Moynihan et al. (2000) inspired some Australians to do a similar survey in 2004, which found after six months that Australian print and online news coverage of medical advances was “poor.” Now, Schwitzer has done a more extensive survey of United States media. The punchline is as follows:

In our evaluation of 500 US health news stories over 22 months, between 62%–77% of stories failed to adequately address costs, harms, benefits, the quality of the evidence, and the existence of other options when covering health care products and procedures. This high rate of inadequate reporting raises important questions about the quality of the information US consumers receive from the news media on these health news topics.

Details are available at PLoS Medicine. Now, we just need somebody to pay for a similar survey of non-medical science reporting.

(Tip o’ the fedora to Steve Novella.)

Requiescat in Wikipace?

Alun Salt, an archaeology PhD student and therefore an elitist expert by Internet standards, used to edit Wikipedia, but after five hundred-odd edits, he decided to give up and become Wikipedian Emeritus. In giving his reasons, he also made a prediction:

From the limited information available it looks like the combination of Knol [see here] and Wikipedia’s policies will be a Wikipedia-killer.

First off Knol will attract experts because of its emphasis on authorship. Additional features like collaborative authoring will attract people who can work together. You can also bet that Google will be marketing Knol as a tool to experts. Even without migration from Wikipedia that will be a blow. The material will be protected from plagiarism. If there’s one company that can find copies on the web, it’s Google.

I find the idea of having one company in charge of hosting content and providing search functionality a little, well, spooky — and yes, that already applies to YouTube and Blogger — but moving on:

Open Access Terminology

Peter Suber and Stevan Harnad have been trying to clarify the different meanings of the term “Open Access.” Recently, these two advocates of the OA cause issued a joint statement which began as follows:

The term “open access” is now widely used in at least two senses. For some, “OA” literature is digital, online, and free of charge. It removes price barriers but not permission barriers. For others, “OA” literature is digital, online, free of charge, and free of unnecessary copyright and licensing restrictions. It removes both price barriers and permission barriers. It allows reuse rights which exceed fair use.

Suber and Harnad proposed using “weak OA” to describe the former kind, literature which is “price-barrier-free,” and “strong OA” for the latter, “permission-barrier-free” variety. Shortly thereafter, however, people got flustered and pointed out that “weak OA” is unnecessarily pejorative. After all, even lowering price barriers is a good thing, and there’s no reason to make life harder for the people trying to do that. Better terminology is needed. This is a chance for all you aspiring wordanistas to lead your very own revolution! (Given the audience which finds OA issues of interest, your revolution will be well-blogged, but not televised.) Can you come up with a better alternative than the current options like “Basic OA” versus “Full OA”?

Hopefully, whatever terms we end up using to denote these gradations in scale will be more illuminating than the ones employed by the US Post Office. Every time I walk in to post something, I find myself befuddled by Express, Priority and First-Class designations: which one is actually the fastest and the most expensive? When each term is tarted up to sound as exciting as possible, their ability to indicate a scale of any kind is ruined.

Wilson on Wikipedia

Geologist Mark Wilson has an interesting opinion piece at Inside Higher Ed, “Professors Should Embrace Wikipedia.” While it was published on April Fool’s Day, one can take it in full seriousness. I don’t agree with it fully, but I believe the points it raises are well worth discussing. Here’s the nub of his argument:

What Wikipedia too often lacks is academic authority, or at least the perception of it. Most of its thousands of editors are anonymous, sometimes known only by an IP address or a cryptic username. Every article has a “talk” page for discussions of content, bias, and organization. “Revert” wars can rage out of control as one faction battles another over a few words in an article. Sometimes administrators have to step in and lock a page down until tempers cool and the main protagonists lose interest. The very anonymity of the editors is often the source of the problem: how do we know who has an authoritative grasp of the topic?

That is what academics do best. We can quickly sort out scholarly authority into complex hierarchies with a quick glance at a vita and a sniff at a publication list. We make many mistakes doing this, of course, but at least our debates are supported with citations and a modicum of civility because we are identifiable and we have our reputations to maintain and friends to keep. Maybe this academic culture can be added to the Wild West of Wikipedia to make it more useful for everyone?

And here’s his proposal for action:

I propose that all academics with research specialties, no matter how arcane (and nothing is too obscure for Wikipedia), enroll as identifiable editors of Wikipedia. We then watch over a few wikipages of our choosing, adding to them when appropriate, stepping in to resolve disputes when we know something useful. We can add new articles on topics which should be covered, and argue that others should be removed or combined. This is not to displace anonymous editors, many of whom possess vast amounts of valuable information and innovative ideas, but to add our authority and hard-won knowledge to this growing universal library.

An old saying has it that of all kinds of politics, academic is the nastiest, because the stakes are the lowest. One might fret that legions of quarrelsome professors would trample all over the pages pertaining to the controversies in their own specialized fields, bringing all the fury of the Dawkins/Gould or Fodor/Dennett deathmatches to the world of Wikipedia. However, I know of no evidence suggesting that these arguments would really be any more vituperative than the ones which already occur. Furthermore, Wikipedians with advanced degrees already exist; Wilson’s proposal would only bring in a larger number of them, perhaps with a shared ethos or sense of common purpose.