The difference between Google and Aaron Swartz
By Kevin Webb
The opinions expressed are his own.
Reading about Aaron Swartz’s recent run-in with the law dredged up all kinds of feelings. I’m a long-time admirer of his work and was saddened to hear of his troubles. At the same time, reading the indictment, I was surprised by the seriousness of the charges and evidence against him.
I was also reminded of my own attempts at similar work, collecting and analyzing journal articles, patents, and various forms of metadata. I’ve lost count of how many hours I’ve spent sitting in basements of academic buildings, breaking federal laws in the pursuit of answers. And I was reminded of my colleagues who still spend their days painstakingly scraping data off the web, sometimes legally, sometimes not, in the name of academic inquiry.
None of us want to break the law. It’s simply that we don’t have a choice.
The mechanisms for sharing academic discourse are broken. They barely even function as systems for connecting interested parties within existing disciplines. Ask just about anyone who spends their time writing or consuming scholarly work and you will hear a litany of complaints about how poorly suited the academic publishing industry is to modern-day collaboration.
I’ve spent most of my professional career just outside of the academy but have seen the failures of these systems firsthand. I formed my opinion on the matter as an undergraduate assistant in a major neuroscience laboratory, building publishing tools to help the lab’s director break copyright law.
His work regularly appeared in and on the cover of major journals. Yet he was in a field that was moving faster than the journals could help facilitate. He took matters into his own hands by publishing the articles on the laboratory’s site, almost always violating the licensing terms of his own work (rights now held by Elsevier or AAAS, not the author). I asked about the legality of what we were doing and was told not to worry. If the journals didn’t like him bending or breaking the law, he’d publish elsewhere and it would be their loss.
As far as I know, the publishers understood the bargain and never complained. Unfortunately, this sort of non-aggression pact is available only to a select few. Your average untenured neuroscience professor doesn’t have the luxury of pissing off Science or Nature.
But for those of us interested in meta-analysis (the questions about questions that people like Aaron and me are forced to pursue from basement wiring cabinets, scraping large swaths of text from the web), the hobbled and clunky tools for downloading PDFs through research library proxy servers, one poorly OCR’ed page at a time, simply do not work.
That said, it’s unclear what Swartz’s intentions for his scraped JSTOR content were. Some folks, including the FBI, have made the claim he was planning to redistribute the data. Others have pointed to his past research analyzing influence in academic writing. I have no insight into his real intentions; however, I do believe the latter goal is important and likely not possible without breaking the kinds of laws discussed above.
Also, it’s true that JSTOR does now offer a bulk interface for research users. That interface didn’t exist when I was doing my work. But it’s not clear it would have made any difference. There are many, many research applications, including mine, that are still not possible with approved means of accessing data. This essentially means that if you want to understand the collaborative nature of a specific field or follow the trajectory of an idea across disciplines, a reference librarian can’t help you. Instead, you have to become a felon.
What’s missing from the news articles about Swartz’s arrest is a realization that the methods of collection and analysis he used are exactly what make companies like Google valuable to their shareholders and their users. The difference is that Google can throw the weight of its name behind its scrapers, just as my former boss used his name to set the terms with those publishing his work.
Aaron and the other “hackers and thieves” like him don’t have that option. But their work is no less important: they are collecting and organizing information in order to ask deep questions about the nature of academic discourse. Unfortunately for most, the structure of the publishing industry and the laws that surround creative works prevent these questions from being asked, at least without taking sometimes substantial risks.
It shouldn’t and doesn’t have to be this way, but there are at least two main issues holding back progress:
First, as a society we’ve forgotten the Jeffersonian ideal that intellectual property laws should enable and encourage the spread of ideas and creative pursuits rather than lock them away. Many have fought for a return to this vision; however, the prospects for such change seem dim. If there’s anywhere this idea should still have a fighting chance, it’s within the walls of universities.
However, it is this most basic failure, our inability to create a rational set of intellectual property laws, that necessitates the creation of things like JSTOR. We shouldn’t need it in the first place. Nor should anyone curious enough to ask questions as big as Aaron’s ever need to break JSTOR or the law to find answers.
We should offer people with big questions more than a trip to jail; we should celebrate their willingness to explore our collective intellectual heritage. Universities should take the lead in building the platforms needed to support such inquiry. It is an embarrassment that JSTOR is the best the academy has to offer.
But this leads to the second and perhaps more fundamental problem: journals are only partly about communicating. They’re also about controlling academic discourse. The editorial power held by journals and those who run them (quite different from those who own them) shapes most academic careers and the very structure of disciplines. It’s almost certain that pursuing new forms of collaboration and communication will reshape these power structures, sometimes subtly, sometimes not. That’s the nature of change.
Change, however, doesn’t come easily within academic communities. It should be no surprise that universities have done far more to free the content of their courses than they have the content of their publications. The former has economic value; the latter, however, holds the keys to the academy itself.
This conservatism is at least in part why, despite the new possibilities offered by the web, most scholarly work is still published as though it were 1580. It’s also responsible for allowing a handful of powerful corporations to gate access to this knowledge and make authors pay for the privilege of signing away rights to their own work.
Sir Tim Berners-Lee invented the web to solve this very problem. Twenty years later it allows us to do almost everything imaginable, except get unfettered access to scholarly communication.
It is not technology that holds us back.
Aaron’s arrest should be a wake-up call to universities: evidence of how fundamentally broken this core piece of their architecture remains despite decades of progress in advancing communication and collaboration.
The MIT staff who called the FBI would have been better served by calling the chancellor to ask, “How have we created a system that forces 25-year-olds to sneak around in the basement, hiding hard drives in closets in order to ask basic and important questions about our work? Can’t we do better?”
A version of this essay originally appeared on Webb’s blog.