The difference between Google and Aaron Swartz
By Kevin Webb
The opinions expressed are his own.
Reading about Aaron Swartz’s recent run-in with the law dredged up all kinds of feelings. I’m a long-time admirer of his work and was saddened to hear of his troubles. At the same time, reading the indictment, I was surprised by the seriousness of the charges and evidence against him.
I was also reminded of my own attempts at similar work, collecting and analyzing journal articles, patents, and various forms of metadata. I’ve lost count of how many hours I’ve spent sitting in basements of academic buildings, breaking federal laws in the pursuit of answers. And I was reminded of my colleagues who still spend their days painstakingly scraping data off the web, sometimes legally, sometimes not, in the name of academic inquiry.
None of us want to break the law. It’s simply that we don’t have a choice.
The mechanisms for sharing academic discourse are broken. They barely even function as systems for connecting interested parties within existing disciplines. Ask just about anyone who spends their time writing or consuming scholarly work and you will hear a litany of complaints about how poorly suited the academic publishing industry is to modern-day collaboration.
I’ve spent most of my professional career just outside of the academy but have seen the failures of these systems firsthand. I formed my opinion on the matter as an undergraduate assistant in a major neuroscience laboratory, building publishing tools to help the lab’s director break copyright law.
His work regularly appeared in and on the cover of major journals. Yet he was in a field that was moving faster than the journals could facilitate. He took matters into his own hands by publishing the articles on the laboratory’s site, almost always violating the licensing terms of his own work (rights now held by Elsevier or AAAS, not the author). I asked about the legality of what we were doing and was told not to worry. If the journals didn’t like him bending or breaking the law, he’d publish elsewhere and it would be their loss.
As far as I know, the publishers understood the bargain and never complained. Unfortunately, this sort of non-aggression pact is available only to a select few. Your average untenured neuroscience professor doesn’t have the luxury of pissing off Science or Nature.
But for those of us interested in meta-analysis (the questions about questions that people like Aaron and me are forced to pursue from basement wiring cabinets, scraping large swaths of text from the web), the hobbled and clunky tools for downloading PDFs through research library proxy servers, one poorly OCR’ed page at a time, simply do not work.
That said, it’s unclear what Swartz’s intentions for his scraped JSTOR content were. Some folks, including the FBI, have claimed he was planning to redistribute the data. Others have pointed to his past research analyzing influence in academic writing. I have no insight into his real intentions. I do believe, however, that the latter goal is important and likely not possible without breaking the kinds of laws discussed above.
Also, it’s true that JSTOR does now offer a bulk interface for research users. That interface didn’t exist when I was doing my work. But it’s not clear it would have made any difference. There are many, many research applications, including mine, that are still not possible with approved means of accessing data. This essentially means that if you want to understand the collaborative nature of a specific field or follow the trajectory of an idea across disciplines, a reference librarian can’t help you. Instead, you have to become a felon.
What’s missing from the news articles about Swartz’s arrest is a realization that the methods of collection and analysis he used are exactly what make companies like Google valuable to their shareholders and their users. The difference is that Google can throw the weight of its name behind its scrapers, just as my former boss used his name to set the terms with those publishing his work.
Aaron and the other “hackers and thieves” like him don’t have that option. But their work is no less important: they are collecting and organizing information in order to ask deep questions about the nature of academic discourse. Unfortunately for most, the structure of the publishing industry and the laws that surround creative works prevent these questions from being asked, at least without taking sometimes substantial risks.
It shouldn’t and doesn’t have to be this way, but there are at least two main issues holding back progress:
First, as a society we’ve forgotten the Jeffersonian ideal that intellectual property laws should enable and encourage the spread of ideas and creative pursuits rather than lock them away. Many have fought for a return to this vision; however, the prospects for such change seem dim. If there’s anywhere this idea should still have a fighting chance, it’s within the walls of universities.
However, it is this most basic failure, our inability to create a rational set of intellectual property laws, that necessitates the creation of things like JSTOR. We shouldn’t need it in the first place. Nor should anyone curious enough to ask questions as big as Aaron’s ever need to break JSTOR or the law to find answers.
We should offer people with big questions more than a trip to jail; we should celebrate their willingness to explore our collective intellectual heritage. Universities should take the lead in building the platforms needed to support such inquiry. It is an embarrassment that JSTOR is the best the academy has to offer.
But this leads to the second and perhaps more fundamental problem: journals are only partly about communicating. They’re also about controlling academic discourse. The editorial power held by journals and those who run them (quite different from those who own them) shapes most academic careers and the very structure of disciplines. It’s almost certain that pursuing new forms of collaboration and communication will reshape these power structures, sometimes subtly, sometimes not. That’s the nature of change.
Change, however, doesn’t come easily within academic communities. It should be no surprise that universities have done far more to free the content of their courses than they have the content of their publications. The former has economic value; the latter, however, holds the keys to the academy itself.
This conservatism is at least partly why, despite the new possibilities offered by the web, most scholarly work is still published as though it were 1580. It’s also responsible for allowing a handful of powerful corporations to gate access to this knowledge and make authors pay for the privilege of signing away rights to their own work.
Sir Tim Berners-Lee invented the web to solve this very problem. Twenty years later, it allows us to do almost everything imaginable, except get unfettered access to scholarly communication.
It is not technology that holds us back.
Aaron’s arrest should be a wake-up call to universities: evidence of how fundamentally broken this core piece of their architecture remains despite decades of progress in advancing communication and collaboration.
The MIT staff who called the FBI would have been better served by calling the chancellor to ask, “How have we created a system that forces 25-year-olds to sneak around in the basement, hiding hard drives in closets in order to ask basic and important questions about our work? Can’t we do better?”
A version of this essay originally appeared on Webb’s blog.