The difference between Google and Aaron Swartz
By Kevin Webb
The opinions expressed are his own.
Reading about Aaron Swartz’s recent run-in with the law dredged up all kinds of feelings. I’m a long-time admirer of his work and was saddened to hear of his troubles. At the same time, reading the indictment, I was surprised by the seriousness of the charges and evidence against him.
I was also reminded of my own attempts at similar work, collecting and analyzing journal articles, patents, and various forms of metadata. I’ve lost count of how many hours I’ve spent sitting in basements of academic buildings, breaking federal laws in the pursuit of answers. And I was reminded of my colleagues who still spend their days painstakingly scraping data off the web–sometimes legally, sometimes not–in the name of academic inquiry.
None of us want to break the law. It’s simply that we don’t have a choice.
The mechanisms for sharing academic discourse are broken. They barely even function as systems for connecting interested parties within existing disciplines. Ask just about anyone who spends their time writing or consuming scholarly work and you will hear a litany of complaints about how poorly suited the academic publishing industry is to modern day collaboration.
I’ve spent most of my professional career just outside of the academy but have seen the failures of these systems first hand. I formed my opinion on the matter as an undergraduate assistant in a major neuroscience laboratory–building publishing tools to help the lab’s director break copyright law.
His work regularly appeared in and on the cover of major journals. Yet he was in a field that was moving faster than the journals could help facilitate. He took matters into his own hands by publishing the articles on the laboratory’s site, almost always violating the licensing terms of his own work (rights now held by Elsevier or AAAS, not the author). I asked about the legality of what we were doing and was told not to worry. If the journals didn’t like him bending or breaking the law, he’d publish elsewhere and it would be their loss.
As far as I know the publishers understood the bargain and never complained. Unfortunately this sort of non-aggression pact is available only to a select few. Your average untenured neuroscience professor doesn’t have the luxury of pissing off Science or Nature.
But for those of us interested in meta-analysis–these questions about questions that people like Aaron and myself are forced to pursue from basement wiring cabinets, scraping large swaths of text from the web–the hobbled and clunky tools for downloading PDFs through research library proxy servers, one poorly OCR’ed page at a time, simply do not work.
That said, it’s unclear what Swartz’s intentions for his scraped JSTOR content were. Some folks, including the FBI, have made the claim he was planning to redistribute the data. Others have pointed to his past research analyzing influence in academic writing. I have no insight into his real intentions; however, I do believe the latter goal is important and likely not possible without breaking the kinds of laws discussed above.
Also, it’s true that JSTOR does now offer a bulk interface for research users. That interface didn’t exist when I was doing my work. But it’s not clear it would have made any difference. There are many, many research applications, including mine, that are still not possible with approved means of accessing data. This essentially means that if you want to understand the collaborative nature of a specific field or follow the trajectory of an idea across disciplines, a reference librarian can’t help you. Instead, you have to become a felon.
What’s missing from the news articles about Swartz’s arrest is a realization that the methods of collection and analysis he’s used are exactly what make companies like Google valuable to their shareholders and their users. The difference is that Google can throw the weight of its name behind its scrapers, just as my former boss used his name to set the terms with those publishing his work.
Aaron and the other “hackers and thieves” like him don’t have that option. But their work is no less important–they are collecting and organizing information in order to ask deep questions about the nature of academic discourse. Unfortunately for most, the structure of the publishing industry and the laws that surround creative works prevent these questions from being asked, at least without taking sometimes substantial risks.
It shouldn’t and doesn’t have to be this way but there are at least two main issues holding back progress:
First, as a society we’ve forgotten the Jeffersonian ideal that intellectual property laws should enable and encourage the spread of ideas and creative pursuits rather than lock them away. Many have fought for a return to this vision; however, the prospects for such change seem dim. If there’s anywhere this idea should still have a fighting chance, it’s within the walls of universities.
However, it is this most basic failure, our inability to create a rational set of intellectual property laws, that necessitates the creation of things like JSTOR. We shouldn’t need it in the first place. Nor should anyone curious enough to ask questions as big as Aaron’s ever need to break JSTOR or the law to find answers.
We should offer people with big questions more than a trip to jail–we should celebrate their willingness to explore our collective intellectual heritage. Universities should take the lead in building the platforms needed to support such inquiry. It is an embarrassment that JSTOR is the best the academy has to offer.
But this leads to the second and perhaps more fundamental problem: journals are only partly about communicating. They’re also about controlling academic discourse. The editorial power held by journals and those that run them (quite different from those that own them) shapes most academic careers and the very structure of disciplines. It’s almost certain that pursuing new forms of collaboration and communication will reshape these power structures–sometimes subtly, sometimes not. That’s the nature of change.
Change, however, doesn’t come easily within academic communities. It should be no surprise that universities have done far more to free the content of their courses than they have the content of their publications. The former has economic value; the latter, however, holds the keys to the academy itself.
This conservatism is at least in part responsible for why, despite the new possibilities offered by the web, most scholarly work is still published as though it were 1580. It’s also responsible for allowing a handful of powerful corporations to gate access to this knowledge and make authors pay for the privilege of signing away rights to their own work.
Sir Tim Berners-Lee invented the web to solve this very problem. Twenty years later it allows us to do almost everything imaginable–except get unfettered access to scholarly communication.
It is not technology that holds us back.
Aaron’s arrest should be a wake-up call to universities–evidence of how fundamentally broken this core piece of their architecture remains despite decades of progress in advancing communication and collaboration.
The MIT staff who called the FBI would have been served better by calling the chancellor to ask, “How have we created a system that forces 25-year-olds to sneak around in the basement, hiding hard drives in closets in order to ask basic and important questions about our work? Can’t we do better?”
A version of this essay originally appeared on Webb’s blog.