The difference between Google and Aaron Swartz

By Kevin Webb
July 20, 2011

The opinions expressed are his own.

Reading about Aaron Swartz’s recent run-in with the law dredged up all kinds of feelings. I’m a long-time admirer of his work and was saddened to hear of his troubles. At the same time, reading the indictment, I was surprised by the seriousness of the charges and evidence against him.

I was also reminded of my own attempts at similar work, collecting and analyzing journal articles, patents, and various forms of metadata. I’ve lost count of how many hours I’ve spent sitting in basements of academic buildings, breaking federal laws in the pursuit of answers. And I was reminded of my colleagues who still spend their days painstakingly scraping data off the web–sometimes legally, sometimes not–in the name of academic inquiry.

None of us want to break the law. It’s simply that we don’t have a choice.

The mechanisms for sharing academic discourse are broken. They barely even function as systems for connecting interested parties within existing disciplines. Ask just about anyone who spends their time writing or consuming scholarly work and you will hear a litany of complaints about how poorly suited the academic publishing industry is to modern-day collaboration.

I’ve spent most of my professional career just outside of the academy but have seen the failures of these systems firsthand. I formed my opinion on the matter as an undergraduate assistant in a major neuroscience laboratory–building publishing tools to help the lab’s director break copyright law.

His work regularly appeared in and on the cover of major journals. Yet he was in a field that was moving faster than the journals could help facilitate. He took matters into his own hands by publishing the articles on the laboratory’s site, almost always violating the licensing terms of his own work (rights now held by Elsevier or AAAS, not the author). I asked about the legality of what we were doing and was told not to worry. If the journals didn’t like him bending or breaking the law, he’d publish elsewhere and it would be their loss.

As far as I know the publishers understood the bargain and never complained. Unfortunately this sort of non-aggression pact is available only to a select few. Your average untenured neuroscience professor doesn’t have the luxury of pissing off Science or Nature.

But for those of us interested in meta-analysis–these questions about questions that people like Aaron and me are forced to pursue from basement wiring cabinets, scraping large swaths of text from the web–the hobbled and clunky tools for downloading PDFs through research library proxy servers, one poorly OCR’ed page at a time, simply do not work.

That said, it’s unclear what Swartz’s intentions for his scraped JSTOR content were. Some folks, including the FBI, have made the claim he was planning to redistribute the data. Others have pointed to his past research analyzing influence in academic writing. I have no insight into his real intentions; however, I do believe the latter goal is important and likely not possible without breaking the kinds of laws discussed above.

Also, it’s true that JSTOR does now offer a bulk interface for research users. That interface didn’t exist when I was doing my work. But it’s not clear it would have made any difference. There are many, many research applications, including mine, that are still not possible with approved means of accessing data. This essentially means that if you want to understand the collaborative nature of a specific field or follow the trajectory of an idea across disciplines, a reference librarian can’t help you. Instead, you have to become a felon.

What’s missing from the news articles about Swartz’s arrest is a realization that the methods of collection and analysis he used are exactly what make companies like Google valuable to their shareholders and their users. The difference is that Google can throw the weight of its name behind its scrapers, just as my former boss used his name to set the terms with those publishing his work.

Aaron and the other “hackers and thieves” like him don’t have that option. But their work is no less important–they are collecting and organizing information in order to ask deep questions about the nature of academic discourse. Unfortunately for most, the structure of the publishing industry and the laws that surround creative works prevent these questions from being asked, at least without taking sometimes substantial risks.

It shouldn’t and doesn’t have to be this way, but there are at least two main issues holding back progress:

First, as a society we’ve forgotten the Jeffersonian ideal that intellectual property laws should enable and encourage the spread of ideas and creative pursuits rather than lock them away. Many have fought for a return to this vision; however, the prospects for such change seem dim. If there’s anywhere this idea should still have a fighting chance, it’s within the walls of universities.

However, it is this most basic failure, our inability to create a rational set of intellectual property laws, that necessitates the creation of things like JSTOR. We shouldn’t need it in the first place. Nor should anyone curious enough to ask questions as big as Aaron’s ever need to break JSTOR or the law to find answers.

We should offer people with big questions more than a trip to jail–we should celebrate their willingness to explore our collective intellectual heritage. Universities should take the lead in building the platforms needed to support such inquiry. It is an embarrassment that JSTOR is the best the academy has to offer.

But this leads to the second and perhaps more fundamental problem: journals are only partly about communicating. They’re also about controlling academic discourse. The editorial power held by journals and those that run them (quite different from those that own them) shapes most academic careers and the very structure of disciplines. It’s almost certain that pursuing new forms of collaboration and communication will reshape these power structures–sometimes subtly, sometimes not. That’s the nature of change.

Change, however, doesn’t come easily within academic communities. It should be no surprise that universities have done far more to free the content of their courses than they have the content of their publications. The former has economic value; the latter, however, holds the keys to the academy itself.

This conservatism is at least in part responsible for why, despite the new possibilities offered by the web, most scholarly work is still published as though it were 1580. It’s also responsible for allowing a handful of powerful corporations to gate access to this knowledge and make authors pay for the privilege of signing away rights to their own work.

Sir Tim Berners-Lee invented the web to solve this very problem. Twenty years later it allows us to do almost everything imaginable–except get unfettered access to scholarly communication.

It is not technology that holds us back.

Aaron’s arrest should be a wake-up call to universities–evidence of how fundamentally broken this core piece of their architecture remains despite decades of progress in advancing communication and collaboration.

The MIT staff who called the FBI would have been served better by calling the chancellor to ask, “How have we created a system that forces 25-year-olds to sneak around in the basement, hiding hard drives in closets in order to ask basic and important questions about our work? Can’t we do better?”

A version of this essay originally appeared on Webb’s blog.

3 comments


Swartz might have served everyone better, including himself, if he had asked those questions up front and tried to get them addressed, and maybe even solved, in a cooperative and open public manner. He seemed capable of leading an effort on this.

If the charges are factual, Swartz decided that his own agenda was worth breaking the law for, and worth causing big problems for other users of the shared resources he wanted for his own purposes. It is claimed that he didn’t intend to redistribute the data. If so, his ends were to find information out only for himself or glean and present what he wanted, giving others what he judged they needed. So he put his own investigative goals and his status as a knowledge seeker as more important than that of others. If he really needed access to the data, there are legal methods to get it. Maybe he could have put that creative mind to work to do it instead of resorting to criminal methods. He got caught knowingly breaking the law for his own self-justified ends instead of trying to pursue legal means, which perhaps would have required him to exercise patience and perseverance for his data binge. Taking precipitous shortcuts, one can wind up falling. He took his chances and now he’ll have to face whatever consequences he was incurring yet trying to avoid from our legal system.

Posted by jkendall

“Swartz might have served everyone better, including himself, if he had asked those questions up front and tried to get them addressed, and maybe even solved, in a cooperative and open public manner.” But what makes you think he hasn’t tried this yet? I believe this is the key. He’s a known activist who has been pursuing precisely this with no success; therefore sometimes civil disobedience is the only way left… and a good signal that shows us that the way is broken and dangerous. I challenge you to ask those questions up front and let us all know the answer. I’m afraid I can imagine it. Can’t you?
On the other hand, you are only supposing that what he pursued was only for his own advantage. Well, you are free to think so, but knowing Aaron, his commitment to ethical behavior and the personal risks he has taken (and ethics does not have to have a relationship with obeying the law) does suggest that he was really working to help you and mankind in general. Again, only a supposition worth taking into account.

Posted by oneras

I am not entirely convinced that this is a correct analysis. First, there is some speculation that this was intended as a test case (tell me who your friends are and I’ll tell you whom you’re trying to piss off). But what I am concerned about is that the issue is not so much the collection of data as the methods deployed in order to achieve it. The copyright issue itself would be a matter for JSTOR to resolve–if JSTOR said that Swartz complied with their policies (even if the agreement was reached after the fact, as it has been), the feds would not be able to pursue the case on that basis alone. The hitching post for that horse is that the method of obtaining the information, in itself, was illegal–effectively tantamount to a break-in. Note, specifically, that Swartz is not being charged with copyright violations. He’s not being charged with conspiring to distribute the information. He’s being charged as a hacker, not as a pirate.

What you’re alleging concerning your former employer amounts to standard operating practice in academia. In fact, many academics were significantly annoyed by the Kinko decision because it essentially allowed journal publishers to control academic and educational distribution of their work. They could no longer include even their own published papers in readers for their classes without paying royalties (usually highly overpriced) to the publishers. So, instead of using the published papers, many authors simply used earlier internal drafts for distribution. They use these in readers and they share them with colleagues who use them in their readers. They place their own papers on their own websites and, as far as I know, no author, no matter what his academic status, has been challenged by a journal publisher with respect to such placement (although many authors who consider it to be “an abundance of caution” simply place links to the pay-walled copies in their profiles).

It would be a very interesting legal challenge if a publisher tried to go after an author of an article in one of its publications. Even if the suit had solid legal standing (which is by no means obvious), the publisher would risk a major boycott of its journals by the academic community. Imagine what happens if even one such academic publication loses all its submissions. No publisher is going to risk that. And this is exactly what you found in the response to your question.

But I am still a bit puzzled about this entire passage: “Yet he was in a field that was moving faster than the journals could help facilitate. He took matters into his own hands by publishing the articles on the laboratory’s site, almost always violating the licensing terms of his own work (rights now held by Elsevier or AAAS, not the author).” If the work had already been published in the journal, it was already available to academia and there would be no need to accelerate access. If it was in submission stages, then the manuscripts would fall into that gray area of copyright where licensing terms are of questionable validity–does the copyright belong to the journal or to the author? Certainly the RESULTS are open–in fact, the author is free to re-package the content of the paper and distribute it on his own–it’s not like the publisher can charge him with plagiarism of his own work. And if the publisher decided to sever ties prior to publication, as the author said, it was THEIR loss–the article or one resembling it in content would go to a different publication. And, as far as I know, no journal or its parent company holds rights to FUTURE publications, so if something was yet to be submitted, certainly the author held the copyright and had every right to publish it as he saw fit. In any case, claiming that self-distribution was “illegal” is a bit of a stretch, at best. And I did not even get into another hairy legal area–fair use.

Posted by ShadowFox

[...] domain aren’t gong away. Here’s Demand Progress’ webpage on the issue, and Kevin Webb has interesting thoughts on the nature of information and discourse [...]

[...] commentators have not been so restrained. Kevin Webb, in a Reuters blog post said, “None of us want to break the law. It’s simply that we don’t have a [...]

[...] 1. Webb K. The difference between Google and Aaron Swartz. MediaFile Blog, Reuters. July 21, 2011. Available from: http://blogs.reuters.com/mediafile/2011/ 07/20/the-difference-between-google-and- aaron-swartz/ [...]
