Our social-media amnesia
It began with a hashtag — #fitn. On the eve of January’s Republican presidential primary, it seemed that every member of the political press, election observer, and New Hampshirite had adopted #fitn as a sort of quasi-official tag. It was a reference to “First in the Nation,” a long-used political phrase that dates back to the 1920s. As I watched those tweets fly by, it struck me how ubiquitous its shorthand version had become online. Where did the hashtag come from? Who first injected it into the tweet stream? Twitter’s internal search engine, as it turns out, only goes back so far. I fired up Topsy.com, by general consensus the best tweet search tool going today. But I hit the outer limits of Topsy’s archive long before I uncovered my proto-tweet. I asked Twitter HQ. No go. A smallish company, it lacks the resources, they said, to track a hashtag back to its starting point.
My struggle to find the origins of #fitn is not unique. We’re tweeting more than 340 million times a day, conducting a robust public conversation on Twitter. Yet, even on Twitter’s sixth birthday today, we still can’t track it, can’t search it, can’t access our archives. There is no public record. Is that really so much to ask?
Maybe, yes. Consider the technological constraints. Brewster Kahle, who runs the Internet Archive, a non-profit online repository for 150 billion Web pages, told me startups have a hard time being “archive aware.” For them there are more pressing concerns, like integrating servers and avoiding “fail whales.”
Twitter’s internal search tool only reaches back a week or so before you get a note saying that older tweets are not available. Twitter does, to its credit, publish an interface that allows others to pull information from its services. But there’s a built-in cap on how many tweets can be accessed that way. (It changes, but at one recent point it was in the couple-thousand-tweet range.) And so, we’re left with our current status quo: tweets that seem to fall into a black hole. (Twitter declined to speak on record for this piece.)
Who cares, right? These are tweets, after all. Somehow we’ve survived as a culture without recording, say, every phone call we made in the ’80s. But Twitter’s centrality to the political conversation from the U.S. to Egypt has already made it more than mere ephemera. It’s still the early days of the social-media era, and our vantage point is not a particularly good one to decide what’s worth saving and what’s not.
“In the Elizabethan era,” points out Michael Lesk, chair of the Rutgers University Department of Library and Information Science, “plays weren’t saved because only sermons and poetry were considered literature.”
Of course, it’s not really us making the decision about what to save. It’s Twitter and the other big players in the digital communications realm that are making it. For a few years, it was tempting to let the Web’s natural order take care of things. If Twitter was busy helping people produce huge caches of information, Google was helping to make it searchable. Starting in 2009, Google and Twitter had a deal to include a healthy helping of tweets in Google’s real-time search results. But with the rollout of Google+ and, especially, Google’s choice to give special treatment to Google+ posts in its organic search results, Google and Twitter have gone from complementary online players to competitors; their real-time search agreement was allowed to lapse last summer.
So, if private industry isn’t going to save our tweets, maybe the public sector will? About two years ago, Twitter and the Library of Congress signed an agreement whereby the San Francisco company would gift the institution a copy of its tweet archive and the library would manage access to it. But, today, getting that project off the ground is “going to be a while,” according to Library of Congress public relations specialist Jennifer Gavin. Why? “The process of how to serve [the tweet archive] out to researchers while still maintaining the parameters set by our agreement with Twitter is still being worked out,” Gavin says. According to that agreement, access to the archive is primarily for “bona fide” researchers. Bulk download is prohibited, and there’s a six-month delay between the time a tweet goes live on Twitter and when it’s made available through the Library of Congress.
Why might Twitter Inc. be interested in throttling access for the Library of Congress and for search engines? A good guess is money. A recent piece of tech industry news peels back the curtain a bit. Twitter, reported the Daily Mail last week, has sold two years’ worth of tweets to the British social-data research firm DataSift. DataSift is said to have a thousand companies lined up for the tweets, eager to get access for business analysis purposes. That deal builds upon an earlier reseller agreement with the Colorado social-media data provider Gnip. The giant heap of tweets we’ve produced since 2006, when Twitter launched, is potentially the stuff from which great fortunes are made. The more people who have those tweets for free, the less valuable those tweets are.
Imagine an archival spectrum where on one end is a Googlesque stash of all tweets from all time and on the other is something akin to the Svalbard Global Seed Vault on Norway’s remote Spitsbergen island. Our tweets might be locked away, safe and sound, but nobody can get to them unless there’s an emergency. It’s an approach that worries the Internet Archive’s Brewster Kahle. “Access is a key to preservation,” he says. “It’s hard to get people motivated for keeping a dark archive robust.”
But even if we assume perfect searchability, there are still reasons to question a strategy where just Twitter, or just Twitter and the Library of Congress, are expected to maintain the tweet archive. “The Internet Archive is great,” says Kahle as he makes a comparison. “But you don’t want to have just one. The Library of Congress has books, but we don’t require it to have the book.” (The Internet Archive had been interested in getting a copy of the tweet archive from Twitter.)
Even before the Internet taught us about all the great advantages of distribution and decentralization, libraries were practicing it. The Library of Congress has some 35 million volumes. But its role (for everyone other than the United States Congress) is lender of last resort. The Library of Congress assumes a high degree of redundancy; it is a major player, but simply one in a web that includes scores of personal, institutional, academic and other libraries.
There are some efforts, like the distributed update platform Identi.ca or The Locker Project personal data service, that are trying to offer users greater control over their digital information. But right now they’re limited to early adopters.
An alternative, then: What about outsourcing the work of keeping our personal contributions to the public record current and available? Local historical societies might step in to capture the tweet stream of a particular place or event as it happens in real time. Universities might provide students with an archival copy of their college-age tweets. Groups of people with a common interest might choose a provider to capture tweets around a theme in which they maintain a deep interest. With a distributed assortment of stewards of our social data (many of whom, of course, would outsource the technology side of things to tech firms), we’d have not one archive but multiple, redundant mini-archives.
Maybe I’d find my earliest use of the #fitn hashtag in the archives of the New Hampshire Republican Party, or the Union Leader, or Saint Anselm College’s New Hampshire Political Library. It might take a bit of legwork to find a tweet then. But it would be less work than counting on users to archive their own tweets and more productive than leaving it to Twitter to do. And so, we might soon see smaller archivers of all stripes step in, the local libraries of the social web.
Of course, an answer in “personal archiving” raises all sorts of new questions. Would it be necessary, or possible, or even desirable for those services to always be opt-in? Does the friction introduced by having a system of separate mini-caches mitigate the privacy concerns raised by tweets that reach far back? Is that a smart trade-off — limiting the searchability of the Web to maintain some notion of personal privacy — or should we just rip the bandage off and admit that we became fully public, in every sense of the word, the minute we posted something on Twitter?
Complicated, challenging questions. Also complicated and challenging: having an always-on global conversation among hundreds of millions of people. But it falls short of the imaginativeness that is marking this moment in our technological evolution to simply leave all this archival business to just one company to figure out. “And that,” says Kahle, “is no way to run a culture.”