I should probably put a massive disclaimer on this post noting that this is just a crackpot idea I came up with. I haven’t actually tested it (nor do I plan to), and there are probably a ton of real-world issues with this approach, both technical and legal. So, yeah – don’t try this at home.
A simplistic refresher on compression
(Editor’s note: as my very smart commenters have pointed out, actually *compressing* data using Twitter doesn’t make any sense. This post is about *encoding* data using Twitter. This part is just a refresher on the basic concepts, so feel free to skip ahead if you already know this stuff.)
It’s been a couple of years since my computer science days, but I do remember the basics of a simple compression algorithm. First, remember that all digital files are made up of a series of 1’s and 0’s. Let’s take this pattern as an example:
0010 0100 0001 0000 0100 0001 001
The key to this compression algorithm is to find long repeating patterns, and use a lookup table to replace them with shorter ones. Let’s take another look at our example, with spaces inserted to highlight the patterns:
001 001 000001 000001 000001 001
Now, let’s make a quick lookup table:

| Pattern | Replacement |
|---------|-------------|
| 001     | A           |
| 000001  | B           |
We can now represent those bits as follows:

AABBBA
And there we go! Our 27 bits were replaced with 6 characters, and even adding in the extra space for the lookup table, it’s still significantly smaller.
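The substitution scheme above can be sketched in a few lines of Python. This is illustrative only – a real compressor would also have to account for storing the lookup table alongside the output:

```python
# Toy pattern-substitution encoder: greedily replace known bit patterns
# with their shorter codes, trying the longest patterns first so that
# "000001" wins over its prefix "001".
def encode(bits, table):
    out = []
    i = 0
    patterns = sorted(table, key=len, reverse=True)  # longest match first
    while i < len(bits):
        for pattern in patterns:
            if bits.startswith(pattern, i):
                out.append(table[pattern])
                i += len(pattern)
                break
        else:
            raise ValueError("no known pattern at position %d" % i)
    return "".join(out)

table = {"001": "A", "000001": "B"}
print(encode("001001000001000001000001001", table))  # -> AABBBA
```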
Of course, this is just for demo purposes only – visit your local library to learn how the big boys do it.
So what does this have to do with Twitter?
Typically, compression algorithms have been used to transform a fixed amount of local information into a smaller amount of local information. However, Twitter gives us something entirely different: a non-local store of information. By encoding our local information into Twitter’s non-local store, we can massively decrease the amount of local information we need to keep track of.
Let’s take a look at our old example lookup table:
Now, let’s add a new column, where instead of the replacement being a letter, let’s pull in two random Tweets from the public timeline:
| Pattern | Replacement | Tweet         |
|---------|-------------|---------------|
| 001     | A           | I am bored    |
| 000001  | B           | Good morning! |
Using the same approach as before, let’s encode the original bits using our new Tweet replacement scheme:
AABBBA –> I am boredI am boredGood morning!Good morning!Good morning!I am bored
Wow…that really kinda sucks. But, stick with me here for a second. There’s an important point that we’re missing here, and that is the fact that Twitter assigns a sequential ID to each Tweet. Let’s pretend for a second that the public timeline looked like this:
| ID    | Tweet         |
|-------|---------------|
| 10000 | I am bored    |
| 10001 | I am bored    |
| 10002 | Good morning! |
| 10003 | Good morning! |
| 10004 | Good morning! |
| 10005 | I am bored    |
Aha! Now we can do something interesting. Instead of replacing the letters with Tweets, we can instead assign a starting ID, and the number of Tweets to read:
AABBBA –> 10000-6
In other words, start at Tweet ID# 10000, and read 6 sequential Tweets. With our simple example, it doesn’t look like you’re saving much space, but hypothetically you could replace thousands and thousands of 1’s and 0’s with a starting ID and the number of Tweets to read.
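Decoding a “start ID + count” pointer is trivial once you have the Tweet store. In this minimal sketch, a plain dict stands in for the public timeline; every ID and Tweet is made up to match the toy example above:

```python
# Hypothetical tweet store: Tweet ID -> text (stands in for the timeline).
tweets = {
    10000: "I am bored",
    10001: "I am bored",
    10002: "Good morning!",
    10003: "Good morning!",
    10004: "Good morning!",
    10005: "I am bored",
}

# Reverse lookup table from the earlier example.
symbol_for = {"I am bored": "A", "Good morning!": "B"}

def decode(start_id, count):
    """Read `count` sequential Tweets and map each back to its symbol."""
    return "".join(symbol_for[tweets[start_id + i]] for i in range(count))

print(decode(10000, 6))  # -> AABBBA
```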
Kevin – you are an idiot. The odds of that exact pattern coming up are a bazillion to one!
I know what you’re thinking, and you’re right. I haven’t done the math, but I’m guessing the probability of a series of random Tweets appearing in sequential order that exactly matches the pattern we need for encoding is a number very close to zero. But wait, there’s more!
Remember, the beauty of Twitter is that each Tweet has a large amount of information attached to it. We can use as much or as little of that info to greatly increase the probability of finding a matching pattern. For example, let’s take our table, but instead of using the full text of the Tweet, we only use the first letter of each Tweet:
| ID    | First letter of Tweet |
|-------|-----------------------|
| 10000 | I                     |
| 10002 | G                     |
Or, let’s look at the number of letters in each Tweet:
| ID    | Length of Tweet |
|-------|-----------------|
| 10000 | 10              |
| 10002 | 13              |
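Any such attribute can serve as a bit source, as long as the encoder and decoder agree on the mapping. Here’s one invented example: call a Tweet a 1 if its text has an odd number of characters, and a 0 if even:

```python
# Derive a bit string from Tweet metadata rather than full text.
# The odd/even-length rule is made up purely for illustration.
def bits_from_lengths(tweet_texts):
    return "".join(str(len(text) % 2) for text in tweet_texts)

# "I am bored" has 10 characters (even -> 0),
# "Good morning!" has 13 (odd -> 1).
print(bits_from_lengths(["I am bored", "Good morning!"]))  # -> 01
```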
And these are just patterns I’m using as examples to make it easy for us humans to read. A computer armed with a neural net could easily discover many more patterns.
And remember, there are a massive number of Tweets created each day. I just pulled the latest ID from the Twitter public timeline, and assuming they are created in sequential order, that means there are a total of 8,156,003,416 Tweets that have been created! With that much information, the odds of being able to find a matching pattern suddenly become much more reasonable.
There are other ways to lower the odds as well. For example, you wouldn’t necessarily need to find an exact pattern that is thousands of Tweets long. Let’s say that you needed to find a pattern that is 50 Tweets long. You could break it up into smaller chunks, like this:

ID1-6, ID2-19, ID3-25

(where each IDn stands for the starting Tweet ID of that chunk)
Now instead of 50 Tweets in a row, we only needed to find a sequence of 6 Tweets, another sequence of 19 Tweets, and a third sequence of 25 Tweets.
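The chunking idea can be roughly sketched as a greedy search: match the longest prefix of the needed symbol sequence that appears as a consecutive run somewhere in the timeline, emit a (start ID, count) pointer, and repeat on the remainder. The timeline below is a made-up list of (Tweet ID, symbol) pairs:

```python
# Greedy chunked encoder: break `symbols` into the fewest pointers we can
# find by always grabbing the longest matching prefix in the timeline.
def chunk_encode(symbols, timeline):
    ids = [tweet_id for tweet_id, _ in timeline]
    stream = "".join(symbol for _, symbol in timeline)
    chunks = []
    while symbols:
        for n in range(len(symbols), 0, -1):  # longest prefix first
            pos = stream.find(symbols[:n])
            if pos != -1:
                chunks.append((ids[pos], n))  # (start Tweet ID, count)
                symbols = symbols[n:]
                break
        else:
            raise ValueError("some symbol never appears in the timeline")
    return chunks

timeline = [(10000 + i, s) for i, s in enumerate("AABBBAAB")]
print(chunk_encode("AABBBA", timeline))  # -> [(10000, 6)]
```

A real search over billions of Tweets would need something far smarter than a linear scan, but the shape of the output – a short list of (start ID, count) pairs – is the whole point.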
And remember – Twitter is pumping out more and more information each day. This dynamic quality to Twitter means that a new pattern could be discovered that would replace an old pattern. For example, maybe for the above example, a new pattern was found that contained all 50 Tweets in one sequence! You could then replace the original encoding:

ID1-6, ID2-19, ID3-25

…with this encoding:

ID4-50

(again, with each IDn standing for some starting Tweet ID)
OK – so theoretically it’s possible to encode bits by using patterns found in Twitter. We already have .mp3 files to play songs, so why do we need this?
#1 – Instead of having 10,000 3 MB files filling up your hard drive, you could fit your entire music collection into one text file that is 10,000 lines long.
#2 – There’s another very, very good reason for encoding music (or other types of media) like this…but let’s just say that the RIAA would probably not be terribly happy about an approach like this. ‘Cause it’s pretty hard to sue people for distributing music if all they are doing is talking about what they are eating for lunch on a massively popular public site.
What do you guys think of this idea? Completely crazy? Or something that might actually be possible (especially once Twitter opens up their API firehose)? Let me know in the comments or follow me on Twitter at @astartupaday.