*I should probably put a massive disclaimer on this post noting that this is just a crackpot idea I came up with, I haven’t actually tested this (nor do I plan to) and there are probably a ton of real-world issues with this approach, both technical as well as legal. So, yeah – don’t try this at home. *

**A simplistic refresher on compression**

*(Editors note: as my very smart commenters have pointed out, actually *compressing* data using Twitter doesn’t make any sense. This post is about *encoding* data using Twitter. This part is just a refresh on the basic concepts, feel free to skip ahead if you already know this stuff.)*

It’s been a couple of years since my computer science days, but I do remember the basics of an simple compression algorithm. First, remember that all digital files are made up of a series of 1’s and 0’s. Let’s take this pattern as an example:

0010 0100 0001 0000 0100 0001 001

The key to this compression algorithm is to find long repeating patterns, and use a lookup table to replace them with shorter ones. Let’s take another look at our example, with spaces inserted to highlight the patterns:

001 001 000001 000001 000001 001

Now, let’s make a quick lookup table:

Pattern | Replacement |

001 | A |

000001 | B |

We can now represent those bits as follows:

AABBBA

And there we go! Our 27 characters were replaced with 6 characters, and even adding in the extra space for the lookup table, it’s still significantly smaller.

Of course, this is just for demo purposes only – visit your local library to learn how the big boys do it.

**So what does this have to do with Twitter?**

Typically, compression algorithms have been used to transform a fixed amount of local information into a smaller amount of local information. However, Twitter gives use something entirely different: a non-local store of information. By encoding our local information into Twitter’s non-local store, we can massively decrease the amount of local information we need to keep track of.

Let’s take a look at our old example lookup table:

Pattern | Replacement |

001 | A |

000001 | B |

Now, let’s add a new column, where instead of the replacement being a letter, let’s pull in two random Tweets from the public timeline:

Pattern | Replacement | Tweet |

001 | A | I am bored |

000001 | B | Good morning! |

Using the same approach as before, let’s encode the original bits using our new Tweet replacement scheme:

AABBBA –> I am boredIamboredGood morning!Good morning!Good morning!Iambored

Wow…that really kinda sucks. But, stick with me here for a second. There’s an important point that we’re missing here, and that is the fact that Twitter assigns a sequential ID to each Tweet. Let’s pretend for a second that the public timeline looked like this:

ID | Tweet |

10000 | I am bored |

10001 | I am bored |

10002 | Good morning! |

10003 | Good morning! |

10004 | Good morning! |

10005 | I am bored |

Aha! Now we can do something interesting. Instead of replacing the letters with Tweets, we can instead assign a starting ID, and the number of Tweets to read:

AABBBA –> 10000-6

In other words, start at Tweet ID# 10000, and read 6 sequential Tweets. With our simple example, it doesn’t look like you’re saving much space, but hypothetically you could replace thousands and thousands of 1’s and 0’s with a starting ID and the number of Tweets to read.

**Kevin – you are an idiot. The odds of that exact pattern coming up are a bazillion to one!**

I know what you’re thinking, and you’re right. I haven’t done the math, but I’m guessing the probably of a series of random Tweets appearing in sequential order that exactly matches the pattern we need for encoding is a number very close to zero. But wait, there’s more!

Remember, the beauty of Twitter is that each Tweet has a large amount of information attached to it. We can use as much or as little of that info to greatly increase the probability of finding a matching pattern. For example, let’s take our table, but instead of using the full text of the Tweet, we only use the first letter of each Tweet:

ID | First letter of Tweet |

10000 | I |

10001 | I |

10002 | G |

10003 | G |

10004 | G |

10005 | I |

Or, let’s look at the number of letters in each Tweet:

ID | Length of Tweet |

10000 | 10 |

10001 | 10 |

10002 | 13 |

10003 | 13 |

10004 | 13 |

10005 | 10 |

And these are just patterns I’m using as examples to make it easy for us humans to read. A computer armed with a neural net could easily discover many more patterns.

And remember, there are a massive number of Tweets created each day. I just pulled the latest ID from the Twitter public timeline, and assuming they are created in sequential order, that means there are a total of 8,156,003,416 Tweets that have been created! With that much information, the odds of being able to find a matching pattern suddenly become much more reasonable.

There are other ways to lower the odds as well. For example, you wouldn’t necessarily need to find an exact pattern that is thousands of Tweets long. Let’s say that you needed to find a pattern that is 50 Tweets. You could break it up into smaller chunks, like this:

10000-6

20010-19

33990-25

Now instead of 50 Tweets in a row, we only needed to find a sequence of 6 Tweets, another sequence of 19 Tweets, and a third sequence of 25 Tweets.

And remember – Twitter is pumping out more and more information each day. This dynamic quality to Twitter means that a new pattern could be discovered that would replace an old pattern. For example, maybe for the above example, a new pattern was found that contained all 50 Tweets in one sequence! You could then replace the original encoding:

10000-6

20010-19

33990-25

…with this encoding:

510000-50

**So what?**

OK – so theoretically it’s possible to encode bits by using patterns found in Twitter. We already have .mp3 files to play songs, why do we need this?

#1 – Instead of having 10,000 3 MB files filling up your hard drive, you could fit your entire music collection into one text file that is 10,000 lines long.

#2 – There’s another very, very good reason for encoding music (or other types of media) like this…but let’s just say that the RIAA would probably not be terribly happy about an approach like this. ‘Cause it’s pretty hard to sue people for distributing music if all they are doing is talking about what they are eating for lunch on a massively popular public site.

*What do you guys think of this idea? Completely crazy? Or something that might actually be possible (especially once Twitter opens up their API firehose)? Let me know in the comments or follow me on Twitter at @astartupaday.*

*COMPLETELY CRAZY*. In fact, I can mathematically prove to you that your concept cannot possibly work, at all, ever.

If you want to look into it beyond this comment, I suggest you lookup the “Pigeon Hole Principle”, as well as find any nutcase compression algorithm based on finding any piece of data’s numerical representation in the processing of the digits of pi (which is mathematically proven to contain every number provided you expand its digits long enough). You run into the exact same issue you are going to run into.

Simply put:

You can’t use X bits to represent more than at most 2^X different cases.

Let’s set X to 8 for a moment. If I have a compression algorithm that can take *any* data that is 9 bits long and reduce it to 8 bits, then we can clearly say this compression algorithm is as impossible as 2+2 being equal to 3. After all, the compressed data has 2^8 = 256 different ‘pigeon holes’, but you’ve got 2^9 = 512 different data files. You can’t fit 2 pigeons in one hole, or your compression scheme would be arbitrary and a compressed file can be decompressed into 2 different files, which would make it useless.

The key, then, is to prove that a compression scheme cannot compress random data just as well as non-random data, for if you can prove that, you know the pigeon hole principle will apply and you haven’t got a compression scheme at all, just one to reshuffle bits. And, usually, this mapping function actually serves to make the ‘compressed’ data *BIGGER* than the original, as most of these mapping schemes, be it twitter or the digits of pi, contain redundancy.

Something like the scheme you started this post out with can be proven to spectacularly fail on random data. It specifically exploits repetition. Data with no repetition at all should not compress (in fact, to accomodate the table, it would become LARGER. This is a key factor in avoiding the pigeon hole principle – this HAS to be true for any compression algorithm, just like 2 + 2 has to be 4).

In regards to the RIAA trick: That won’t fly either, and there are far easier ways to accomplish this.

Example scheme: Take a movie. XOR it with the compressed complete repository of debian linux. Without the debian linux kernel this data is utter random jibberjabber – in fact, you can take the same seemingly random stream of data and, provided you XOR it with the right stream of other data, get anything you want with that length. And yet, with one particular secondary input source (debian linux repo, which is free and open source), it ‘magically’ turns into some copyrighted movie.

It’s been tried. It’s sufficiently nuanced that a judge will either not understand it and go back to the simple lemma that by downloading this thing you can trivially get the copyrighted movie out of it, and even if a judge does understand the math involved, he’ll see it as a violation of the spirit of the law, which in cases like this is more than enough.

Hence, we arrive back at where I started: This entire post was *COMPLETELY CRAZY*.

“I haven’t done the math”

And if you would, you would find out that, on average, you’d have download 3 MB in tweets to uniquely represent a 3MB mp3 file.

Secondly, you’d have to search an infinite time to find a tweet sequence representing your 3MB file.

Hey Reinier,

This is why I love my blog – smart people taking time out of their day to debunk my crazy ideas! One important thing to note, and I should have clarified this in my post, is that by no means was this intended as a *compression* algorithm- this is an *encoding* algorithm.

The metaphor I should have used is more like this. Think about encoding a secret message in a book by using the first letter of each paragraph. To encode the message “This idea is crazy”, I would need to find a page in a book where the first paragraph starts with “T”, the second starts with “H”, and so on…

Now, if I want to send this message to someone, I just need to tell them where in the book to start. (i.e. War and Peace, page 217, paragraph 3). While the full book of ‘War and Peace’ is much, much larger than my message, I can send my message just by giving these simple instructions.

In my mind, where this idea breaks down is this question:

What is the mathmatical probability that for longer messages you could find a book that has the exact letters laid out in sequential paragraphs?

For an arbitrarily small input (a single letter, for example) the probability is 1. For an arbitrarily large input (20 billion letters), the probability is very close to 0. The real question is, does that probability drop fast enough that encoding anything useful is not practical?

My guess? If you actually did the math, the probability of doing anything useful would likely be very, very small. Small enough to make this whole debate moot.

Your legal argument also probably holds water with most courts – which is why this becomes more of a hypothetical theory rather than something any sane person would set out to do. 🙂

A close cousin is STEGANOGRAPHY, the art of hiding things in plain sight.

http://www.economist.com/node/17248872