Tuesday, September 8, 2015

Common Crawl

Have you heard of the Common Crawl?

It's a non-profit that crawls the World Wide Web each month. They then publish a free, public archive of what they've found. The August 2015 crawl contains 1,810,000,000 webpages totalling 145 TB in size. For perspective, 22 years ago, my family's first computer (hereafter, the DellowPC [1]) had 170 MB of hard drive space -- so we'd need 850,000 of those $2,500 [2] machines just to store the August results.

Do you have 2.1 billion dollars to spare? Yeah, me neither. That's a shame, because ever since hearing of the Common Crawl, I've really wanted to experiment with it.

Luckily, you don't actually need a few billion dollars. A quarter and a nickel will do. Because Amazon hosts the dataset in their S3 storage service [3], other Amazon customers can work with the data very cheaply by using Amazon's EC2 service, which rents computers by the hour.
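For the curious, here's a minimal sketch of how you'd pull one chunk of the crawl down from S3. It assumes boto3 with AWS credentials configured, and it assumes the August 2015 crawl sits under the CC-MAIN-2015-35 prefix of the public "commoncrawl" bucket, whose warc.paths.gz manifest lists every chunk -- none of which is spelled out in this post, so treat the paths as illustrative:

    # Sketch only: the bucket and key layout are assumptions based on Common
    # Crawl's published S3 structure, not something documented in this post.
    import gzip

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "commoncrawl"

    # The manifest is a gzipped text file with one chunk key per line.
    s3.download_file(BUCKET, "crawl-data/CC-MAIN-2015-35/warc.paths.gz", "warc.paths.gz")
    with gzip.open("warc.paths.gz", "rt") as manifest:
        chunk_keys = [line.strip() for line in manifest]

    print(len(chunk_keys), "chunks in this crawl")  # on the order of 34,000

    # Each chunk is a gzipped WARC file holding roughly 60,000 pages.
    s3.download_file(BUCKET, chunk_keys[0], "chunk-00000.warc.gz")

Run it from an EC2 instance in the same region as the bucket and the S3-to-EC2 transfer is free, which is a big part of why working with the data is so cheap.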

I wrote a script to read every page in a chunk and spit out a list of the links it contains, sorted by popularity. The Common Crawl is conveniently divided into about 34,000 "chunks", each containing roughly 60,000 web pages, so running the script over every chunk leaves you with 34,000 lists of links. Then you pair up the lists and merge each pair, aggregating their popularity counts. You end up with 17,000 lists of links. Rinse, wash, repeat until you have 1 list of links.
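Here's roughly what those two steps look like as a Python sketch. It's not the actual script -- I'm assuming the warcio package for reading WARC records, a crude regex for link extraction, and in-memory Counters standing in for the on-disk sorted lists a real run over 34,000 chunks would need:

    import re
    from collections import Counter

    from warcio.archiveiterator import ArchiveIterator

    HREF_RE = re.compile(rb'href="(https?://[^"]+)"')  # crude link extraction

    def count_links_in_chunk(path):
        """Step 1: read one ~60,000-page chunk and tally every outgoing link."""
        counts = Counter()
        with open(path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type == "response":  # skip request/metadata records
                    counts.update(HREF_RE.findall(record.content_stream().read()))
        return counts

    def merge_pairs(counters):
        """Step 2: one round of the pairwise merge (34,000 lists -> 17,000 -> ...)."""
        merged = []
        for i in range(0, len(counters), 2):
            pair = counters[i : i + 2]
            merged.append(sum(pair, Counter()))  # Counter addition aggregates the counts
        return merged

    # Rinse, wash, repeat until a single list remains, then sort by popularity.
    counters = [count_links_in_chunk(p) for p in ["chunk-00000.warc.gz"]]  # all 34,000 in practice
    while len(counters) > 1:
        counters = merge_pairs(counters)
    top_links = counters[0].most_common()

In practice each chunk's tally gets written to disk and the merge streams files rather than holding 34,000 Counters in memory, but the shape of the computation -- count, pair up, merge, repeat -- is the same.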

This, of course, requires some computing power. Remember that quarter and nickel? For that amount, you can rent a really beefy computer for an hour. The computer has 36 "cores", each core having about 100x the processing power of a DellowPC.

So what the hell, let's splash out for three of them, for a combined processing power of about 10,800 DellowPCs. All for less than $1 an hour, or about 0.04% the cost of the DellowPC. Running at full tilt (see below), the scripts take about a day and a half to run. Total cost: about $30 -- three machines at thirty cents an hour for roughly 36 hours.


Few things about technology leave me truly awestruck, but this is one of them.

The 1.8 billion pages contain the thoughts of hundreds of millions of people. And they're available for your perusal if you have a few bucks and some minimal scripting skills.

This used to be the sole domain of megacorporations. While you're not likely to build the next Google on the Common Crawl alone, there are plenty of other things you could do with it. It also raises the question of how enforceable things like Europe's "Right to be Forgotten" law will be when any individual can inspect the source documents--all several billion of them--on their own.

For an element of frisson, I inspected my results for references to me or my siblings. We have fairly uncommon names, and since I know that 1.8 billion pages is just a fraction of the entire Web, I wasn't expecting much. Here's what I found:

11 colin dellow
19 jessica dellow
16 lindsay dellow
257 tyler dellow

This is straight up creepy! Even though this is clearly not a complete sample of the web, here is information about me and my siblings--and hundreds of millions of other people--waiting to be analyzed, with no oversight. (All the Common Crawl people know is that you looked at the entire data set. For $30.)

Can you imagine what will be possible in another 22 years? Or what might happen if, say, the NSA's data warehouse ever got breached and released to the world? We're all going to become a lot more comfortable with each other's deep, dark secrets out of necessity, because they're eventually going to be available to everyone.

[1] The DellowPC was an IBM PS/1 with an Intel 80486SX running at 33 MHz with 4 MB of RAM. It looked like this.

[2] $2,500 in 1993 dollars. $3,717 after adjusting for inflation, but things are a bit one-sided as it is.

[3] The total Common Crawl dataset is 2.7 petabytes, encompassing 44 billion pages. Monthly storage cost on S3: $85,000. Amazon donates the storage for free.
