Wednesday, August 13, 2008

Using Hadoop

Since I've gotten back from OSCON, I've had a chance to use Hadoop at work. For those who aren't familiar with it, Hadoop is an open source framework for implementing MapReduce jobs.

There are plenty of Hadoop tutorials around the web, so I won't do any of the basic intro stuff. Instead, I wanted to write about a few things I didn't find all that easily.

Most of the Hadoop documentation talks about dealing with input where your records are on a single line (like an Apache access log). From Googling, reading the documentation, and our own experience, we have found that Hadoop works just fine with multi-line records. We're using Hadoop to process XML, specifically a dump from Wikipedia. The dump is a single 33 GB file, where there is a root tag and then several million child tags (representing Wikipedia pages). Using some code I found on the Hadoop core-user mailing list, we can arrange for each mapper call to get the XML for one child node (i.e., one Wikipedia page). This is nice, because the XML for a single page is relatively small. We then use JDOM to deal with the contents of the individual pages.
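To give a concrete feel for the mapper side, here's a minimal sketch of roughly how that looks. It assumes the input format from the mailing list hands each map() call the text of one complete <page> element as its value; the PageMapper class name, the "title" child element, and the count-by-title output are illustrative stand-ins rather than our actual job.

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.jdom.Document;
    import org.jdom.Element;
    import org.jdom.JDOMException;
    import org.jdom.input.SAXBuilder;

    // Sketch of a mapper that receives the XML for a single Wikipedia
    // <page> element as its value and parses it with JDOM.
    public class PageMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            try {
                // The value is one complete <page>...</page> fragment,
                // so it's small enough to parse in memory.
                Document doc = new SAXBuilder().build(new StringReader(value.toString()));
                Element page = doc.getRootElement();
                String title = page.getChildText("title");
                if (title != null) {
                    // Emit whatever per-page data the job needs;
                    // counting pages by title is just a stand-in here.
                    output.collect(new Text(title), ONE);
                }
            } catch (JDOMException e) {
                // Skip fragments that don't parse rather than failing the task.
                reporter.incrCounter("PageMapper", "BAD_XML", 1);
            }
        }
    }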

We are using HDFS to store our input and output. By default, it chops files into 64 MB chunks, which get shipped to the mappers so that the file can be processed in parallel. One thing I was concerned about was how records that span a split boundary would be handled. So far, we haven't seen any issues, and this answer in the Hadoop FAQ seems to indicate that records spanning a split will be handled correctly. It's possible that those records get dropped; it would be hard for us to tell at this point. But the good news for us is that even if they were, it wouldn't affect the data we're trying to collect much.
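As an aside, it's easy to see how HDFS has carved up the dump without leaving Java. This is a small sketch against the standard FileSystem API (the BlockInfo class name is mine); it prints the block size and a rough block count for whatever path you pass in:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Small utility: report roughly how many blocks HDFS is using for a file.
    public class BlockInfo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            FileStatus status = fs.getFileStatus(new Path(args[0]));
            long blockSize = status.getBlockSize(); // 64 MB unless overridden
            long length = status.getLen();
            long blocks = (length + blockSize - 1) / blockSize;

            System.out.println(args[0] + ": " + length + " bytes in roughly "
                    + blocks + " blocks of " + blockSize + " bytes each");
        }
    }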

As for our setup and usage: we have 5 machines in our cluster, most of which are run-of-the-mill dual-core Intel machines running Linux. The jobs we're running on the 33 GB XML file take around 45 minutes, which seems pretty fast to me.
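For completeness, here's a sketch of how a job like this gets wired up and submitted with the old org.apache.hadoop.mapred API. The WikipediaJob class name is made up, IdentityReducer is standing in for a real reducer, and you'd plug in whichever XML input format you pulled from the mailing list:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    // Rough sketch of a driver that wires the pieces together and
    // submits the job to the cluster.
    public class WikipediaJob {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WikipediaJob.class);
            conf.setJobName("wikipedia-pages");

            // Input is the XML dump in HDFS; output is a fresh HDFS directory.
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // Plug in the XML input format from the mailing list here, e.g.:
            // conf.setInputFormat(XmlInputFormat.class);

            conf.setMapperClass(PageMapper.class);      // the mapper sketched above
            conf.setReducerClass(IdentityReducer.class); // placeholder reducer
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            JobClient.runJob(conf); // blocks until the job finishes
        }
    }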