Wednesday, August 13, 2008

Using Hadoop

Since getting back from OSCON, I've had a chance to use Hadoop at work. For those who aren't familiar with it, Hadoop is an open-source framework for implementing MapReduce jobs.

There are plenty of tutorials on Hadoop around the web, so I'll skip the basic introduction. Instead, I want to write about a few things I couldn't find easily.

Most of the Hadoop documentation deals with input where each record sits on a single line (like an Apache access log). Between Googling, reading the documentation, and experimenting, we've found that Hadoop works just fine with multi-line records too. We're using Hadoop to process XML, specifically a dump from Wikipedia. The dump is a single 33 GB file with one root tag and several million child tags (one per Wikipedia page). Using code I found on the Hadoop core-user mailing list, each mapper receives the XML for one child node (i.e., one Wikipedia page). This is nice, because the XML for a single page is relatively small. We then use JDOM to work with the contents of the individual pages.
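The mailing-list code itself is Java and tied to Hadoop's APIs, but the underlying idea is easy to sketch: scan the stream for `<page>...</page>` delimiters, emit each page as its own small record, and hand that record to an XML parser (JDOM in our case; Python's `xml.etree` stands in below). This is a minimal illustration, not the actual mailing-list code, and the tag names and helper are my own:

```python
import io
import xml.etree.ElementTree as ET

START_TAG = "<page>"
END_TAG = "</page>"

def iter_records(stream, start_tag=START_TAG, end_tag=END_TAG, chunk_size=4096):
    """Yield each <page>...</page> record from a character stream.

    Same idea as a start/end-tag record reader: find the start tag,
    accumulate until the matching end tag, emit that slice as one record.
    Assumes the start tag appears literally (no attributes), which holds
    for the Wikipedia dump's <page> elements.
    """
    buf = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        while True:
            start = buf.find(start_tag)
            if start == -1:
                # keep only a tail that could hold a partial start tag
                buf = buf[-(len(start_tag) - 1):]
                break
            end = buf.find(end_tag, start)
            if end == -1:
                buf = buf[start:]  # wait for more data
                break
            yield buf[start:end + len(end_tag)]
            buf = buf[end + len(end_tag):]

# A toy "dump": one root tag wrapping two page records
wiki = ("<mediawiki><page><title>A</title></page>"
        "<page><title>B</title></page></mediawiki>")
pages = list(iter_records(io.StringIO(wiki)))
titles = [ET.fromstring(p).findtext("title") for p in pages]
```

Because each emitted record is a complete, well-formed fragment, the per-page parse stays cheap no matter how large the overall file is.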

We are using HDFS to store our input and output. By default, HDFS splits files into 64 MB chunks, which are shipped to the mappers so the file can be processed in parallel. One thing I was concerned about was how records that span a split boundary would be handled. So far we haven't seen any issues, and this answer in the Hadoop FAQ seems to indicate that records spanning a split are handled correctly. It's possible that those records get dropped; it would be hard for us to tell at this point. But the good news for us is that even if they were, it wouldn't affect the data we're trying to collect much.
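The convention behind that FAQ answer (this is how Hadoop's line-based readers behave; the sketch below is illustrative Python, not Hadoop code) is that every reader except the first skips ahead to the first record boundary inside its split, and every reader keeps going past the end of its split to finish the last record it started. Under that convention a record straddling a boundary is read exactly once, by the split in which it begins:

```python
def records_for_split(data, split_start, split_end, delim="\n"):
    """Return the records a reader for [split_start, split_end) would emit.

    Every reader except the first skips the (possibly partial) record that
    began in the previous split; every reader reads past split_end until it
    finishes the record it started.
    """
    pos = split_start
    if split_start > 0:
        nl = data.find(delim, split_start)
        if nl == -1:
            return []  # no record begins in this split
        pos = nl + len(delim)
    records = []
    while pos < split_end:
        nl = data.find(delim, pos)
        if nl == -1:
            records.append(data[pos:])
            break
        records.append(data[pos:nl])
        pos = nl + len(delim)
    return records

data = "alpha\nbravo\ncharlie\ndelta\n"
# Cut the "file" in the middle of "charlie"; no record is lost or doubled.
mid = data.index("char") + 2
all_records = (records_for_split(data, 0, mid)
               + records_for_split(data, mid, len(data)))
```

Whether the XML record reader we pulled from the mailing list implements this convention for start/end tags (rather than newlines) is exactly the part we can't easily verify, which is why we're hedging above.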

As for our setup: we have 5 machines in our cluster, most of which are run-of-the-mill dual-core Intel machines running Linux. The jobs we're running on the 33 GB XML file take around 45 minutes, which seems pretty fast to me.
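For a rough sense of what "pretty fast" means, a back-of-the-envelope calculation from the numbers above:

```python
# Rough aggregate throughput for the job described above
file_gb = 33    # size of the Wikipedia dump
minutes = 45    # approximate wall-clock job time
machines = 5    # cluster size

mb_per_sec = file_gb * 1024 / (minutes * 60)  # aggregate MB/s across the cluster
per_machine = mb_per_sec / machines           # MB/s per machine
```

That works out to roughly 12.5 MB/s aggregate, or about 2.5 MB/s per machine, and that figure includes the XML parsing and JDOM work, not just raw I/O.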

12 comments:

dontcare said...

you may want to look at vtd-xml, the latest and most advanced xml processing api

vtd-xml

sundara rami reddy said...

Thank for sharing this great Hadoop tutorials Blog post. I will use your command when upgrade hadoop.I get a lot of great information here and this is what I am searching for. Thank you for your sharing. I have bookmark this page for my future reference.
Hadoop Training in hyderabad
