Wednesday, August 13, 2008

Using Hadoop

Since getting back from OSCON, I've had a chance to use Hadoop at work. For those who aren't familiar with it, Hadoop is an open source framework for implementing MapReduce jobs.

There are plenty of Hadoop tutorials around the web, so I won't cover the basic intro material. Instead, I want to write about a few things I didn't find all that easily.

Most of the Hadoop documentation talks about input where each record sits on a single line (like an Apache access log). Between Google, the documentation, and our own experience, we have found that Hadoop works just fine with multi-line records. We're using Hadoop to process XML, specifically a dump from Wikipedia. The dump is a single 33 GB file with a root tag and several million child tags, each representing a Wikipedia page. Using code I found on the Hadoop core user mailing list, we set things up so that each call to the mapper gets the XML for one child node (that is, one Wikipedia page). This is nice, because the XML for a single page is relatively small. We then use JDOM to deal with the contents of the individual pages.
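To make the idea concrete, here is a minimal sketch of the tag-scanning approach in plain Java. It is not the mailing-list code and it is not a Hadoop InputFormat. The class name `XmlRecordSketch`, its methods, and the `<page>` tag are all assumptions for illustration. It also assumes the record tag is never nested inside itself and that the first character of the tag does not appear again inside the tag, which is true for something like `<page>`.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: hypothetical names, no Hadoop dependencies.
class XmlRecordSketch {

    // Reads characters until `match` has been seen in full. When `out` is
    // non-null, every character read is appended to it. Returns false if the
    // input ends first. The simple restart logic relies on the first
    // character of `match` not occurring again inside `match`.
    static boolean readUntil(Reader in, String match, StringBuilder out) throws IOException {
        int matched = 0;
        int c;
        while ((c = in.read()) != -1) {
            if (out != null) {
                out.append((char) c);
            }
            if (c == match.charAt(matched)) {
                matched++;
                if (matched == match.length()) {
                    return true;
                }
            } else {
                matched = (c == match.charAt(0)) ? 1 : 0;
            }
        }
        return false;
    }

    // Returns each startTag...endTag block as one multi-line record.
    static List<String> extractRecords(Reader in, String startTag, String endTag) throws IOException {
        List<String> records = new ArrayList<>();
        while (readUntil(in, startTag, null)) {
            StringBuilder record = new StringBuilder(startTag);
            if (!readUntil(in, endTag, record)) {
                break; // input ended inside a record, so the partial record is dropped
            }
            records.add(record.toString());
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        String xml = "<root>\n<page>\n  <title>A</title>\n</page>\n"
                   + "<page>\n  <title>B</title>\n</page>\n</root>";
        for (String record : extractRecords(new StringReader(xml), "<page>", "</page>")) {
            System.out.println("--- one record, as a mapper would see it ---");
            System.out.println(record);
        }
    }
}
```

Each string that comes back is small enough to hand to an XML parser on its own, which is the property that makes per-page processing practical.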

We are using HDFS to store our input and output. By default, HDFS chops files into 64 MB chunks, which get shipped to the mappers so that the file can be processed in parallel. One thing I was concerned about was how records that span the splits would be handled. So far we haven't seen any issues, and an answer in the Hadoop FAQ seems to indicate that records spanning a split will be handled. It's still possible that those records get dropped, and it would be hard for us to tell at this point. The good news for us is that it wouldn't affect the data we're trying to collect much.
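One common convention for avoiding dropped or duplicated records at split boundaries can be sketched as follows. This is an assumption about how a reader could behave, not a description of the code we are running or of Hadoop internals. A split "owns" every record whose start tag begins inside its byte range, and the reader keeps reading past the end of the split to finish the last record it owns. The class `SplitBoundarySketch` and its methods are hypothetical names, and an in-memory byte array stands in for the file.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: hypothetical names, no Hadoop dependencies.
class SplitBoundarySketch {

    // Index of the first occurrence of `pattern` in `data` at or after `from`, or -1.
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    // Returns the records owned by the split [splitStart, splitEnd): those
    // whose start tag begins inside the range. The last one may end beyond
    // splitEnd, so the reader deliberately reads past the boundary.
    static List<String> readSplit(byte[] data, int splitStart, int splitEnd,
                                  String startTag, String endTag) {
        byte[] start = startTag.getBytes(StandardCharsets.UTF_8);
        byte[] end = endTag.getBytes(StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        int pos = splitStart;
        while (true) {
            int begin = indexOf(data, start, pos);
            if (begin < 0 || begin >= splitEnd) {
                break; // the next record, if any, belongs to a later split
            }
            int close = indexOf(data, end, begin + start.length);
            if (close < 0) {
                break; // unterminated record at the end of the file
            }
            int stop = close + end.length; // may be larger than splitEnd
            records.add(new String(data, begin, stop - begin, StandardCharsets.UTF_8));
            pos = stop;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "<r><page>aaaa</page><page>bbbb</page><page>cc</page></r>"
                .getBytes(StandardCharsets.UTF_8);
        int cut = 28; // falls in the middle of the second record
        System.out.println(readSplit(data, 0, cut, "<page>", "</page>"));
        System.out.println(readSplit(data, cut, data.length, "<page>", "</page>"));
    }
}
```

With this rule, the reader for the second split starts in the middle of a record, skips ahead to the next start tag, and leaves the straddling record to the first split, so every record is read exactly once no matter where the boundary lands.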

As for our setup and usage: we have 5 machines in our cluster, most of which are run-of-the-mill dual-core Intel machines running Linux. The jobs we're running on the 33 GB XML file take around 45 minutes, which seems pretty fast to me.

41 comments:

anon_anon said...

you may want to look at vtd-xml, the latest and most advanced xml processing api

vtd-xml

mareddyonline said...

Thank for sharing this great Hadoop tutorials Blog post. I will use your command when upgrade hadoop.I get a lot of great information here and this is what I am searching for. Thank you for your sharing. I have bookmark this page for my future reference.
Hadoop Training in hyderabad

Unknown said...

Nice piece of article you have shared here, my dream of becoming a hadoop professional become true with the help of Hadoop Course in Chennai, keep up your good work of sharing quality articles.

big data training in velachery | hadoop training chennai velachery | hadoop training institute in t nagar

Anonymous said...

Really it was an awesome article...very interesting to read..You have provided an nice article....Thanks for sharing.

Big Data Hadoop Training In Chennai | Big Data Hadoop Training In anna nagar | Big Data Hadoop Training In omr | Big Data Hadoop Training In porur | Big Data Hadoop Training In tambaram | Big Data Hadoop Training In velachery
