Re: Splitting input file - increasing number of mappers
Sanjay Subramanian 2013-07-06, 15:18
More mappers should generally make it faster.
You can try this parameter.
It controls the input split size and forces more mappers to run.
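The parameter name itself did not survive in this archived message. As an assumption, the setting meant here may be the maximum input split size, which Hadoop exposes as `mapred.max.split.size` in older releases and `mapreduce.input.fileinputformat.split.maxsize` in newer ones. The sketch below is plain Java with no Hadoop dependency. It only mirrors the clamping rule that `FileInputFormat` applies to the split size, and the 16 MB cap and the class name `SplitEstimate` are hypothetical.

```java
// Sketch only: estimates how a smaller maximum split size raises the mapper count.
final class SplitEstimate {
    // Same shape as the rule FileInputFormat applies:
    // the block size clamped between the minimum and maximum split size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Rough number of splits (and so map tasks) for one file, by ceiling division.
    // Assumption: ignores the small slack Hadoop allows on the final split.
    static long estimateSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 64 * mb;   // file size mentioned in the thread
        long blockSize = 64 * mb;  // block size mentioned in the thread

        long defaultSplit = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        long cappedSplit = computeSplitSize(blockSize, 1L, 16 * mb); // hypothetical 16 MB cap

        System.out.println(estimateSplits(fileSize, defaultSplit)); // 1
        System.out.println(estimateSplits(fileSize, cappedSplit));  // 4
    }
}
```

With the default limits a 64 MB file in a 64 MB block yields one split, while a 16 MB cap would yield about four. Whether four mappers actually finish sooner depends on task startup overhead and on how much work each record needs.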
Also, your use case seems like a good candidate for defining a Combiner, because you are grouping keys based on a criterion.
The only gotcha is that Combiners are not guaranteed to run.
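Because the framework may skip the Combiner, the job has to produce the same result with or without it. The sketch below is plain Java rather than Hadoop code, and it uses a hypothetical per-key count as the aggregation. The class name `CombinerSketch` and the sample values are assumptions made for illustration.

```java
// Sketch only: shows that a sum gives the same answer whether or not
// a combiner pre-aggregates the map output.
final class CombinerSketch {
    // Stand-in for both the combiner and the reducer logic:
    // add up the partial counts seen for one key.
    static long sum(long[] values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        long[] mapOutput = {1, 1, 1, 1, 1}; // five records emitted for the same key

        // Path 1: the combiner is skipped, so the reducer sees every value.
        long withoutCombiner = sum(mapOutput);

        // Path 2: the combiner runs on two map-side chunks,
        // so the reducer sees two partial sums.
        long[] partials = { sum(new long[] {1, 1, 1}), sum(new long[] {1, 1}) };
        long withCombiner = sum(partials);

        System.out.println(withoutCombiner == withCombiner); // true
    }
}
```

Operations such as sums and counts are safe to use this way because they are associative and commutative. Logic that depends on seeing all values for a key at once belongs in the reducer only.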
Give these a shot.
From: parnab kumar <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>>
Reply-To: "[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>" <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>>
Date: Saturday, July 6, 2013 12:50 AM
To: "[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>" <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>>
Subject: Splitting input file - increasing number of mappers
I have an input file where each line is of the form:
<URL> <A NUMBER>
URLs whose numbers are within a threshold of each other are considered similar. My task is to group together all similar URLs. For this I wrote a custom Writable in which I implemented the threshold check in the compareTo method. Therefore, when Hadoop sorts, the similar URLs are grouped together. This seems to work fine.
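The approach described above might look roughly like the following. This is a minimal standalone sketch, not the poster's actual class: the name `UrlScore`, the threshold of 5, and the sample URLs are all hypothetical, and a real key would also implement Hadoop's `WritableComparable`, which is left out so the example runs without Hadoop.

```java
// Sketch only: a key whose compareTo treats numbers within a threshold as equal.
final class UrlScore implements Comparable<UrlScore> {
    static final long THRESHOLD = 5; // hypothetical similarity threshold

    final String url;
    final long number;

    UrlScore(String url, long number) {
        this.url = url;
        this.number = number;
    }

    @Override
    public int compareTo(UrlScore other) {
        // Keys within the threshold compare as equal, so the sort places them together.
        if (Math.abs(number - other.number) <= THRESHOLD) {
            return 0;
        }
        return Long.compare(number, other.number);
    }

    public static void main(String[] args) {
        UrlScore a = new UrlScore("http://a.example", 10);
        UrlScore b = new UrlScore("http://b.example", 14);
        UrlScore c = new UrlScore("http://c.example", 18);

        System.out.println(a.compareTo(b)); // 0
        System.out.println(b.compareTo(c)); // 0
        System.out.println(a.compareTo(c)); // -1
    }
}
```

One thing worth checking with a comparison like this is that it is not transitive: here a matches b and b matches c, yet a does not match c. Sorting generally assumes a consistent ordering, so the groups produced can depend on the order in which keys happen to be compared.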
I have the following queries:
1> Since I am relying mostly on the sort feature provided by Hadoop, am I decreasing efficiency in any way? Or, since sorting is what Hadoop does best, am I actually doing the right thing? If this is the right approach, then it seems my job relies mostly on the map task. Therefore, will increasing the number of mappers increase efficiency?
2> My file is no larger than 64 MB, i.e. one Hadoop block, which means no more than one mapper will be invoked. Will splitting the file into smaller pieces increase efficiency by invoking more mappers?
Can someone kindly provide some insight or advice regarding the above?
MS student, IIT kharagpur