parnab kumar 2013-07-06, 07:50
You also need to pay attention to the split boundary, because you don’t
want one line to be split across different mappers. Maybe you can look into a
multi-line input format.
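For reference, a minimal sketch of how a line-oriented reader can keep a line from being shared between two mappers. As far as I know this mirrors what Hadoop's stock line reader does for plain text input: a reader whose split does not begin at byte 0 throws away everything up to the first newline, and every reader runs past the end of its split to finish the last line it started. The names `SplitLines` and `linesForSplit` are made up for illustration, and the "file" is just an in-memory byte array, not HDFS.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Hadoop API: imitates split-aware line reading.
final class SplitLines {

    // Returns the lines that the reader for the split [start, start + length) would emit.
    static List<String> linesForSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;

        // A split that does not start at byte 0 discards its first (possibly partial)
        // line; the reader of the previous split is responsible for it.
        if (start != 0) {
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }

        List<String> lines = new ArrayList<>();
        // Keep reading while the line STARTS at or before the split end,
        // even if the line itself runs past the end of the split.
        while (pos <= end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "a 1\nbb 2\nccc 3\n".getBytes(StandardCharsets.UTF_8);
        // The boundary at byte 6 falls in the middle of the second line.
        System.out.println(linesForSplit(data, 0, 6));
        System.out.println(linesForSplit(data, 6, 9));
    }
}
```

With the boundary at byte 6 falling inside the second line, the first reader returns the first two lines whole and the second reader returns only the third, so no line is lost or seen twice.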
On Jul 6, 2013 10:18 AM, "Sanjay Subramanian" <
[EMAIL PROTECTED]> wrote:
> More mappers will make it faster.
> You can try this parameter
> This will control the input split size and force more mappers to run.
> Also, your use case seems a good candidate for defining a Combiner, because you are
> grouping keys based on a criterion.
> The only gotcha is that Combiners are not guaranteed to be called.
> Give these a shot.
> Good luck
> From: parnab kumar <[EMAIL PROTECTED]>
> Reply-To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Date: Saturday, July 6, 2013 12:50 AM
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Subject: Splitting input file - increasing number of mappers
> Hi,
> I have an input file where each line is of the form:
> <URL> <A NUMBER>
> URLs whose numbers are within a threshold of each other are considered similar. My
> task is to group together all similar URLs. For this I wrote a *custom
> writable* in which I implemented the threshold check in the *compareTo* method. Therefore, when Hadoop sorts, the similar URLs are grouped
> together. This seems to work fine.
> I have the following questions:
> 1> Since I am relying mostly on the sort feature provided by Hadoop, am
> I decreasing efficiency in any way, or, since sorting is what
> Hadoop does best, am I actually doing the right thing? If this is the
> right thing, then it seems my job mostly relies on the map
> task. Therefore, will increasing the number of mappers increase efficiency?
> 2> My file size is not more than 64 MB, i.e. one Hadoop block,
> which means not more than 1 mapper will be invoked. Will splitting the file
> into smaller pieces increase efficiency by invoking more mappers?
> Can someone kindly provide some insight or advice regarding the above?
> Thanks,
> MS student, IIT Kharagpur
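To make the threshold check described in the question concrete, here is a minimal sketch in plain Java. It is not the poster's actual class: `UrlScore`, the `THRESHOLD` value, and the sample numbers are all assumptions, and the Writable serialization methods (`write`/`readFields`) are left out so that it runs without Hadoop. One thing it makes visible: "within a threshold" is not transitive, so two keys can each compare equal to a third without comparing equal to each other, and what ends up grouped may depend on the order in which keys are compared.

```java
// Hypothetical key type; a real Hadoop key would also implement WritableComparable.
final class UrlScore implements Comparable<UrlScore> {
    static final long THRESHOLD = 5; // assumed value, for illustration only

    final String url;
    final long number;

    UrlScore(String url, long number) {
        this.url = url;
        this.number = number;
    }

    // Keys whose numbers differ by at most THRESHOLD are treated as equal.
    @Override
    public int compareTo(UrlScore other) {
        if (Math.abs(number - other.number) <= THRESHOLD) {
            return 0;
        }
        return Long.compare(number, other.number);
    }

    public static void main(String[] args) {
        UrlScore a = new UrlScore("http://a.example", 1);
        UrlScore b = new UrlScore("http://b.example", 5);
        UrlScore c = new UrlScore("http://c.example", 9);

        System.out.println(a.compareTo(b)); // a and b are within the threshold
        System.out.println(b.compareTo(c)); // b and c are within the threshold
        System.out.println(a.compareTo(c)); // a and c are NOT within the threshold
    }
}
```

Here `a` and `b` compare as equal, `b` and `c` compare as equal, yet `a` sorts strictly before `c`. The `Comparable` contract expects that if `x.compareTo(y) == 0` then `x` and `y` compare the same way against every other value, which this check does not satisfy, so it may be worth testing the grouping on inputs where the numbers form such a chain.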
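On the split-size suggestion: the quoted message does not name the parameter, so the following is only a sketch of the arithmetic, under my own assumptions. In the newer `mapreduce` API, `FileInputFormat` picks the split size as max(minSize, min(maxSize, blockSize)), and I believe the upper bound is set through `mapreduce.input.fileinputformat.split.maxsize` (`mapred.max.split.size` on older releases); that may or may not be the parameter meant above. `SplitMath` and `countSplits` are hypothetical names, and the 1.1 slack factor is my recollection of how the last, slightly oversized chunk is folded into one split.

```java
// Hypothetical helper, not Hadoop API: estimates how many splits one file yields.
final class SplitMath {
    // Assumed slack: a final chunk up to 10% larger than the split size stays one split.
    static final double SPLIT_SLOP = 1.1;

    // Split size as chosen by the newer FileInputFormat: max(min, min(max, block)).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits (and hence map tasks) for a single splittable file.
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // whatever is left becomes the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileLength = 64 * mb; // the 64 MB file from the question
        long blockSize = 64 * mb;

        // Defaults: no upper bound, so the whole file is one split.
        long defaultSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        System.out.println(countSplits(fileLength, defaultSize));

        // Capping the split size at 16 MB (an arbitrary example value).
        long cappedSize = computeSplitSize(blockSize, 1L, 16 * mb);
        System.out.println(countSplits(fileLength, cappedSize));
    }
}
```

Under these assumptions the 64 MB file goes from one split to four when the maximum split size is capped at 16 MB. Whether four small map tasks actually finish sooner than one depends on per-task startup overhead, so it is worth measuring rather than assuming.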