MapReduce, mail # user - Splitting input file - increasing number of mappers


parnab kumar 2013-07-06, 07:50
Re: Splitting input file - increasing number of mappers
Shumin Guo 2013-07-06, 15:56
You also need to pay attention to the split boundary, because you don't
want one line split across different mappers. Maybe you can look at a
multi-line input format.

Simon.
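For context on the split-boundary point: Hadoop's standard line-oriented input already guards against splitting a record, by convention the reader for every split except the first skips the partial line at its start, and each reader finishes the line that crosses its end boundary. A plain-Java sketch of that convention (illustrative only, not Hadoop's actual LineRecordReader):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the line-reader convention: a split starting mid-line skips
// that partial line (the previous split's reader owns it), and a reader
// may read past its end offset to finish its last line.
public class SplitLines {
    // Return the lines a reader for byte range [start, end) would emit.
    static List<String> readSplit(String data, int start, int end) {
        int pos = start;
        if (start > 0) {
            // Skip forward past the first newline at or after start-1;
            // the previous split's reader already consumed that line.
            int nl = data.indexOf('\n', start - 1);
            pos = (nl == -1) ? data.length() : nl + 1;
        }
        List<String> lines = new ArrayList<>();
        while (pos < end && pos < data.length()) {
            int nl = data.indexOf('\n', pos);
            if (nl == -1) nl = data.length();
            lines.add(data.substring(pos, nl)); // may read past `end`
            pos = nl + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        String data = "url1 10\nurl2 20\nurl3 30\n";
        // A split boundary at byte 10 falls inside "url2 20", yet no
        // line is split between the two readers:
        System.out.println(readSplit(data, 0, 10));             // [url1 10, url2 20]
        System.out.println(readSplit(data, 10, data.length())); // [url3 30]
    }
}
```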
On Jul 6, 2013 10:18 AM, "Sanjay Subramanian" <
[EMAIL PROTECTED]> wrote:

>  More mappers will make it faster.
>      You can try this parameter:
>       mapreduce.input.fileinputformat.split.maxsize=<sizeinbytes>
>      This will control the input split size and force more mappers to run.
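To see why lowering maxsize forces more mappers: FileInputFormat derives the split size roughly as max(minSize, min(maxSize, blockSize)) and then carves each file into splits of that size. A back-of-the-envelope sketch (an approximation — the real getSplits also honors a slack factor and block locality):

```java
// Approximate mapper count as a function of the split-size settings.
public class SplitCount {
    // splitSize = max(minSize, min(maxSize, blockSize)), per FileInputFormat.
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // One mapper per split: ceiling of fileSize / splitSize.
    static long numSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long block = 64L * 1024 * 1024; // 64 MB block size
        long file  = 60L * 1024 * 1024; // a 60 MB input file
        // Default: maxsize is effectively unbounded, so splitSize == blockSize
        // and the whole file fits in one split -> 1 mapper.
        System.out.println(numSplits(file, splitSize(1, Long.MAX_VALUE, block)));
        // Forcing maxsize down to 16 MB yields 4 mappers for the same file.
        System.out.println(numSplits(file, splitSize(1, 16L * 1024 * 1024, block)));
    }
}
```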
>
>
>  Also, your use case seems a good candidate for defining a Combiner, because you are
> grouping keys based on a criterion.
> The only gotcha is that Combiners are not guaranteed to run.
>
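To illustrate the Combiner suggestion: a combiner pre-aggregates map output per key before it crosses the network, and because the framework may run it zero, one, or many times, the operation must be associative and commutative (counts, sums, and the like). A plain-Java sketch of the aggregation a combiner would perform (class and key names are illustrative; in a real job this logic lives in a Reducer subclass registered via job.setCombinerClass):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative map-side pre-aggregation, the work a combiner does:
// collapse repeated keys in the mapper's output into partial sums.
public class CombinerSketch {
    static Map<String, Integer> combine(String[][] pairs) {
        Map<String, Integer> partial = new HashMap<>();
        for (String[] kv : pairs) {
            // Sum per key; summation is associative and commutative, so it
            // stays correct whether the combiner runs zero or many times.
            partial.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        String[][] mapOutput = {
            {"groupA", "1"}, {"groupB", "1"}, {"groupA", "1"}, {"groupA", "1"}
        };
        // Four records shrink to two before the shuffle, e.g. {groupA=3, groupB=1}.
        System.out.println(combine(mapOutput));
    }
}
```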
>  Give these a shot.
>
>  Good luck
>
>  sanjay
>
>
>
>   From: parnab kumar <[EMAIL PROTECTED]>
> Reply-To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Date: Saturday, July 6, 2013 12:50 AM
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Subject: Splitting input file - increasing number of mappers
>
>  Hi ,
>
>          I have an input file where each line is of the form :
>
>             <URL> <A NUMBER>
>
>        URLs whose numbers are within a threshold of each other are considered
> similar. My task is to group together all similar URLs. For this I wrote a *custom
> writable* where I implemented the threshold check in the *compareTo* method.
> Therefore, when Hadoop sorts, the similar URLs are grouped together. This seems
> to work fine.
>       I have the following queries:
>
>    1>   Since I am relying mostly on the sort feature provided by Hadoop, am
> I decreasing efficiency in any way, or, since sorting is what Hadoop does
> best, am I actually doing the right thing? If this is the right approach, then
> my job mostly relies on the map task. Therefore, will increasing the number of
> mappers increase efficiency?
>
>       2> My file size is not more than 64 MB, i.e., one Hadoop block, which
> means no more than one mapper will be invoked. Will splitting the file into
> smaller pieces increase efficiency by invoking more mappers?
>
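A hypothetical reconstruction of the custom-writable idea from query 1>, written as a plain-Java Comparable (the class name, fields, and threshold are all illustrative, not the poster's actual code). One caveat worth noting: a compareTo that returns 0 whenever two numbers are within the threshold is not transitive (with threshold 5, 10~15 and 15~20 but not 10~20), which violates the Comparable contract and can make sort results order-dependent; sorting strictly by the number and grouping adjacent records afterwards is the safer pattern.

```java
import java.util.Arrays;

// Illustrative record: a URL plus its number, compared with a threshold so
// that "similar" records compare as equal and land adjacent after sorting.
public class UrlRecord implements Comparable<UrlRecord> {
    static final long THRESHOLD = 5; // illustrative similarity threshold
    final String url;
    final long number;

    UrlRecord(String url, long number) { this.url = url; this.number = number; }

    @Override
    public int compareTo(UrlRecord other) {
        // Treat numbers within THRESHOLD of each other as equal.
        // WARNING: this relation is not transitive, so it technically
        // violates the compareTo contract (see lead-in above).
        if (Math.abs(this.number - other.number) <= THRESHOLD) return 0;
        return Long.compare(this.number, other.number);
    }

    public static void main(String[] args) {
        UrlRecord[] recs = {
            new UrlRecord("a.com", 100), new UrlRecord("b.com", 3),
            new UrlRecord("c.com", 102), new UrlRecord("d.com", 1)
        };
        Arrays.sort(recs); // similar numbers end up adjacent
        for (UrlRecord r : recs) System.out.println(r.url + " " + r.number);
    }
}
```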
>  Can someone kindly provide some insight or advice regarding the above?
>
>  Thanks ,
> Parnab
> MS student, IIT kharagpur
>