Re: Splitting input file - increasing number of mappers
You also need to pay attention to the split boundary, because you don't
want one line split across different mappers. Maybe you can think about a
multi-line input format.

Simon.
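
A minimal driver sketch of Simon's line-boundary point, assuming the job can
use NLineInputFormat (which never breaks a line across splits); the class
name, job name, and line count below are illustrative, not from the thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

    public class LineAlignedSplits {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "line-aligned-splits");
            // NLineInputFormat hands each mapper exactly N complete lines,
            // so no record is ever split across mappers.
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.setNumLinesPerSplit(job, 100000); // illustrative value
            // ... set mapper/reducer and input/output paths, then submit the job.
        }
    }
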
On Jul 6, 2013 10:18 AM, "Sanjay Subramanian" <[EMAIL PROTECTED]> wrote:

>  More mappers will make it faster.
>      You can try this parameter:
>       mapreduce.input.fileinputformat.split.maxsize=<size in bytes>
>      This will control the input split size and force more mappers to run
> [see the driver sketch at the end of this message].
>
>
>  Also, your use case seems like a good candidate for defining a Combiner, because you
> are grouping keys based on a criterion.
> The only gotcha is that Combiners are not guaranteed to run
> [see the combiner note at the end of this message].
>
>  Give these a shot.
>
>  Good luck
>
>  sanjay
>
>
>
>   From: parnab kumar <[EMAIL PROTECTED]>
> Reply-To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Date: Saturday, July 6, 2013 12:50 AM
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Subject: Splitting input file - increasing number of mappers
>
>  Hi,
>
>          I have an input file where each line is of the form:
>
>             <URL> <A NUMBER>
>
>        URLs whose numbers differ by no more than a threshold are considered similar. My
> task is to group together all similar URLs. For this I wrote a *custom
> writable* where I implemented the threshold check in the *compareTo* method. Therefore,
> when Hadoop sorts, the similar URLs are grouped together. This seems to work fine.
>       I have the following queries:
>
>    1>   Since I am relying on the sort feature provided by Hadoop, am
> I decreasing efficiency in any way, or am I doing the right thing by using
> Hadoop's sort, which is what Hadoop does best? If this is the right approach,
> then my job mostly relies on the map task. Therefore, will increasing the
> number of mappers increase efficiency?
>
>       2> My file size is not more than 64 MB, i.e. one Hadoop block size,
> which means no more than 1 mapper will be invoked. Will splitting the file
> into smaller pieces increase efficiency by invoking more mappers?
>
>  Can someone kindly provide some insight or advice regarding the above?
>
>  Thanks,
> Parnab
> MS student, IIT Kharagpur
>
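
A minimal sketch of the split.maxsize advice from Sanjay's reply; the 16 MB
cap and the class and job names are illustrative assumptions, not from the
thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class MoreMappersDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "group-similar-urls");
            // Sets mapreduce.input.fileinputformat.split.maxsize: capping each
            // split at 16 MB turns one 64 MB file into ~4 map tasks instead of 1.
            FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024);
            // ... set mapper/reducer and input/output paths, then submit the job.
        }
    }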
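
Wiring in the Combiner Sanjay suggests is one extra line in the same driver;
reusing the reducer as the combiner is safe only if the reduce logic is
commutative and associative, which is an assumption here (SimilarUrlReducer
is a hypothetical class name):

    // A Combiner may run zero, one, or many times on map-side output,
    // so the job must remain correct even if it never runs.
    job.setCombinerClass(SimilarUrlReducer.class);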
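
Finally, a hypothetical reconstruction of the custom writable Parnab
describes, with the threshold check in compareTo; the field names and the
THRESHOLD value are assumptions. Note that a threshold-based compareTo is
not transitive, so the grouping it produces can depend on the input:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class UrlScoreKey implements WritableComparable<UrlScoreKey> {
        private static final long THRESHOLD = 10; // illustrative value
        private String url;
        private long number;

        public void write(DataOutput out) throws IOException {
            out.writeUTF(url);
            out.writeLong(number);
        }

        public void readFields(DataInput in) throws IOException {
            url = in.readUTF();
            number = in.readLong();
        }

        @Override
        public int compareTo(UrlScoreKey other) {
            // Keys whose numbers differ by at most THRESHOLD sort as equal,
            // so Hadoop's sort places "similar" URLs next to each other.
            long diff = number - other.number;
            if (Math.abs(diff) <= THRESHOLD) return 0;
            return Long.compare(number, other.number);
        }
        // equals/hashCode omitted in this sketch.
    }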