MapReduce >> mail # user >> How to combine input files for a MapReduce job


Re: How to combine input files for a MapReduce job
Look into mapred.max.split.size, mapred.min.split.size, and the number of
mappers in mapred-site.xml.
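
A rough sketch of how the split-size settings interact, assuming the standard FileInputFormat behaviour in which each input file is split on its own and a split never spans two files. computeSplitSize follows the formula used by the new-API FileInputFormat as far as I know; splitsForFile and the file sizes are made up for illustration, and the small slop factor Hadoop applies is ignored.

```java
final class SplitSizeSketch {
    // Formula used by the new-API FileInputFormat to pick a split size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Simplified split count for ONE file (ignores Hadoop's slop factor).
    // A file always yields at least one split, however small it is.
    static long splitsForFile(long fileLen, long splitSize) {
        return Math.max(1, (fileLen + splitSize - 1) / splitSize);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed values: 64 MB block, min split 1 byte, max split 256 MB.
        long splitSize = computeSplitSize(64 * mb, 1, 256 * mb);
        // Hypothetical input: 1000 files of 100 KB vs. one file of the same total size.
        long manySmall = 1000 * splitsForFile(100 * 1024L, splitSize);
        long oneBig = splitsForFile(1000 * 100 * 1024L, splitSize);
        System.out.println("split size = " + splitSize / mb + " MB");
        System.out.println("1000 small files -> " + manySmall + " splits");
        System.out.println("1 combined file  -> " + oneBig + " splits");
    }
}
```

Under these assumptions the small files still give 1000 splits whatever the min/max values are, so the split-size properties mostly help with large files. Merging small files into one split needs an input format that combines them.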

Thanks & Regards


Shashwat Shriparv

On Mon, May 13, 2013 at 12:50 PM, Agarwal, Nikhil <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I have a 3-node cluster, with the JobTracker running on one machine and
> TaskTrackers on the other two. Instead of using HDFS, I have written my own
> FileSystem implementation. As an experiment, I kept 1000 text files (all of
> the same size) on both slave nodes and ran a simple WordCount MR job. It
> took around 50 minutes to complete. Afterwards, I concatenated all 1000
> files into a single file and ran the WordCount MR job again; it took 35
> seconds. From the JobTracker UI I could see that the difference comes from
> the number of mappers the JobTracker creates: for 1000 files it creates
> 1000 maps, and for 1 file it creates 1 map (irrespective of file size).
>
> Thus, is there a way to reduce the number of mappers? That is, can I
> control the number of mappers through some configuration parameter, so
> that Hadoop combines files until it reaches some specified size (say,
> 64 MB) and then creates 1 map per 64 MB block?
>
> Also, I would like to know how to see which file is submitted to which
> TaskTracker or, if that is not possible, how I can check whether any data
> transfer is happening between my slave nodes during an MR job.
>
> Sorry for so many questions, and thank you for your time.
>
> Regards,
>
> Nikhil
>
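
On the question above about combining files until a target size (say, 64 MB) is reached: Hadoop ships a CombineFileInputFormat class meant for the many-small-files case. The sketch below is not its implementation, only a plain-Java illustration of the grouping idea. The class name, the pack helper, and the 512 KB file size are hypothetical, and the real class also takes node and rack locality into account.

```java
import java.util.ArrayList;
import java.util.List;

final class CombineFilesSketch {
    // Greedily pack file lengths (in bytes) into groups of at most maxGroupBytes.
    // A file larger than the limit ends up in a group of its own.
    static List<List<Long>> pack(List<Long> fileLengths, long maxGroupBytes) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long len : fileLengths) {
            // Close the current group if this file would push it over the limit.
            if (!current.isEmpty() && currentBytes + len > maxGroupBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(len);
            currentBytes += len;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        List<Long> files = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            files.add(512 * 1024L); // hypothetical: 1000 files of 512 KB each
        }
        List<List<Long>> groups = pack(files, 64 * mb);
        // Each group would correspond to one map task instead of one map per file.
        System.out.println(groups.size() + " groups instead of " + files.size() + " maps");
    }
}
```

With these made-up sizes, 128 files fit in each 64 MB group, so the 1000 files collapse into 8 groups.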