MapReduce >> mail # user >> How to combine input files for a MapReduce job


Re: How to combine input files for a MapReduce job
On the "control the number of mappers" question: you can use
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html
which is designed for cases like this. However, you still won't match
the speed you get from a single large file (or a few large files),
because the per-file open/close overhead remains and will slow you down.
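
To give a rough feel for what combining does to the map count, here is a
minimal, self-contained sketch of greedy packing. It is not the actual
CombineFileInputFormat code, which also takes node/rack locality into
account and splits large files by block. The class name, the pack() helper
and the 256 KB file size are assumptions made up for illustration; only
the 1000 files and the 64 MB limit come from your mail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombineSplitSketch {

    /**
     * Greedily packs file sizes (in bytes) into groups whose total does not
     * exceed maxSplitBytes. Each group stands for one combined split, i.e.
     * one map task. A file larger than the limit ends up alone in a group;
     * this sketch does not cut it into blocks.
     */
    static List<List<Long>> pack(long[] fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current group if adding this file would overflow it.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long[] files = new long[1000];
        Arrays.fill(files, 256L * 1024);      // assumed: 256 KB per file
        long maxSplit = 64L * 1024 * 1024;    // 64 MB, as in the question
        System.out.println("map tasks without combining: " + files.length);
        System.out.println("map tasks with combining:    " + pack(files, maxSplit).size());
    }
}
```

Under these assumed sizes the 1000 files collapse into 4 groups instead of
1000 map tasks, but every file still has to be opened and closed once,
which is the overhead mentioned above.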

On the "which file is being submitted to which TaskTracker" question:
it would help if the version/distribution of Apache Hadoop you use
includes https://issues.apache.org/jira/browse/MAPREDUCE-3678, which
makes each map task log the input split it processed.

On Mon, May 13, 2013 at 12:50 PM, Agarwal, Nikhil
<[EMAIL PROTECTED]> wrote:
> Hi,
>
> I have a 3-node cluster, with the JobTracker running on one machine and
> TaskTrackers on the other two. Instead of using HDFS, I have written my own
> FileSystem implementation. As an experiment, I kept 1000 text files (all of
> the same size) on both slave nodes and ran a simple WordCount MR job. It
> took around 50 minutes to complete. Afterwards, I concatenated all
> 1000 files into a single file and ran the WordCount MR job again; it took 35
> seconds. From the JobTracker UI I could see that the problem is
> the number of mappers the JobTracker creates: for 1000 files it creates
> 1000 maps, and for 1 file it creates 1 map (irrespective of file size).
>
> Is there a way to reduce the number of mappers, i.e. can I control the
> number of mappers through some configuration parameter so that Hadoop
> clubs files together until they reach some specified size (say, 64 MB) and
> then creates 1 map per 64 MB block?
>
> Also, I would like to know how to see which file is being submitted to which
> TaskTracker or, if that is not possible, how to check whether any data
> transfer is happening between my slave nodes during an MR job.
>
> Sorry for so many questions, and thank you for your time.
>
> Regards,
> Nikhil

--
Harsh J