MapReduce >> mail # user >> How to combine input files for a MapReduce job


Re: How to combine input files for a MapReduce job
Yes, I believe the branch-1 patch attached there should apply cleanly to 1.0.4.

On Mon, May 13, 2013 at 1:25 PM, Agarwal, Nikhil
<[EMAIL PROTECTED]> wrote:
> Hi,
>
> @Harsh: Thanks for the reply. Would the patch work with the Hadoop 1.0.4 release?
>
> -----Original Message-----
> From: Harsh J [mailto:[EMAIL PROTECTED]]
> Sent: Monday, May 13, 2013 1:03 PM
> To: <[EMAIL PROTECTED]>
> Subject: Re: How to combine input files for a MapReduce job
>
> For "control number of mappers" question: You can use http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html
> which is designed to solve similar cases. However, you cannot beat the speed you get out of a single large file (or a few large files), as you'll still have file open/close overheads which will bog you down.
>
> For "which file is being submitted to which" question: Having
> https://issues.apache.org/jira/browse/MAPREDUCE-3678 in the version/distribution of Apache Hadoop you use would help.
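
For reference, a minimal sketch of a WordCount driver using this approach. It assumes a Hadoop release that ships the new-API CombineTextInputFormat, a concrete subclass of CombineFileInputFormat (it is not in 1.0.4, where you would have to subclass CombineFileInputFormat and supply a record reader yourself); the 64 MB cap and the class and path names are placeholders for the experiment quoted below.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinedWordCount {

  // Standard WordCount mapper: emits (word, 1) for every token in a line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Standard WordCount reducer: sums the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "wordcount-combined-input");
    job.setJarByClass(CombinedWordCount.class);

    // Pack many small files into each split instead of one split (one mapper) per file.
    job.setInputFormatClass(CombineTextInputFormat.class);
    // Cap each combined split at 64 MB, i.e. roughly one mapper per 64 MB of input.
    CombineTextInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));    // placeholder input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // placeholder output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With the cap at 64 MB, the 1000 small files described below should be packed into a handful of splits, so the job runs with a handful of mappers rather than 1000, while each mapper still pays the per-file open/close cost Harsh mentions.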
>
> On Mon, May 13, 2013 at 12:50 PM, Agarwal, Nikhil <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>>
>>
>> I have a 3-node cluster, with the JobTracker running on one machine and
>> TaskTrackers on the other two. Instead of using HDFS, I have written my
>> own FileSystem implementation. As an experiment, I kept 1000 text files
>> (all of the same size) on both slave nodes and ran a simple WordCount
>> MR job. It took around 50 mins to complete. Afterwards, I concatenated
>> all 1000 files into a single file and ran the same WordCount job, which
>> took 35 secs. From the JobTracker UI I could see that the problem is
>> the number of mappers the JobTracker creates: for 1000 files it creates
>> 1000 maps, and for 1 file it creates 1 map (irrespective of file size).
>>
>>
>>
>> Thus, is there a way to reduce the number of mappers? That is, can I
>> control the number of mappers through some configuration parameter so
>> that Hadoop would combine files until it reaches some specified size
>> (say, 64 MB) and then create 1 map per 64 MB block?
>>
>>
>>
>> Also, I wanted to know how to see which file is being submitted to
>> which TaskTracker, or, if that is not possible, how do I check whether
>> any data transfer is happening between my slave nodes during an MR job?
>>
>>
>>
>> Sorry for so many questions, and thank you for your time.
>>
>>
>>
>> Regards,
>>
>> Nikhil
>
>
>
> --
> Harsh J

--
Harsh J