Hadoop, mail # user - Help!!The problem about Hadoop


Re: Re: Help!!The problem about Hadoop
Alejandro Abdelnur 2010-10-05, 10:07
Or you could try using MultiFileInputFormat for your MR job.

http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapred/MultiFileInputFormat.html

Alejandro
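
If changing the input format is not an option, the concatenation suggested in the quoted reply below can also be done before the job runs. This is only a minimal sketch using the plain Java file API on a local directory; the class name SmallFileConcat and its concat method are hypothetical, and it assumes the inputs are small text files and that the output path lies outside the input directory.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Hadoop: merges many small local text
// files into one file before it is uploaded to HDFS.
final class SmallFileConcat {

    /**
     * Appends every regular file in inputDir (sorted by path) to output.
     * Assumes output is NOT inside inputDir and that each input file is
     * small enough to read into memory. Returns the number of bytes written.
     */
    static long concat(Path inputDir, Path output) throws IOException {
        List<Path> parts;
        try (Stream<Path> listing = Files.list(inputDir)) {
            parts = listing.filter(Files::isRegularFile)
                           .sorted()
                           .collect(Collectors.toList());
        }
        long written = 0;
        try (OutputStream out = Files.newOutputStream(output)) {
            for (Path part : parts) {
                byte[] data = Files.readAllBytes(part);
                out.write(data);
                written += data.length;
                // Add a newline if the file lacks one, so the last record of
                // one file does not run into the first record of the next.
                if (data.length > 0 && data[data.length - 1] != '\n') {
                    out.write('\n');
                    written++;
                }
            }
        }
        return written;
    }
}
```

The merged file would then be copied into HDFS and used as the job input, so the number of map tasks follows the block size rather than the file count.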

On Tue, Oct 5, 2010 at 4:55 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> 500 small files comprising one gigabyte? Perhaps you should try
> concatenating them all into one big file and running the job again;
> ideally, a mapper should run for at least a minute. Small files also
> don't make good use of HDFS blocks.
>
> Have a read: http://www.cloudera.com/blog/2009/02/the-small-files-problem/
>
> 2010/10/5 Jander <[EMAIL PROTECTED]>:
>> Hi Jeff,
>>
>> Thank you very much for your reply.
>>
>> I know Hadoop has overhead, but isn't it too large in my case?
>>
>> The 1 GB text input produces about 500 map tasks because it is composed of many small text files. Each map task takes between 8 and 20 seconds. I enable map output compression with conf.setCompressMapOutput(true).
>>
>> Thanks,
>> Jander
>>
>>
>>
>>
>> At 2010-10-05 16:28:55,"Jeff Zhang" <[EMAIL PROTECTED]> wrote:
>>
>>>Hi Jander,
>>>
>>>Hadoop has overhead compared to a single-machine solution. How many
>>>tasks do you get when you run your Hadoop job? And how long does each
>>>map and reduce task take?
>>>
>>>There are lots of tips for Hadoop performance tuning, such as
>>>compression and JVM reuse.
>>>
>>>
>>>2010/10/5 Jander <[EMAIL PROTECTED]>:
>>>> Hi, all
>>>> I built an application using Hadoop.
>>>> With 1 GB of text data as input, the results are as follows:
>>>>    (1) a cluster of 3 PCs: the time consumed is 1020 seconds.
>>>>    (2) a cluster of 4 PCs: the time is about 680 seconds.
>>>> But the application took about 280 seconds before I used Hadoop, so at the rate above I would need 8 PCs to match the earlier speed. My question: is this expected?
>>>>
>>>> Jander,
>>>> Thanks.
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>--
>>>Best Regards
>>>
>>>Jeff Zhang
>>
>
>
>
> --
> Harsh J
> www.harshj.com
>
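
The figures quoted in this thread (about 500 map tasks at 8 to 20 seconds each) can be combined into a rough estimate of the map phase alone. The sketch below is a back-of-envelope model, not a measurement: the class MapWaveEstimate is hypothetical, and it assumes 2 concurrent map slots per node and that every PC runs tasks, neither of which is stated in the thread. It also ignores shuffle, reduce, and scheduling gaps.

```java
// Hypothetical back-of-envelope model: tasks run in "waves", one task per
// free slot, so wall time is roughly waves multiplied by per-task time.
final class MapWaveEstimate {

    /** Number of sequential waves needed to run all tasks on the available slots. */
    static int waves(int tasks, int nodes, int slotsPerNode) {
        int slots = nodes * slotsPerNode;
        return (tasks + slots - 1) / slots; // integer ceiling of tasks / slots
    }

    /** Rough map-phase wall time in seconds for a fixed per-task duration. */
    static int estimateSeconds(int tasks, int nodes, int slotsPerNode, int secondsPerTask) {
        return waves(tasks, nodes, slotsPerNode) * secondsPerTask;
    }
}
```

Under those assumptions, 4 nodes give 63 waves, or 504 to 1260 seconds, and 3 nodes give 84 waves, or 672 to 1680 seconds. The reported 680 and 1020 seconds fall inside those ranges, which is consistent with the advice above that the task count, rather than the data volume, dominates the runtime.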