Re: Hadoop noob question
Yes, it's an MR job under the hood. My question was that you wrote that
using distcp you lose the benefits of parallel processing in Hadoop. I
think the MR job of distcp divides files into individual map tasks based on
the total size of the transfer, so multiple mappers would still be spawned
if the size of the transfer is huge, and they would work in parallel.
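
For illustration, a minimal sketch of how that parallelism can be tuned (the
-m flag caps the number of simultaneous map tasks; the cluster names and paths
here are just placeholders):

hadoop distcp -m 20 hdfs://nn1:8020/source/dir hdfs://nn2:8020/dest/dir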

Correct me if there is anything wrong!

Thanks,
Rahul
On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:

> No. distcp is actually a mapreduce job under the hood.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <
> [EMAIL PROTECTED]> wrote:
>
>> Thanks to both of you!
>>
>> Rahul
>>
>>
>> On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <[EMAIL PROTECTED]> wrote:
>>
>>> you can do that using file:///
>>>
>>> example:
>>>
>>>
>>> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
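>>>
>>> And for the original local-to-HDFS direction, a minimal sketch (same placeholder
>>> paths; note that with a file:// source the file has to be readable from the
>>> node(s) that run the copy tasks):
>>>
>>> hadoop distcp file:///Users/myhome/Desktop/somefile hdfs://localhost:8020/somedir/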
>>>
>>> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
>>> [EMAIL PROTECTED]> wrote:
>>>
>>>> @Tariq can you point me to some resource which shows how distcp is used
>>>> to upload files from local to HDFS?
>>>>
>>>> Isn't distcp an MR job? Wouldn't it need the data to be already present
>>>> in Hadoop's fs?
>>>>
>>>> Rahul
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> You're welcome :)
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>>>> [EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Thanks Tariq!
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> @Rahul : Yes. distcp can do that.
>>>>>>>
>>>>>>> And the bigger the files, the less metadata there is, and hence less
>>>>>>> memory consumption.
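>>>>>>>
>>>>>>> For example (rough numbers, assuming a 128 MB block size): 10 TB stored as
>>>>>>> 1 GB files is ~10,000 files and ~80,000 blocks, i.e. roughly 90,000 namespace
>>>>>>> objects, whereas the same 10 TB as 1 MB files is ~10 million files plus
>>>>>>> ~10 million blocks, i.e. roughly 20 million objects for the NN to track.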
>>>>>>>
>>>>>>> Warm Regards,
>>>>>>> Tariq
>>>>>>> cloudfront.blogspot.com
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>> IMHO, I think the statement about the NN with regard to block metadata
>>>>>>>> is more of a general statement. Even if you put lots of small files of
>>>>>>>> combined size 10 TB, you need to have a capable NN.
>>>>>>>>
>>>>>>>> Can distcp be used to copy local-to-HDFS?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Rahul
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <
>>>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>>>
>>>>>>>>> absolutely right, Mohammad
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <
>>>>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>>>>
>>>>>>>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>>>>>>>
>>>>>>>>>> Every file and block in HDFS is treated as an object, and for each
>>>>>>>>>> object around 200B of metadata gets created. So the NN should be powerful
>>>>>>>>>> enough to handle that much metadata, since it is all kept in memory.
>>>>>>>>>> Actually, memory is the most important metric when it comes to the NN.
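>>>>>>>>>>
>>>>>>>>>> As a rough back-of-the-envelope illustration (taking that ~200B figure as an
>>>>>>>>>> approximation): 20 million namespace objects (files + blocks) would need about
>>>>>>>>>> 20,000,000 x 200B, i.e. roughly 4 GB of NN heap just for the metadata, before
>>>>>>>>>> counting anything else the NN holds in memory.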
>>>>>>>>>>
>>>>>>>>>> Am I correct @Nitin?
>>>>>>>>>>
>>>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much data
>>>>>>>>>> you don't actually just do a "put". You could use something like "distcp"
>>>>>>>>>> for parallel copying. A better approach would be to use a data aggregation
>>>>>>>>>> tool like Flume or Chukwa, as Nitin has already pointed out. Facebook uses
>>>>>>>>>> its own data aggregation tool, called Scribe, for this purpose.
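>>>>>>>>>>
>>>>>>>>>> For a flavour of the Flume route, a minimal agent sketch (the agent name,
>>>>>>>>>> spool directory, and HDFS path are placeholders, not a recommended setup):
>>>>>>>>>>
>>>>>>>>>> agent.sources = src1
>>>>>>>>>> agent.channels = ch1
>>>>>>>>>> agent.sinks = sink1
>>>>>>>>>> agent.sources.src1.type = spooldir
>>>>>>>>>> agent.sources.src1.spoolDir = /var/log/incoming
>>>>>>>>>> agent.sources.src1.channels = ch1
>>>>>>>>>> agent.channels.ch1.type = memory
>>>>>>>>>> agent.sinks.sink1.type = hdfs
>>>>>>>>>> agent.sinks.sink1.hdfs.path = hdfs://localhost:8020/flume/events
>>>>>>>>>> agent.sinks.sink1.channel = ch1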
>>>>>>>>>>
>>>>>>>>>> Warm Regards,
>>>>>>>>>> Tariq
>>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>>>>>
>>>>>>>>>>> NN would still be in the picture because it will be writing a lot of
>>>>>>>>>>> metadata for each individual file. So you will need an NN capable enough
>>>>>>>>>>> to store the metadata for your entire dataset. Data will never go to