MapReduce, mail # user - Hadoop noob question


Re: Hadoop noob question
Mohammad Tariq 2013-05-12, 12:37
No. distcp is actually a MapReduce job under the hood.

Warm Regards,
Tariq
cloudfront.blogspot.com
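
For reference, a minimal sketch of the local-to-HDFS direction asked about
below. The paths and the namenode address here are placeholders, and since
distcp runs as map tasks on the cluster, a file:/// source has to be readable
from whichever nodes run those maps, so this is mainly practical on a
single-node setup or with a path shared across the nodes:

hadoop distcp file:///Users/myhome/data hdfs://localhost:8020/user/myhome/data

For a one-machine upload, plain "hadoop fs -put" is the simpler tool; distcp
pays off when a lot of data has to be copied in parallel.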
On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <
[EMAIL PROTECTED]> wrote:

> Thanks to both of you!
>
> Rahul
>
>
> On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <[EMAIL PROTECTED]> wrote:
>
>> you can do that using file:///
>>
>> example:
>>
>> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>>
>> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
>> [EMAIL PROTECTED]> wrote:
>>
>>> @Tariq, can you point me to some resource which shows how distcp is used
>>> to upload files from local to HDFS?
>>>
>>> Isn't distcp an MR job? Wouldn't it need the data to be already present
>>> in Hadoop's fs?
>>>
>>> Rahul
>>>
>>>
>>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>>>
>>>> You're welcome :)
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>>> [EMAIL PROTECTED]> wrote:
>>>>
>>>>> Thanks Tariq!
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> @Rahul: Yes, distcp can do that.
>>>>>>
>>>>>> And the bigger the files, the less metadata there is, and hence the
>>>>>> lower the memory consumption on the NN.
>>>>>>
>>>>>> Warm Regards,
>>>>>> Tariq
>>>>>> cloudfront.blogspot.com
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> IMHO, the statement about the NN with regard to block metadata is more
>>>>>>> of a general statement. Even if you put lots of small files of combined
>>>>>>> size 10 TB, you need to have a capable NN.
>>>>>>>
>>>>>>> Can distcp be used to copy local-to-HDFS?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Rahul
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <
>>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>> Absolutely right, Mohammad.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>>>>>>>>
>>>>>>>>> Sorry for barging in, guys. I think Nitin is talking about this:
>>>>>>>>>
>>>>>>>>> Every file and block in HDFS is treated as an object, and for each
>>>>>>>>> object around 200 B of metadata gets created. So the NN should be
>>>>>>>>> powerful enough to handle that much metadata, since it is all going
>>>>>>>>> to be held in memory. Memory is actually the most important metric
>>>>>>>>> when it comes to the NN.
>>>>>>>>>
>>>>>>>>> Am I correct @Nitin?
>>>>>>>>>
>>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much data, you
>>>>>>>>> don't actually just do a "put". You could use something like "distcp"
>>>>>>>>> for parallel copying, but a better approach would be to use a data
>>>>>>>>> aggregation tool like Flume or Chukwa, as Nitin has already pointed
>>>>>>>>> out. Facebook uses its own data aggregation tool, called Scribe, for
>>>>>>>>> this purpose.
>>>>>>>>>
>>>>>>>>> Warm Regards,
>>>>>>>>> Tariq
>>>>>>>>> cloudfront.blogspot.com
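
A back-of-the-envelope sketch of that point, assuming the ~200 B/object figure
above and an illustrative 64 MB block size (both numbers are rough):

10 TB as 1 MB files: ~10M files + ~10M blocks = ~20M objects x 200 B ~= 4 GB of NN heap
10 TB as 1 GB files: ~10K files + ~160K blocks ~= 170K objects x 200 B ~= 35 MB of NN heap

So the same 10 TB costs orders of magnitude less NameNode memory when stored
as large files, which is what "bigger the files, lesser the metadata" means.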
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>>>>
>>>>>>>>>> The NN would still be in the picture because it will be writing a
>>>>>>>>>> lot of metadata for each individual file, so you will need an NN
>>>>>>>>>> capable of storing the metadata for your entire dataset. The data
>>>>>>>>>> itself will never go through the NN, but a lot of metadata about
>>>>>>>>>> that data will live on the NN, so it's always a good idea to have a
>>>>>>>>>> strong NN.
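
A quick way to see how many such objects a NameNode is tracking, assuming the
classic Hadoop 1.x command names of this era (newer releases put these behind
the "hdfs" front-end):

hadoop fs -count /        # prints DIR_COUNT FILE_COUNT CONTENT_SIZE for the tree
hadoop dfsadmin -report   # cluster capacity and per-datanode summary

The NN web UI (port 50070 by default) also shows the files-and-directories and
block counts alongside current heap usage.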
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>>>>>
>>>>>>>>>>> @Nitin, parallel dfs to write to HDFS is great, but I could not
>>>>>>>>>>> understand the meaning of a capable NN. As I know, the NN would not
>>>>>>>>>>> be a part of the actual data write pipeline, meaning that the data
>>>>>>>>>>> would not travel through the NN; the dfs would contact the NN from
>>>>>>>>>>> time to time to