MapReduce >> mail # user >> Re: Hadoop noob question


Rahul Bhattacharjee 2013-05-11, 16:10
Thoihen Maibam 2013-05-11, 10:49
Nitin Pawar 2013-05-11, 10:54
maisnam ns 2013-05-11, 11:08
Nitin Pawar 2013-05-11, 11:24
Mohammad Tariq 2013-05-12, 13:42
Rahul Bhattacharjee 2013-05-12, 11:53
Nitin Pawar 2013-05-12, 12:06
Mohammad Tariq 2013-05-12, 12:37
Rahul Bhattacharjee 2013-05-12, 12:45
Re: Hadoop noob question
I had said that if you use distcp to copy data *from localFS to HDFS*, then
you won't be able to exploit parallelism, as the entire file is present on a
single machine. So there won't be multiple TTs (TaskTrackers) working on it.

Please comment if you think I am wrong somewhere.
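
To make the two positions in this exchange concrete, here is a minimal sketch of a size-based split. It is a simplified model and not DistCp's actual code: the function name `assign_files_to_maps`, the paths, and the assumption that whole files (never parts of a file) are handed to map tasks are all illustrative.

```python
def assign_files_to_maps(file_sizes, num_maps):
    """Group whole files into map tasks by total bytes (hypothetical model).

    file_sizes: dict mapping a path to its size in bytes.
    num_maps:   upper bound on the number of map tasks.
    Assumes a file is never split across maps.
    """
    total = sum(file_sizes.values())
    # Target number of bytes per map task (ceiling division, at least 1).
    target = max(1, -(-total // num_maps))

    groups, current, current_bytes = [], [], 0
    for path, size in file_sizes.items():
        current.append(path)
        current_bytes += size
        # Close the group once it has reached its share of the bytes.
        if current_bytes >= target:
            groups.append(current)
            current, current_bytes = [], 0
    if current:
        groups.append(current)
    return groups


GIB = 2**30

# Four 1 GiB files, four maps allowed: the work spreads over four groups.
many = assign_files_to_maps({f"/data/part-{i}": GIB for i in range(4)}, 4)

# One 10 GiB file, four maps allowed: only one group can be formed.
single = assign_files_to_maps({"/data/big.log": 10 * GIB}, 4)
```

Under these assumptions, a large transfer made of many files does fan out across mappers, while a single huge file ends up with one mapper no matter how many are allowed.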

Warm Regards,
Tariq
cloudfront.blogspot.com
On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee <
[EMAIL PROTECTED]> wrote:

> Yes, it's an MR job under the hood. My question was about your statement
> that with distcp you lose the benefits of Hadoop's parallel processing. I
> think the distcp MR job divides the files among individual map tasks based on
> the total size of the transfer, so multiple mappers would still be spawned
> if the transfer is huge, and they would work in parallel.
>
> Correct me if there is anything wrong!
>
> Thanks,
> Rahul
>
>
> On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>
>> No. distcp is actually a MapReduce job under the hood.
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <
>> [EMAIL PROTECTED]> wrote:
>>
>>> Thanks to both of you!
>>>
>>> Rahul
>>>
>>>
>>> On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <[EMAIL PROTECTED]> wrote:
>>>
>>>> You can do that using the file:/// scheme.
>>>>
>>>> Example:
>>>>
>>>> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>>>>
>>>> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
>>>> [EMAIL PROTECTED]> wrote:
>>>>
>>>>> @Tariq, can you point me to some resource that shows how distcp is
>>>>> used to upload files from local to HDFS?
>>>>>
>>>>> Isn't distcp an MR job? Wouldn't it need the data to be already
>>>>> present in Hadoop's FS?
>>>>>
>>>>> Rahul
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> You're welcome :)
>>>>>>
>>>>>> Warm Regards,
>>>>>> Tariq
>>>>>> cloudfront.blogspot.com
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> Thanks Tariq!
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>> @Rahul: Yes, distcp can do that.
>>>>>>>>
>>>>>>>> And the bigger the files, the less metadata there is, and hence the lower
>>>>>>>> the memory consumption.
>>>>>>>>
>>>>>>>> Warm Regards,
>>>>>>>> Tariq
>>>>>>>> cloudfront.blogspot.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>>>
>>>>>>>>> IMHO, the statement about the NN with regard to block metadata
>>>>>>>>> is more of a general statement. Even if you put lots of small files with a
>>>>>>>>> combined size of 10 TB, you need to have a capable NN.
>>>>>>>>>
>>>>>>>>> Can distcp be used to copy from local to HDFS?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Rahul
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <
>>>>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>>>>
>>>>>>>>>> Absolutely right, Mohammad.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <
>>>>>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sorry for barging in, guys. I think Nitin is talking about this:
>>>>>>>>>>>
>>>>>>>>>>> Every file and block in HDFS is treated as an object, and for
>>>>>>>>>>> each object around 200 bytes of metadata are created. So the NN should be
>>>>>>>>>>> powerful enough to handle that much metadata, since all of it is held
>>>>>>>>>>> in memory. In fact, memory is the most important resource when it comes to
>>>>>>>>>>> the NN.
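
The arithmetic behind this point can be sketched roughly as follows. The 200 bytes per object is the figure quoted in the message above; the 64 MiB block size, the function name `namenode_metadata_bytes`, and the one-object-per-file plus one-object-per-block model are assumptions made only for illustration.

```python
BYTES_PER_OBJECT = 200          # approximate figure quoted in the thread
DEFAULT_BLOCK_SIZE = 64 * 2**20  # assumed HDFS block size (64 MiB)


def namenode_metadata_bytes(total_bytes, file_size, block_size=DEFAULT_BLOCK_SIZE):
    """Rough NameNode memory estimate for total_bytes stored as equal-sized files.

    Counts one object per file plus one object per block, each costing
    BYTES_PER_OBJECT bytes. A back-of-the-envelope model, not a sizing tool.
    """
    num_files = -(-total_bytes // file_size)        # ceiling division
    blocks_per_file = -(-file_size // block_size)   # ceiling division
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


TIB, GIB, MIB = 2**40, 2**30, 2**20

# The same 10 TiB stored two ways.
small_files = namenode_metadata_bytes(10 * TIB, 1 * MIB)  # many 1 MiB files
large_files = namenode_metadata_bytes(10 * TIB, 1 * GIB)  # fewer 1 GiB files
```

With these assumed numbers, 10 TiB of 1 MiB files costs the NN about 4.2 GB of metadata, while the same data in 1 GiB files costs about 35 MB, which is the effect described in the thread.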
>>>>>>>>>>>
>>>>>>>>>>> Am I correct @Nitin?
>>>>>>>>>>>
>>>>>>>>>>> @Thoihen: As Nitin has said, when you talk about that much data
>>>>>>>>>>> you don't just do a "put". You could use something like "distcp"
>>>>>>>>>>> for parallel copying. A better approach would be to use a data aggregation
>>>>>>>>>>> tool like Flume or Chukwa, as Nitin has already pointed out. Facebook uses
Chris Mawata 2013-05-12, 14:21
Rahul Bhattacharjee 2013-05-16, 14:18