Hadoop >> mail # user >> Re: Hadoop noob question


Thread:
  Rahul Bhattacharjee 2013-05-11, 15:41
  Nitin Pawar 2013-05-11, 15:50
  Mohammad Tariq 2013-05-11, 16:03
  Nitin Pawar 2013-05-11, 16:05
  Shahab Yunus 2013-05-11, 16:10
  Mohammad Tariq 2013-05-11, 17:04
  Rahul Bhattacharjee 2013-05-11, 17:16
Re: Hadoop noob question
You're welcome :)

Warm Regards,
Tariq
cloudfront.blogspot.com
On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
[EMAIL PROTECTED]> wrote:

> Thanks Tariq!
>
>
> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote:
>
>> @Rahul : Yes. distcp can do that.
>>
>> And the bigger the files, the less metadata there is, and hence the less memory the NN consumes.
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
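For what it's worth, the distcp invocation Tariq confirms above can be sketched like this (the local path, HDFS path, and namenode address are hypothetical examples, not from the thread):

```shell
# Copy a local directory into HDFS with distcp. The file:/// scheme
# makes distcp read from the local filesystem; the destination is an
# HDFS URI. Paths and the namenode address below are made up.
hadoop distcp file:///data/staging hdfs://namenode:8020/user/rahul/staging

# The plain, sequential alternative for smaller datasets:
hadoop fs -put /data/staging /user/rahul/staging
```

Note that distcp runs as a MapReduce job, so a file:/// source path must be readable from the nodes executing the copy maps.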
>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>> [EMAIL PROTECTED]> wrote:
>>
>>> IMHO, the statement about the NN with regard to block metadata is
>>> more of a general statement. Even if you put lots of small files of
>>> combined size 10 TB, you need to have a capable NN.
>>>
>>> Can distcp be used to copy local-to-HDFS?
>>>
>>> Thanks,
>>> Rahul
>>>
>>>
>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote:
>>>
>>>> Absolutely right, Mohammad.
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote:
>>>>
>>>>> Sorry for barging in guys. I think Nitin is talking about this :
>>>>>
>>>>> Every file and block in HDFS is treated as an object, and for each
>>>>> object around 200 B of metadata gets created. So the NN should be powerful
>>>>> enough to handle that much metadata, since it is all kept in memory.
>>>>> Memory is actually the most important resource when it comes to the NN.
>>>>>
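That per-object figure can be put in rough numbers. A back-of-the-envelope sketch, assuming ~200 bytes per object as stated above and counting one object per block or file for simplicity:

```shell
# Estimate NameNode heap needed for metadata at ~200 bytes/object:
# the same 10 TB stored as 128 MB blocks vs. as 1 MB files.
awk 'BEGIN {
  tb = 10 * 1024 * 1024;              # 10 TB expressed in MB
  big   = tb / 128;                   # object count at 128 MB per block
  small = tb / 1;                     # object count at 1 MB per file
  printf "128MB blocks: %d objects, ~%.0f MB of NN heap\n", big, big*200/(1024*1024);
  printf "1MB files:    %d objects, ~%.0f MB of NN heap\n", small, small*200/(1024*1024);
}'
```

Same 10 TB of data, but the small-file layout needs on the order of a hundred times more NN memory, which is exactly why a "capable NN" matters for many small files.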
>>>>> Am I correct @Nitin?
>>>>>
>>>>> @Thoihen : As Nitin has said, when you talk about that much data you
>>>>> don't actually just do a "put". You could use something like distcp for
>>>>> parallel copying. A better approach would be to use a data aggregation tool
>>>>> like Flume or Chukwa, as Nitin has already pointed out. Facebook uses its
>>>>> own data aggregation tool, called Scribe, for this purpose.
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote:
>>>>>
>>>>>> The NN would still be in the picture because it will be writing a lot of
>>>>>> metadata for each individual file, so you will need a NN capable of
>>>>>> storing the metadata for your entire dataset. The data itself never goes
>>>>>> to the NN, but a lot of metadata about that data will be on the NN, so it
>>>>>> is always a good idea to have a strong NN.
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> @Nitin , parallel dfs writes to HDFS are great, but I could not
>>>>>>> understand the meaning of a capable NN. As far as I know, the NN is not
>>>>>>> part of the actual data write pipeline, meaning the data does not
>>>>>>> travel through the NN; the dfs client contacts the NN from time to time
>>>>>>> to get the locations of the DNs where the data blocks should be stored.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Rahul
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>> Is it safe? There is no direct yes-or-no answer.
>>>>>>>>
>>>>>>>> When you say you have files worth 10 TB that you want to
>>>>>>>> upload to HDFS, several factors come into the picture:
>>>>>>>>
>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>
>>>>>>>> And most importantly, I assume that you have a capable hadoop
>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>
>>>>>>>> I would definitely not write files sequentially to HDFS. I would
>>>>>>>> prefer to write files in parallel to utilize the DFS write features
>>>>>>>> and speed up the process.
>>>>>>>> You can run the hdfs put command in parallel, and in my experience it
>>>>>>>> has not failed when we write a lot of data.
>>>>>>>>
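Nitin's suggestion of running puts in parallel can be sketched from a shell like this (the source and destination paths are hypothetical; 8 is an arbitrary concurrency level):

```shell
# Fan a local file listing out to up to 8 concurrent "hdfs dfs -put"
# processes. Each file is uploaded by its own put command, so several
# DFS write pipelines run at once. Paths below are examples only.
find /data/incoming -type f -print0 |
  xargs -0 -P 8 -I {} hdfs dfs -put {} /user/rahul/incoming/
```

The parallelism here is per-file on the client side; each individual file still flows through a single HDFS write pipeline.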
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <[EMAIL PROTECTED]>wrote:
>>>>>>>>
>>>>>>>>> @Nitin Pawar, thanks for clearing my doubts.
>>>>>>>>>
>>>
Later replies:
  Rahul Bhattacharjee 2013-05-12, 12:30
  Mohammad Tariq 2013-05-12, 12:10