Re: Hadoop noob question
Absolutely right, Mohammad.
On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:

> Sorry for barging in, guys. I think Nitin is talking about this:
>
> Every file and block in HDFS is treated as an object, and each object creates
> around 200 B of metadata. So the NN should be powerful enough to
> handle that much metadata, since it is all kept in memory. Memory is actually
> the most important metric when it comes to the NN.
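>
> As a rough, purely illustrative sizing (both the ~200 B per object figure and
> the object counts here are assumptions, not measurements):
>
>     100 million files + 100 million blocks = 200 million objects
>     200 million objects × 200 B ≈ 40 GB of NN heap for metadata alone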
>
> Am I correct @Nitin?
>
> @Thoihen : As Nitin has said, when you talk about that much data you don't
> actually just do a "put". You could use something like "distcp" for
> parallel copying. A better approach would be to use a data aggregation tool
> like Flume or Chukwa, as Nitin has already pointed out. Facebook uses its own
> data aggregation tool, Scribe, for this purpose.
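>
> For illustration, a distcp copy between two clusters looks roughly like this
> (the namenode hostnames and paths are placeholders):
>
>     hadoop distcp hdfs://source-nn:8020/data/logs hdfs://dest-nn:8020/data/logs
>
> distcp runs as a MapReduce job, so the copy is spread across the cluster
> instead of being funneled through a single client machine.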
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <[EMAIL PROTECTED]> wrote:
>
>> The NN would still be in the picture because it will be writing a lot of
>> metadata for each individual file. So you will need an NN capable enough to
>> store the metadata for your entire dataset. The data itself never goes to the
>> NN, but a lot of metadata about the data will be on the NN, so it is always a
>> good idea to have a strong NN.
>>
>>
>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <[EMAIL PROTECTED]> wrote:
>>
>>> @Nitin, parallel DFS writes to HDFS are great, but I could not
>>> understand the meaning of a capable NN. As far as I know, the NN would not be
>>> part of the actual data write pipeline, meaning that the data would not
>>> travel through the NN; the DFS client would contact the NN from time to time
>>> to get the locations of the DNs where the data blocks should be stored.
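>>>
>>> For example (the path is hypothetical), fsck shows exactly this NN-side view
>>> of a file, its blocks and the DNs holding them, without reading the data
>>> itself:
>>>
>>>     hadoop fsck /user/rahul/somefile.txt -files -blocks -locations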
>>>
>>> Thanks,
>>> Rahul
>>>
>>>
>>>
>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <[EMAIL PROTECTED]> wrote:
>>>
>>>> Is it safe? There is no direct yes or no answer.
>>>>
>>>> When you say you have files worth 10 TB and you want to upload them
>>>> to HDFS, several factors come into the picture:
>>>>
>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>> 2) Is there a guarantee that the network will not go down?
>>>>
>>>> And most importantly, I assume that you have a capable hadoop cluster.
>>>> By that I mean you have a capable namenode.
>>>>
>>>> I would definitely not write files to HDFS sequentially. I would prefer
>>>> to write files to HDFS in parallel, to utilize the DFS write features and
>>>> speed up the process.
>>>> You can run the hdfs put command in parallel, and in my experience it has
>>>> not failed when we write a lot of data.
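>>>>
>>>> As a rough sketch (the local directory layout and target path are made up),
>>>> one way to run several puts in parallel from a shell:
>>>>
>>>>     for dir in /data/incoming/part-*; do
>>>>         hadoop fs -put "$dir" /user/thoihen/input/ &
>>>>     done
>>>>     wait
>>>>
>>>> Each put streams through its own client process, so the uploads overlap
>>>> instead of running one after another.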
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> @Nitin Pawar, thanks for clearing my doubts.
>>>>>
>>>>> But I have one more question: say I have 10 TB of data in the pipeline.
>>>>>
>>>>> Is it perfectly OK to use the hadoop fs put command to upload these files
>>>>> of size 10 TB, and is there any limit to the file size when using the
>>>>> hadoop command line? Can the hadoop put command line work with huge data?
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> First of all, most companies do not get 100 PB of data in one go. It is an
>>>>>> accumulating process, and most companies have a data pipeline in place
>>>>>> where the data is written to hdfs on a regular frequency, retained on hdfs
>>>>>> for some duration as needed, and from there sent to archival storage or
>>>>>> deleted.
>>>>>>
>>>>>> For data management products, you can look at Falcon, which was open
>>>>>> sourced by InMobi along with Hortonworks.
>>>>>>
>>>>>> In any case, if you want to write files to hdfs, there are a few options
>>>>>> available to you:
>>>>>> 1) write your own DFS client which writes to DFS
>>>>>> 2) use the HDFS proxy
>>>>>> 3) use webhdfs (see the sketch after this list)
>>>>>> 4) use the hdfs command line
>>>>>> 5) data collection tools come with support to write to hdfs like
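>>>>>>
>>>>>> As one concrete illustration of option 3, a webhdfs upload is a two-step
>>>>>> PUT (hostname, port, user and paths below are placeholders):
>>>>>>
>>>>>>     # step 1: ask the NN to create the file; it answers with a 307 redirect
>>>>>>     curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/thoihen/file.txt?op=CREATE&user.name=thoihen"
>>>>>>     # step 2: send the data to the DN address given in the Location header
>>>>>>     curl -i -X PUT -T file.txt "<location-url-from-step-1>"
>>>>>>
>>>>>> The data goes straight to a datanode; the NN only hands out the redirect,
>>>>>> which matches the metadata-only role discussed earlier in the thread.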
Nitin Pawar