Re: data distribution in HDFS
AFAIK there is no way to disable this "feature". It is an optimization: it happens because, in your case, the node generating the data is also a datanode.
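
To illustrate the effect, here is a rough Python sketch. It is not Hadoop code, just a toy model. It assumes a single rack, and that every replica after the first lands on a random other datanode; the real placement policy also looks at racks and node state. The node names and the `place_block` and `simulate` helpers are made up for the illustration.

```python
import random
from collections import Counter


def place_block(nodes, writer, replication, rng):
    """Pick datanodes for one block under a simplified placement model."""
    if writer in nodes:
        # The writer is itself a datanode: the first replica stays local.
        first = writer
    else:
        # The writer is outside the cluster: the first replica goes to a random node.
        first = rng.choice(nodes)
    others = [n for n in nodes if n != first]
    # Remaining replicas go to distinct other nodes (rack rules are ignored here).
    return [first] + rng.sample(others, replication - 1)


def simulate(nodes, writer, num_blocks, replication=3, seed=0):
    """Return the number of block replicas each node holds after the write."""
    rng = random.Random(seed)
    counts = Counter({n: 0 for n in nodes})
    for _ in range(num_blocks):
        counts.update(place_block(nodes, writer, replication, rng))
    return counts


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical names
    block_mb = 64
    num_blocks = 10 * 1000 // block_mb  # roughly 10GB written as 64MB blocks
    for writer in ("node1", "client-outside-cluster"):
        counts = simulate(nodes, writer, num_blocks)
        # Print the approximate MB stored per node for each writer location.
        print(writer, {n: counts[n] * block_mb for n in nodes})
```

In this model, with a replication factor of 3, the writing node holds one replica of every block, which is a third of the raw data, and the other four nodes share the remaining two thirds. When the writer is not a datanode, the first replica is spread over all nodes as well.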

Raj

>________________________________
> From: Stijn De Weirdt <[EMAIL PROTECTED]>
>To: [EMAIL PROTECTED]
>Sent: Monday, April 2, 2012 12:18 PM
>Subject: Re: data distribution in HDFS
>
>thanks serge.
>
>
>is there a way to disable this "feature" (i.e. always placing the first
>block on the local node)?
>and is this because the local node is a datanode? or is there always a
>"local node" with data transfers?
>
>many thanks,
>
>stijn
>
>> The local node is the node you are copying the data from,
>> for example when you use the -copyFromLocal option.
>>
>>
>> Regards
>> Serge
>>
>> On 4/2/12 11:53 AM, "Stijn De Weirdt"<[EMAIL PROTECTED]>  wrote:
>>
>>> hi raj,
>>>
>>> what is a "local node"? is it relative to the tasks that are started?
>>>
>>>
>>> stijn
>>>
>>> On 04/02/2012 07:28 PM, Raj Vishwanathan wrote:
>>>> Stijn,
>>>>
>>>> The first replica of each block is always stored on the local node.
>>>> Assuming a replication factor of 3, the node that generates
>>>> the data will get about 10GB of data and the other 20GB will be
>>>> distributed among the other nodes.
>>>>
>>>> Raj
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> ________________________________
>>>>> From: Stijn De Weirdt<[EMAIL PROTECTED]>
>>>>> To: [EMAIL PROTECTED]
>>>>> Sent: Monday, April 2, 2012 9:54 AM
>>>>> Subject: data distribution in HDFS
>>>>>
>>>>> hi all,
>>>>>
>>>>> i've just started to play around with hdfs+mapred. i'm currently
>>>>> playing with teragen/sort/validate to see if i understand everything.
>>>>>
>>>>> the test setup involves 5 nodes that are all tasktracker and datanode
>>>>> (one node is also jobtracker and namenode on top of that; this one
>>>>> node runs both the namenode process and the datanode process).
>>>>>
>>>>> when i do the teragen run, the data is not distributed equally over
>>>>> all nodes. the node that is also namenode gets a bigger portion of
>>>>> all the data (as seen by df on the nodes and by using dfsadmin -report).
>>>>> i also get this distribution when i run the TestDFSIO write test (50
>>>>> files of 1GB).
>>>>>
>>>>>
>>>>> i use the basic command line  teragen $((100*1000*1000))
>>>>> /benchmarks/teragen, so i expect 100M*0.1kB = 10GB of data. (if i add
>>>>> the volumes in use by hdfs, it's actually quite a bit more.)
>>>>> 4 datanodes are using 4.2-4.8GB, and the data+namenode has 9.4GB in
>>>>> use. so this one datanode is seen as 2 nodes.
>>>>>
>>>>> when i do ls on the filesystem, i see that teragen created 250MB
>>>>> files, the current hdfs blocksize is 64MB.
>>>>>
>>>>> is there a reason why one datanode is preferred over the others?
>>>>> it is annoying since the terasort output behaves the same way, and i
>>>>> can't use the full hdfs space for testing. also, since more IO goes
>>>>> to this one node, the performance isn't really balanced.
>>>>>
>>>>> many thanks,
>>>>>
>>>>> stijn
>>>>>