Re: disk used percentage is not symmetric on datanodes (balancer)
2013/3/19 Tapas Sarangi <[EMAIL PROTECTED]>

>
> On Mar 19, 2013, at 5:00 AM, Алексей Бабутин <[EMAIL PROTECTED]>
> wrote:
>
> node A=12TB
> node B=72TB
> How many A nodes and how many B nodes do you have out of the 200?
>
>
> We have more A nodes than B; the ratio is about 80:20. Note that not all
> the B nodes are 72 TB, that's a maximum value. Similarly, for A, 12 TB is
> a minimum value.
>
>
> If you have more B than A you can deactivate A, clear it and apply again.
>
>
> Apply what? That may not be an option for an active system, and it may
> cripple us for days.
>
> I suppose the cluster is about 3-5 TB. Run the balancer with a threshold of 0.2 or 0.1.
>
>
> If you meant 3.5 PB, then you are about right. What does this threshold
> do exactly? We are not setting the threshold manually, but isn't hadoop's
> default 0.1?
>
>
> Different servers in one rack is a bad idea. You should rebuild the
> cluster with multiple racks.
>
>
> Why is it a bad idea? We are using hadoop as a file system, not as a
> scheduler. How are multiple racks going to help in balancing the disk
> usage across datanodes?
>
See dfs.balance.bandwidthPerSec in hdfs-site.xml. I think the balancer
can't help you, because it makes all the nodes equal; they can differ only
within the balancer threshold. The threshold is 10 by default, which means
nodes can differ by up to 350 TB from each other in a 3.5 PB cluster. With
threshold = 1 it is up to 35 TB, and so on.
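
For reference, a minimal sketch of both knobs on 0.20 (the 10 MB/s value
and the threshold of 1 below are only examples, not recommendations):

  <!-- hdfs-site.xml: cap how fast each datanode moves blocks for the balancer -->
  <property>
    <name>dfs.balance.bandwidthPerSec</name>
    <value>10485760</value>
  </property>

  # run the balancer with a tighter threshold (percent of utilization)
  hadoop balancer -threshold 1

You can watch per-datanode usage with 'hadoop dfsadmin -report' while the
balancer runs.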
In the ideal case with replication factor 2, with two nodes of 12 TB and
72 TB you will be able to store only 12 TB of replicated data, because
every block needs its second copy on the other node.

The best way, in my opinion, is to use multiple racks. Nodes in a rack
must have identical capacity, and the racks must have identical total
capacity.
For example:

rack1: 1 node with 72 TB
rack2: 6 nodes with 12 TB
rack3: 3 nodes with 24 TB

It helps with balancing, because a duplicated block must be placed on
another rack.
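
For rack awareness on 0.20 you point Hadoop at a topology script; a
minimal sketch (the script path and the IP-to-rack mapping here are made
up for illustration):

  <!-- core-site.xml: script that maps a host/IP to a rack -->
  <property>
    <name>topology.script.file.name</name>
    <value>/etc/hadoop/rack-topology.sh</value>
  </property>

  #!/bin/sh
  # rack-topology.sh: print one rack path per host/IP argument
  while [ $# -gt 0 ]; do
    case "$1" in
      10.0.1.*) echo /rack1 ;;
      10.0.2.*) echo /rack2 ;;
      10.0.3.*) echo /rack3 ;;
      *)        echo /default-rack ;;
    esac
    shift
  done

With that in place, HDFS places the second replica of each block on a
different rack from the first, which is what spreads the data across racks
of equal capacity.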

Why did you select HDFS? Maybe Lustre, CephFS or something else is a
better choice.
>
> -Tapas
>
>
>
> 2013/3/19 Tapas Sarangi <[EMAIL PROTECTED]>
>
>> Hello,
>>
>> I am using one of the old legacy versions (0.20) of hadoop for our
>> cluster. We have scheduled an upgrade to a newer version within a
>> couple of months, but I would like to understand a couple of things before
>> moving toward the upgrade plan.
>>
>> We have about 200 datanodes and some of them have larger storage than
>> others. The storage for the datanodes varies between 12 TB and 72 TB.
>>
>> We found that the disk-used percentage is not symmetric across all the
>> datanodes. For larger storage nodes the percentage of disk space used is
>> much lower than that of other nodes with smaller storage space. In larger
>> storage nodes the percentage of used disk space varies, but is on average
>> about 30-50%. For the smaller storage nodes this number is as high as
>> 99.9%. Is this expected? If so, then we are not using a lot of the disk
>> space effectively. Is this solved in a future release?
>>
>> If not, I would like to know if there are any checks/debugging steps one
>> can do to find an improvement with the current version, or whether
>> upgrading hadoop should solve this problem.
>>
>> I am happy to provide additional information if needed.
>>
>> Thanks for any help.
>>
>> -Tapas
>>
>>
>
>