I am currently running Hadoop 0.20.203 in production with 600 TB of data in it.
I am planning to enable rack awareness in production, but I have not
thought it all the way through yet.
1. I have a script that can resolve a datanode/tasktracker IP to a rack name.
2. Add topology.script.file.name to hdfs-site.xml and restart the cluster.
3. After the cluster comes back up, my questions start here:
- Do I have to run the balancer, fsck, or some other command to get those
600 TB redistributed across the different racks in one go?
- Currently I run the balancer for 2 hours every day. Can I keep this
routine and hope that at some point the data will be nicely
redistributed and rack-aware?
- How can we know that the data in the cluster is now fully rack-aware?
- If I just add the script and run the balancer for 2 hours every day, then
until all the data becomes rack-aware, the cluster will hold a
mix of existing data on "default-rack" (not yet rebalanced)
and, probably, newly loaded data that is rack-aware.
Is it OK to have a mix of default-rack and rack-specific data together?
4. Any thoughts?
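To make step 1 concrete, here is a minimal sketch of the kind of script I mean.
It is not my real script: the IP prefixes, the rack names, and the
resolve_rack helper are placeholders for illustration only. It assumes the
usual contract that Hadoop calls the script with one or more addresses as
arguments and reads one rack path per address from stdout.

```python
#!/usr/bin/env python
"""Hypothetical topology script: maps each address argument to a rack path."""
import sys

# Placeholder mapping from IP prefix to rack name (not a real layout).
RACK_BY_PREFIX = {
    "10.1.1.": "/rack1",
    "10.1.2.": "/rack2",
}

# Fallback for addresses that match no known prefix.
DEFAULT_RACK = "/default-rack"


def resolve_rack(address):
    """Return the rack path for one address, or the default rack if unknown."""
    for prefix, rack in RACK_BY_PREFIX.items():
        if address.startswith(prefix):
            return rack
    return DEFAULT_RACK


def main(argv):
    # Print one rack path per argument, space-separated, in argument order.
    print(" ".join(resolve_rack(address) for address in argv[1:]))


if __name__ == "__main__":
    main(sys.argv)
```

Any address the script does not recognise falls back to "/default-rack",
and all rack paths are kept at the same depth.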
Hope this makes sense,
Thanks in advance