As far as I understand, the Bulk Import functionality does not take data
locality into account. The MR job creates as many reducer tasks as there are
regions to write into, but it does not "advise" the framework on which nodes
to run these tasks. As a result, the reducer task that writes the HFiles of
a region may not run on the same node as the RS that serves that region.
Given the way HDFS writes data, there will (likely) be one full replica of
the blocks of that region's HFiles on the node where the reducer task ran,
and the other replicas (if replication > 1) will be distributed randomly
over the cluster. Thus the RS serving that region will (most likely) not
read local data; the data will be transferred from other datanodes. I.e.,
data locality will be broken.
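
To put a rough number on the concern, here is a small Python sketch of the
placement described above. It is a toy model, not HBase or HDFS code: the
function names are made up for illustration, nodes are plain integers, rack
awareness is ignored, and I assume the reducer node and the RS node are each
picked uniformly at random.

```python
import random


def place_replicas(writer_node, num_nodes, replication, rng):
    """Toy HDFS placement: first replica on the writer's node, the rest on
    distinct other nodes chosen at random (assumption: no rack awareness)."""
    others = [n for n in range(num_nodes) if n != writer_node]
    extra = min(replication, num_nodes) - 1
    return {writer_node, *rng.sample(others, extra)}


def expected_locality(num_nodes, replication):
    """Chance that the RS node holds a replica of the region's HFile blocks
    when the reducer node is chosen independently of the RS node."""
    return min(replication, num_nodes) / num_nodes


def simulate_locality(num_nodes, num_regions, replication, seed=0):
    """Fraction of regions whose RS ends up with a local replica."""
    rng = random.Random(seed)
    local = 0
    for _ in range(num_regions):
        rs_node = rng.randrange(num_nodes)       # node of the RS serving the region
        reducer_node = rng.randrange(num_nodes)  # node the MR framework happened to pick
        replicas = place_replicas(reducer_node, num_nodes, replication, rng)
        if rs_node in replicas:
            local += 1
    return local / num_regions


if __name__ == "__main__":
    print(expected_locality(20, 3))
    print(simulate_locality(20, 10_000, 3))
```

Under these assumptions the chance of a local replica is about
replication / number of nodes, so it shrinks as the cluster grows.
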
Is this correct?
If yes, I guess it would help if we could tell the MR framework where (on
which nodes) to launch certain reducer tasks. I believe this is not
possible with MR1; please correct me if I'm wrong. Perhaps this is
possible with MR2?
I assume there's also no way to provide a "hint" to the NameNode about where
to place the blocks of a new file, right?