Ok but does that imply that only 1 of your compute nodes is promised
to have all of the data for any given row? The blocks will replicate,
but they don't necessarily all replicate to the same nodes right?
So if I have say 2 column families (cf1, cf2) and there is 2 physical
files on the HDFS for those (per region) then those files are created
on one datanode (dn1) which will have all blocks local to that node.
Once it replicates those blocks 2 more times by default, isn't it
possible the blocks for cf1 will go to dn2, dn3 while the blocks for
cf2 goes to dn4, dn5?
On Wed, Aug 29, 2012 at 12:47 AM, N Keywal <[EMAIL PROTECTED]> wrote:
> Locations are per block (a file is a set of blocks, a block is replicated on
> multiple hdfs datanodes).
> We have locality in HBase because hdfs datanodes are deployed on the same
> box as the hbase regionserver and hdfs writes one replica of the blocks on
> the datanode the same machine as the client (i.e. the regionserver from hdfs
> point of view).
> On Wed, Aug 29, 2012 at 6:20 AM, Robert Dyer <[EMAIL PROTECTED]> wrote:
>> I have been reading up on HBase and my understanding is that the
>> physical files on the HDFS are split first by region and then by
>> column families.
>> Thus each column family has its own physical file (on a per-region basis).
>> If I run a MapReduce task that uses the HBase as input, wouldn't this
>> imply that if the task reads from more than 1 column family the data
>> for that row might not be (entirely) local to the task?
>> Is there a way to tell the HDFS to keep blocks of each region's column
>> families together?