HBase user mailing list - TableInputFormat vs. a map of table regions (data locality)


Re: TableInputFormat vs. a map of table regions (data locality)
Hi Lars,

Perfect. Thanks for confirming. I have some existing code to which I
want to add HBase support with minimal modifications to the original
code base. I think I need to provide an InputFormat that produces
TableSplits.
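
Roughly, such an InputFormat could look like the sketch below (a minimal
sketch only, assuming the 0.90-era client API; the class name, the table
name "my.table", and the omitted record reader are all placeholders):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableSplit;
import org.apache.hadoop.hbase.util.Pair;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Emits one TableSplit per region, carrying the region server's hostname so
// the MapReduce scheduler can place each map task near its data.
public class RegionAwareInputFormat extends InputFormat<ImmutableBytesWritable, Result> {

  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    HTable table = new HTable(
        HBaseConfiguration.create(context.getConfiguration()), "my.table");
    Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (int i = 0; i < keys.getFirst().length; i++) {
      // Hostname of the region server holding this region's start key.
      String location = table.getRegionLocation(keys.getFirst()[i])
          .getServerAddress().getHostname();
      splits.add(new TableSplit(table.getTableName(),
          keys.getFirst()[i], keys.getSecond()[i], location));
    }
    return splits;
  }

  @Override
  public RecordReader<ImmutableBytesWritable, Result> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    // Not sketched here: open a Scan bounded by the split's start/end rows,
    // much like HBase's own TableRecordReader does.
    throw new UnsupportedOperationException("record reader omitted in this sketch");
  }
}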

On a side note, I feel the keys and values in the map, reduce, and
record reader methods should be interfaces rather than concrete classes
(I guess there is a reason for the change). Keys and values should
conform to a contract, but do they need to sit in a class hierarchy?

Cheers
Joy
On Wed, Nov 17, 2010 at 11:55 PM, Lars George <[EMAIL PROTECTED]> wrote:
> Hi Joy,
>
> [1] is what [2] does internally. The two formats are just a thin
> wrapper around the raw API.
>
> And as Alex pointed out, and as you noticed too, [2] adds the benefit
> of locality support. If you were to add that to [1], you would end up
> with [2].
>
> Lars
>
> On Thu, Nov 18, 2010 at 5:30 AM, Saptarshi Guha
> <[EMAIL PROTECTED]> wrote:
>> Hello,
>>
>> I'm fairly new to HBase and would appreciate your comments.
>>
>> [1] One way to compute across an HBase dataset would be to run as
>> many maps as there are regions and, in each map, run a scan over that
>> region's row range (within the map method). This approach does not use
>> TableInputFormat. In the reduce (if needed), write directly (using Put)
>> to the table.
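
For reference, the map-side inner loop of approach [1] might look roughly
like the sketch below (0.90-era client API assumed; the class and method
names are illustrative, and how the region's start/end rows reach the map
task is left open):

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

// The map-side inner loop of approach [1]: scan only this region's key range.
public class RegionScanSketch {
  static void scanRegion(HTable table, byte[] startRow, byte[] endRow)
      throws IOException {
    Scan scan = new Scan(startRow, endRow);       // bounded to one region
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        // per-row work goes here, e.g. emit (row key, value) to the reducer
      }
    } finally {
      scanner.close();                            // always release the scanner
    }
  }
}

The reduce side would then issue Puts against the output table directly,
which is essentially what TableOutputFormat does for you.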
>>
>>
>> [2] In the *second* approach I could use the TableInputFormat and
>> TableOutputFormat.
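
For comparison, the usual wiring for approach [2] goes through
TableMapReduceUtil, along these lines (a sketch only; the table names
"source" and "target", the column family "cf", and the mapper/reducer
class names are placeholders):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

// Reads every row of "source" via TableInputFormat and writes a Put for each
// row to "target" via TableOutputFormat.
public class TableToTableSketch {

  static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
        throws IOException, InterruptedException {
      Put put = new Put(rowKey.get());
      // Copy the first cell's value into a placeholder column; real per-row
      // work would go here instead.
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("copied"), columns.value());
      context.write(rowKey, put);
    }
  }

  static class CopyReducer
      extends TableReducer<ImmutableBytesWritable, Put, ImmutableBytesWritable> {
    @Override
    protected void reduce(ImmutableBytesWritable rowKey, Iterable<Put> puts, Context context)
        throws IOException, InterruptedException {
      for (Put put : puts) {
        context.write(rowKey, put);   // TableOutputFormat applies the Put to "target"
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "table-to-table sketch");
    job.setJarByClass(TableToTableSketch.class);
    Scan scan = new Scan();          // full-table scan; restrict columns as needed
    scan.setCaching(500);            // bigger scanner caching helps MR scans
    scan.setCacheBlocks(false);      // don't churn the block cache from a batch job
    TableMapReduceUtil.initTableMapperJob("source", scan, CopyMapper.class,
        ImmutableBytesWritable.class, Put.class, job);
    TableMapReduceUtil.initTableReducerJob("target", CopyReducer.class, job);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}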
>>
>> My hypotheses:
>>
>> H1: As for TableOutputFormat, I think the two approaches are
>> equivalent performance-wise. Correct me if I'm wrong.
>>
>> H2: As for TableInputFormat vs. approach [1]: a quick glance through
>> the TableSplit source reveals location information. At first blush I
>> can imagine that in approach [1] I scan from row_start to row_end while
>> all of that data resides on a machine different from the compute node
>> on which the map is being run. Since TableInputFormat (approach [2])
>> uses region information, my guess (not sure at all) is that Hadoop
>> MapReduce will assign the computation to the node where the region
>> lives, so when the scan is issued the queries will run against local
>> data, achieving data locality. So it makes sense to take advantage of
>> (at the least) the TableSplit information.
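
The location hint behind H2 can be made concrete with a small probe that
lists the splits TableInputFormat produces and the region server hostname
each one carries (illustrative only; "my.table" is a placeholder and a
running HBase cluster is assumed):

import java.util.Arrays;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;

// Prints each split TableInputFormat would hand to the job, together with
// the region server hostname(s) the scheduler will prefer for that split.
public class SplitLocations {
  public static void main(String[] args) throws Exception {
    Job job = new Job(HBaseConfiguration.create(), "split-locations");
    job.getConfiguration().set(TableInputFormat.INPUT_TABLE, "my.table");
    TableInputFormat inputFormat = new TableInputFormat();
    inputFormat.setConf(job.getConfiguration());
    for (InputSplit split : inputFormat.getSplits(job)) {
      System.out.println(split + " -> " + Arrays.toString(split.getLocations()));
    }
  }
}

If the printed hostnames match the region servers, the scheduler can run
each map on (or near) that host, which is the locality benefit described
above.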
>>
>> Are my hypotheses correct?
>>
>> Thanks
>> Joy
>>
>