Re: TableInputFormat vs. a map of table regions (data locality)
Saptarshi Guha 2010-11-18, 17:38
Hi Lars,

Perfect. Thanks for confirming. I have some existing code to which I want
to add HBase support with minimal modifications to the original code base.
I think I need to provide an InputFormat that returns TableSplits.
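
For the record, a rough sketch of how I plan to build those splits (the
class name is mine, and this assumes the org.apache.hadoop.hbase.mapreduce
API, so treat it as a sketch rather than tested code):

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.mapreduce.TableSplit;
  import org.apache.hadoop.hbase.util.Pair;
  import org.apache.hadoop.mapreduce.InputSplit;

  // One TableSplit per region, so the scheduler can place each map
  // task on the region server hosting that region's data.
  public class RegionSplitBuilder {
    public static List<InputSplit> buildSplits(Configuration conf, String tableName)
        throws IOException {
      HTable table = new HTable(conf, tableName);
      Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
      List<InputSplit> splits = new ArrayList<InputSplit>();
      for (int i = 0; i < keys.getFirst().length; i++) {
        byte[] startRow = keys.getFirst()[i];
        byte[] endRow = keys.getSecond()[i];
        // The hosting region server becomes the split's preferred location.
        String location = table.getRegionLocation(startRow)
            .getServerAddress().getHostname();
        splits.add(new TableSplit(table.getTableName(), startRow, endRow, location));
      }
      return splits;
    }
  }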

On a side note, I feel the keys and values in the map, reduce, and record
reader methods should be interfaces and not classes (I guess there is a
reason for the change). Keys and values should conform to a contract, but
do they need to sit in a class hierarchy?
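
(To illustrate the kind of contract I mean: a custom key only has to
satisfy WritableComparable. RowKey below is a made-up example and needs no
framework base class:)

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;

  import org.apache.hadoop.io.WritableComparable;

  // Honors the serialization and ordering contract without extending
  // any framework class.
  public class RowKey implements WritableComparable<RowKey> {
    private long id;

    public RowKey() {}                       // no-arg constructor for Hadoop
    public RowKey(long id) { this.id = id; }

    public void write(DataOutput out) throws IOException { out.writeLong(id); }
    public void readFields(DataInput in) throws IOException { id = in.readLong(); }
    public int compareTo(RowKey other) {
      return id < other.id ? -1 : (id == other.id ? 0 : 1);
    }
  }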

Cheers
Joy
On Wed, Nov 17, 2010 at 11:55 PM, Lars George <[EMAIL PROTECTED]> wrote:
> Hi Joy,
>
> [1] is what [2] does. It is just a thin wrapper around the raw API.
>
> And as Alex pointed out, and as you noticed too, [2] adds the benefit of
> locality support. If you were to add this to [1], you would end up with
> [2].
>
> Lars
>
> On Thu, Nov 18, 2010 at 5:30 AM, Saptarshi Guha
> <[EMAIL PROTECTED]> wrote:
>> Hello,
>>
>> I'm fairly new to HBase and would appreciate your comments.
>>
>> [1] One way to compute across an HBase dataset would be to run as many
>> maps as there are regions and, within each map method, run a scan across
>> that region's row boundaries. This approach does not use TableInputFormat.
>> In the reduce (if needed), write directly to the table using Put.
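>>
>> For concreteness, something like this inside the map method (the table
>> and family names are made up, and startRow/stopRow would come from my
>> own per-region split; imports omitted):
>>
>>   // Scan only this region's row range, from within map().
>>   HTable table = new HTable(conf, "mytable");
>>   Scan scan = new Scan(startRow, stopRow);  // the region's boundaries
>>   scan.addFamily(Bytes.toBytes("cf"));
>>   ResultScanner scanner = table.getScanner(scan);
>>   try {
>>     for (Result row : scanner) {
>>       // process each row, emit intermediate key/values as usual
>>     }
>>   } finally {
>>     scanner.close();
>>   }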
>>
>>
>> [2] In the *second* approach, I could use TableInputFormat and
>> TableOutputFormat.
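>>
>> If I read the API right, the wiring would look roughly like this
>> (MyMapper/MyReducer and the map output key/value types are placeholders):
>>
>>   Scan scan = new Scan();
>>   scan.setCaching(500);         // fewer round trips per map
>>   scan.setCacheBlocks(false);   // don't churn the block cache from MR
>>   Job job = new Job(conf, "table-scan");
>>   TableMapReduceUtil.initTableMapperJob("mytable", scan,
>>       MyMapper.class, Text.class, LongWritable.class, job);
>>   TableMapReduceUtil.initTableReducerJob("mytable", MyReducer.class, job);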
>>
>> My hypotheses:
>>
>> H1: As for TableOutputFormat, I think both approaches are equivalent
>> performance-wise. Correct me if I'm wrong.
>>
>> H2: As for TableInputFormat vs. approach [1]: a quick glance through the
>> TableSplit source reveals location information. At first blush, I can
>> imagine that in approach [1] I would scan from row_start to row_end while
>> all of that data resides on a machine different from the compute node on
>> which the split is being run. Since TableInputFormat (approach [2]) uses
>> region information, my guess (not sure at all) is that Hadoop MapReduce
>> will assign the computation to the node where the region lies, so when
>> the scan is issued the queries run against local data, achieving data
>> locality. So it makes sense to take advantage of (at the least) the
>> TableSplit information.
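>>
>> To check this, I was thinking of printing where each split wants to run.
>> A rough sketch (the table name is made up, and this assumes a running
>> cluster reachable from the client):
>>
>>   import java.util.List;
>>
>>   import org.apache.hadoop.conf.Configuration;
>>   import org.apache.hadoop.hbase.HBaseConfiguration;
>>   import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
>>   import org.apache.hadoop.hbase.mapreduce.TableSplit;
>>   import org.apache.hadoop.hbase.util.Bytes;
>>   import org.apache.hadoop.mapreduce.InputSplit;
>>   import org.apache.hadoop.mapreduce.Job;
>>
>>   public class SplitLocations {
>>     public static void main(String[] args) throws Exception {
>>       Configuration conf = HBaseConfiguration.create();
>>       conf.set(TableInputFormat.INPUT_TABLE, "mytable");
>>       Job job = new Job(conf);
>>       TableInputFormat tif = new TableInputFormat();
>>       tif.setConf(job.getConfiguration());
>>       // Each split carries the hostname of the region server holding
>>       // its row range; the scheduler tries to run the map there.
>>       List<InputSplit> splits = tif.getSplits(job);
>>       for (InputSplit split : splits) {
>>         TableSplit ts = (TableSplit) split;
>>         System.out.println(Bytes.toStringBinary(ts.getStartRow()) + " - "
>>             + Bytes.toStringBinary(ts.getEndRow())
>>             + " @ " + ts.getLocations()[0]);
>>       }
>>     }
>>   }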
>>
>> Are my hypotheses correct?
>>
>> Thanks
>> Joy
>>
>