HBase, mail # user - Question about the time to execute joins in HBase!


Re: Question about the time to execute joins in HBase!
Michael Segel 2013-08-22, 17:58
Pig and Hive will generate a map/reduce job.

So you have three tables that you want to join.

OK, so one has 60 million rows, one has 2 million, and one has 1 million.

What sort of join?

Can you write your join in terms of a relationship?
Could you write it as SQL-like code?

For example: join table A to table B ON A.x = B.x?

Are you filtering too?

The trick: you need to get a list of all distinct values of X for each table. This is why you need two different indexes, one per table, to make the joins faster.

Here's the trick...

Suppose you want to join two tables on column foo, and foo is a foreign key (FK) between the tables.

You need an inverted table for each table, A.foo_idx and B.foo_idx, plus a row for each of them in yet another inverted table, idx_master.
In idx_master the row key is the index table name (for example A.foo_idx) and the columns contain the unique row keys of that index.

And that's the confusing part.

But it works.
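
The index layout described above can be sketched in plain Java. This is an in-memory stand-in under stated assumptions, not HBase code: sorted maps play the role of A.foo_idx and B.foo_idx, a sorted set plays the role of one idx_master row, and the names IndexJoinSketch, buildIndex, masterRow and join are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory stand-in for the index tables; no HBase API is called here.
class IndexJoinSketch {

    // Inverted index for one table: foo value -> sorted row keys holding that value.
    static NavigableMap<String, NavigableSet<String>> buildIndex(Map<String, String> rowKeyToFoo) {
        NavigableMap<String, NavigableSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> e : rowKeyToFoo.entrySet()) {
            index.computeIfAbsent(e.getValue(), k -> new TreeSet<>()).add(e.getKey());
        }
        return index;
    }

    // Stand-in for one idx_master row: the distinct foo values found in an index.
    static NavigableSet<String> masterRow(NavigableMap<String, NavigableSet<String>> index) {
        return new TreeSet<>(index.keySet());
    }

    // Join on foo: intersect the distinct values first, then pair up row keys.
    // Each result is {foo, rowKeyFromA, rowKeyFromB}.
    static List<String[]> join(NavigableMap<String, NavigableSet<String>> aIdx,
                               NavigableMap<String, NavigableSet<String>> bIdx) {
        NavigableSet<String> shared = masterRow(aIdx);
        shared.retainAll(masterRow(bIdx));
        List<String[]> out = new ArrayList<>();
        for (String foo : shared) {
            for (String a : aIdx.get(foo)) {
                for (String b : bIdx.get(foo)) {
                    out.add(new String[] {foo, a, b});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> a = new TreeMap<>();
        a.put("a1", "x");
        a.put("a2", "y");
        a.put("a3", "x");
        Map<String, String> b = new TreeMap<>();
        b.put("b1", "x");
        b.put("b2", "z");
        for (String[] row : join(buildIndex(a), buildIndex(b))) {
            System.out.println(String.join(",", row));
        }
    }
}
```

The point of the sketch is that only foo values present in both indexes are ever expanded into row-key pairs, so values that exist in one table only cost nothing beyond the set intersection.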

Again, the downside is that you need to make sure an index row doesn't grow larger than a region (a row cannot span regions), and there's a trick to splitting the rows while still keeping them in sort order....
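
The message does not say what the splitting trick is. One possible approach, shown here purely as an assumption, is to break a wide index row into chunks and append a zero-padded chunk number to the row key, so that the chunks of one value sort next to each other and in order. The class name, the "|" separator (assumed not to occur in index values) and the six-digit padding are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Plain-Java sketch of chunked index rows; no HBase API is called here.
class WideRowSplitSketch {

    // Zero padding keeps lexicographic order equal to numeric order (chunk 9 before chunk 10).
    static String chunkKey(String indexKey, int chunk) {
        return String.format("%s|%06d", indexKey, chunk);
    }

    // Splits one logical index row into chunk rows of at most maxColumnsPerRow entries each.
    static NavigableMap<String, List<String>> split(String indexKey,
                                                    List<String> sortedRowKeys,
                                                    int maxColumnsPerRow) {
        NavigableMap<String, List<String>> rows = new TreeMap<>();
        for (int i = 0; i < sortedRowKeys.size(); i += maxColumnsPerRow) {
            int end = Math.min(i + maxColumnsPerRow, sortedRowKeys.size());
            rows.put(chunkKey(indexKey, i / maxColumnsPerRow),
                     new ArrayList<>(sortedRowKeys.subList(i, end)));
        }
        return rows;
    }
}
```

Reading the chunk rows back in key order then returns the original entries in their original sort order, because each chunk holds a contiguous slice of the sorted list.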

And that's the basics.

On Aug 22, 2013, at 11:02 AM, Pavan Sudheendra <[EMAIL PROTECTED]> wrote:

> FYI i'm here to just getting other views on how much would it run in their
> system compared to mine?
>
> because just to process 600,000 map input records in an hour is just
> wrong.. And it doesn't even show any map % increase.. Its at 0% throughout.
>
>
> On Thu, Aug 22, 2013 at 9:18 PM, Pavan Sudheendra <[EMAIL PROTECTED]>wrote:
>
>> Yes Michael i think so.. I was googling about what you said.. I'm afraid
>> i'm not aware of most of the terms.. I'm still yet to learn but don't have
>> much time. :(
>>
>>
>> On Thu, Aug 22, 2013 at 9:16 PM, Michael Segel <[EMAIL PROTECTED]>wrote:
>>
>>> You kind of have two threads along the same lines.
>>>
>>> See my response in your other thread...
>>>
>>> On Aug 22, 2013, at 10:41 AM, Pavan Sudheendra <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> scan.setCaching(500);
>>>>
>>>> I really don't understand this purpose though..
>>>>
>>>>
>>>> On Thu, Aug 22, 2013 at 9:09 PM, Kevin O'dell <[EMAIL PROTECTED]
>>>> wrote:
>>>>
>>>>> QQ what is your caching set to?
>>>>> On Aug 22, 2013 11:25 AM, "Pavan Sudheendra" <[EMAIL PROTECTED]>
>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> A serious question.. I know this isn't one of the best hbase practices
>>>>> but
>>>>>> I really want to know..
>>>>>>
>>>>>> I am doing a join across 3 table in hbase.. One table contain 19m
>>>>> records,
>>>>>> one contains 2m and another contains 1m records.
>>>>>>
>>>>>> I'm doing this inside the mapper function.. I know this can be done
>>> with
>>>>>> pig and hive etc. Leaving the specifics out, how long would experts
>>> think
>>>>>> it would take for the mapper to finish aggregating them across a 6
>>> node
>>>>>> cluster.. One is the job tracker and 5 are task trackers.. By the
>>> time I
>>>>>> see the map reduce job status for input records reach 600,000 it's
>>> taking
>>>>>> an hour.. It can't be right..
>>>>>>
>>>>>> Any tips? Please help.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> --
>>>>>> Regards-
>>>>>> Pavan
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Regards-
>>>> Pavan
>>>
>>> The opinions expressed here are mine, while they may reflect a cognitive
>>> thought, that is purely accidental.
>>> Use at your own risk.
>>> Michael Segel
>>> michael_segel (AT) hotmail.com
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Regards-
>> Pavan
>>
>
>
>
> --
> Regards-
> Pavan

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental.
Use at your own risk.
Michael Segel
michael_segel (AT) hotmail.com