Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive, mail # user - Help with Join involving Non-Equality condition


Copy link to this message
-
Re: Help with Join involving Non-Equality condition
Bertrand Dechoux 2012-08-17, 05:52
What are the data volume? And what are the meaning of those data?

>From what I can see, you have a 'pack' per day. If that's true, a map join
could be used because you should not have that many pack creation (But I am
not sure how to enforce that.)
I so filtering could happen right after. You would indeed generate lots of
tuple but they wouldn't be transported over the network nor written to disk.

Even better if you really have (at least) a pack per day then you only need
to group each request with three pack creation : the day before, the
current day and the day after.

Regards

Bertrand

On Fri, Aug 17, 2012 at 1:27 AM, Navis류승우 <[EMAIL PROTECTED]> wrote:

> If you don't specify join condition, hive performs cross join.
>
> What is added to hive 0.10.0 is just a clarifying grammar.
>
>
> 2012/8/17 Himanish Kushary <[EMAIL PROTECTED]>
>
>> We are on Hive 0.8 , I think cross join is available only since 0.10.0
>>
>> Do we have any other options ?
>>
>> On Thu, Aug 16, 2012 at 2:28 PM, Ablimit Aji <[EMAIL PROTECTED]> wrote:
>> > You can do a CROSS JOIN, then filter with the original inequality join
>> > condition.
>> > This would generate a lot of redundant tuples and may not work if you
>> have
>> > large amounts of data.
>> >
>> > On Thu, Aug 16, 2012 at 2:07 PM, Himanish Kushary <[EMAIL PROTECTED]>
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> We have two tables in the following structure :
>> >>
>> >> Table1 :
>> >>
>> >> |  id   |        packcreatetime         |   packid |
>> >> ----------------------------------------------------------------------
>> >> | 505  |        2012-07-16 11:51:12     | 111024   |
>> >> | 505  |        2012-07-18 11:52:13     | 111025   |
>> >> | 505  |        2012-07-19 11:53:14   | 111026   |
>> >> | 504  |      2012-07-17  23:50:13  |  101020  |
>> >>
>> ------------------------------------------------------------------------
>> >>
>> >> Table-2
>> >>
>> >> | id   |   requesttime
>> >> ----------------------------------------
>> >> | 505 | 2012-07-18 12:09:47
>> >> | 505 | 2012-07-19 12:09:59
>> >> | 505 | 2012-07-19 12:09:56
>> >> | 505 | 2012-07-17 12:06:40
>> >> | 505 | 2012-07-17 12:06:40
>> >> | 505 | 2012-07-17 12:09:15
>> >> | 504 | 2012-07-18 00:03:18
>> >> | 504 | 2012-07-18 00:15:41
>> >>
>> >> We want to find out the packid from Table1 where the  is corresponding
>> >> in Table2 and the requesttime(in Table2) is between the
>> >> packcreatetime of two relevant records(in Table1)
>> >>
>> >> So for the above example the final output will be:
>> >>
>> >> | id   |   requesttime            |   packid
>> >> -------------------------------------------------------
>> >> | 505 | 2012-07-18 12:09:47 |  111025
>> >> | 505 | 2012-07-19 12:09:59 |  111026
>> >> | 505 | 2012-07-19 12:09:56 |  111026
>> >> | 505 | 2012-07-17 12:06:40 |  111024
>> >> | 505 | 2012-07-17 12:06:40 |  111024
>> >> | 505 | 2012-07-17 12:09:15 |  111024
>> >> | 504 | 2012-07-18 00:03:18 |  101020
>> >> | 504 | 2012-07-18 00:15:41 |  101020
>> >>
>> >>
>> >> As we cannot use >= , <= in Hive joins the between logic cannot be
>> >> implemented in joins, is there any way to accomplish this or do we
>> >> need to write custom M/R code for this.Looking forward for any
>> >> suggestions to accomplish this.
>> >>
>> >> --
>> >> Thanks & Regards
>> >> Himanish
>> >
>> >
>>
>>
>>
>> --
>> Thanks & Regards
>> Himanish
>>
>
>
--
Bertrand Dechoux