John Omernik 2013-02-12, 03:34
Mark Grover 2013-02-12, 05:08
You are correct about what I am hoping to do: basically emit two records
for every row. What was interesting was that when I just did the union in the
from, it didn't seem to do a double table scan. I ended up doing:
INSERT OVERWRITE TABLE table_summary
select col1, unioned_col, count(distinct col4) from
( select col1, col2 as unioned_col, col4 from table
UNION ALL
select col1, col3 as unioned_col, col4 from table ) u
group by col1, unioned_col
When I ran that, it did not run two sets of maps (as I expected it would,
one for each part of the union). Is there something that Hive is doing under
the covers to optimize that for me?
I didn't think to use the transform, but you are right, that would be a
very simple transform in any language.
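As a minimal sketch of such a transform script in Python (the column layout here, tab-separated col1, col2, col3, col4, is an assumption; a real script would loop over sys.stdin):

```python
import sys

def emit_pairs(line):
    """For one tab-separated row (col1, col2, col3, col4), return two
    output rows so col2 and col3 both land in a single unioned column:
    (col1, col2, col4) and (col1, col3, col4)."""
    col1, col2, col3, col4 = line.rstrip("\n").split("\t")
    return ["\t".join((col1, col2, col4)),
            "\t".join((col1, col3, col4))]

# In a real Hive TRANSFORM mapper this loop would read sys.stdin;
# a canned row stands in here for illustration.
for out in emit_pairs("k1\tsrc_ip\tdst_ip\t42"):
    sys.stdout.write(out + "\n")
```

Hive would then hand the doubled stream to the reduce side for the group by, so the base table is read only once.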
Thoughts on my resulting query? Perhaps I am just not getting all the data
I should be?
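To sanity-check the shape of that union/group-by on toy data, here is a quick stand-in using Python's sqlite3 in place of Hive (table name, columns, and sample rows are all invented for illustration):

```python
import sqlite3

# Toy table standing in for the Hive table: col2 is a "src"-style
# value, col3 a "dst"-style value, col4 the thing counted distinctly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 TEXT, col2 TEXT, col3 TEXT, col4 TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", [
    ("a", "s1", "d1", "u1"),
    ("a", "s1", "d2", "u2"),
    ("a", "d1", "s1", "u1"),
])

# Same UNION ALL + GROUP BY shape as the query in the thread.
rows = conn.execute("""
    SELECT col1, unioned_col, COUNT(1), COUNT(DISTINCT col4) FROM
      (SELECT col1, col2 AS unioned_col, col4 FROM t
       UNION ALL
       SELECT col1, col3 AS unioned_col, col4 FROM t) b
    GROUP BY col1, unioned_col
    ORDER BY unioned_col
""").fetchall()
for r in rows:
    print(r)
```

Each value that appears in either col2 or col3 gets exactly one summary row, which is the behavior being asked about.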
On Mon, Feb 11, 2013 at 11:08 PM, Mark Grover
> Please correct me if I didn't understand the problem correctly.
> I think in this scenario, it's best to think about the query in terms of
> MapReduce. In this case, you would want for each record sent as input to
> your mapper, two records to be emitted, one with col2's value and one with
> col3's value. Then, if you did whatever count/distinct magic you wanted to
> do in the reduce phase, you would have read the table only once.
> With the above in mind, if I were doing this, I would consider using
> Hive's transform functionality to use a custom mapper.
> Also, FWIW, this part seems unnecessary:
> from (
> select col1, col2, col3, col4
> from table
> ) a
> I think you might just be able to do (this is off the top of my head, no
> guarantees):
> from table
> insert overwrite...
> And, if I misunderstood your problem, my apologies. If you could provide
> an example with sample data and expected output, that might be helpful.
> On Mon, Feb 11, 2013 at 7:34 PM, John Omernik <[EMAIL PROTECTED]> wrote:
>> I am trying to do a union, group by, and multi insert all at once. I know
>> this is convoluted, but what I am trying to do is avoid having to scan
>> through the original table more than once... if I can get all my data from
>> two columns that I want to pull together, in one round of mappers, I win...
>> Basically, col2 and col3 are the same type of data, one is src and one is dst; I
>> want a single record for every unique value of both col2 and col3.
>> Any thoughts?
>> from (
>> select col1, col2, col3, col4
>> from table
>> ) a
>> INSERT overwrite TABLE table_summary
>> select col1, unioned_col, count(1), count(distinct col4) from
>> (select col1, col2 as unioned_col, col4 from a
>> UNION ALL
>> select col1, col3 as unioned_col, col4 from a
>> ) b
>> group by col1, unioned_col