John Omernik 2013-02-12, 03:34
Please correct me if I didn't understand the problem correctly.
I think in this scenario, it's best to think about the query in terms of
MapReduce. In this case, you would want for each record sent as input to
your mapper, two records to be emitted, one with col2's value and one with
col3's value. Then, if you did whatever count/distinct magic you wanted to
do in the reduce phase, you would have read the table only once.
With the above in mind, if I were doing this, I would consider using Hive's
transform functionality to use a custom mapper.
Also, FWIW, this part seems unnecessary:
col1, col2, col3, col4
I think you might just be able to do (this is from top of my head, no
And, if I misunderstood your problem, my apologies. If you could provide an
example with sample data and expected output, that might be helpful.
On Mon, Feb 11, 2013 at 7:34 PM, John Omernik <[EMAIL PROTECTED]> wrote:
> I am trying to do a union, group by, and multi insert all at once. I know
> this convoluted but I what I am trying to do is avoid having to scan
> through the original table more than once... if I can get all my data from
> two columns that I want to pull together, in one round of mappers, I win...
> Basically, col2 and col3 are they type of data, one is src, one is dst, I
> want a single record for every uniq value of both col2 and col3.
> Any thoughts?
> from (
> col1, col2, col3, col4
> from table
> ) a
> INSERT overwrite TABLE table_summary
> select col1, unioned_col, count(1), count(distinct col4) from
> (select col1, col2 as unioned_col, col4
> UNION ALL
> select col1, col3 as unioned_col, col4
> ) b
> group by col1, unioned_col
John Omernik 2013-02-12, 14:59