Hive >> mail # user >> Union in Multi Insert


John Omernik 2013-02-12, 03:34
Mark Grover 2013-02-12, 05:08
Re: Union in Multi Insert
You are correct about what I am hoping to do: basically, emit two records
for every row.  What was interesting was that when I just did the union in the
from clause, it didn't seem to do a double table scan. I ended up doing:

INSERT OVERWRITE TABLE table_summary
select col1, unioned_col, count(distinct col4) from
( select col1, col2 as unioned_col, col4 from table
UNION ALL
select col1, col3 as unioned_col, col4 from table
) a
group by col1, unioned_col

When I ran that, it did not run two sets of maps (as I expected it would,
one for each part of the union). Is there something that Hive is doing under
the covers to optimize that for me?

I didn't think to use the transform, but you are right, that would be a
very simple transform in any language.
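As a rough illustration of how simple that transform script could be, here is a minimal Python sketch. It assumes the tab-separated input layout Hive's TRANSFORM clause streams to a script by default, with the columns arriving in the order col1, col2, col3, col4 (the column names are just the placeholders used in this thread):

```python
import sys

def emit_pairs(line):
    """Split one tab-separated input row and return the two output rows:
    one keyed on col2 and one keyed on col3, each carrying col1 and col4."""
    col1, col2, col3, col4 = line.rstrip("\n").split("\t")
    return [(col1, col2, col4), (col1, col3, col4)]

if __name__ == "__main__":
    # Hive streams one tab-separated row per line on stdin.
    for line in sys.stdin:
        for row in emit_pairs(line):
            print("\t".join(row))
```

You would then wire it in with something along the lines of `SELECT TRANSFORM(col1, col2, col3, col4) USING 'mapper.py' AS (col1, unioned_col, col4) FROM table` and do the group/count in an outer query — that clause is written from memory, so check it against the Hive transform docs before relying on it.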

Thoughts on my resultant query? Perhaps I am just not getting all the data
I should be.

On Mon, Feb 11, 2013 at 11:08 PM, Mark Grover
<[EMAIL PROTECTED]> wrote:

> John,
> Please correct me if I didn't understand the problem correctly.
>
> I think in this scenario, it's best to think about the query in terms of
> MapReduce. In this case, you would want for each record sent as input to
> your mapper, two records to be emitted, one with col2's value and one with
> col3's value. Then, if you did whatever count/distinct magic you wanted to
> do in the reduce phase, you would have read the table only once.
>
> With the above in mind, if I were doing this, I would consider using
> Hive's transform functionality to use a custom mapper.
>
> Also, FWIW, this part seems unnecessary:
>  from (
>     select
>        col1, col2, col3, col4
>     from table
> ) a
>
> I think you might just be able to do (this is off the top of my head, no
> guarantees):
>
> from table
> insert overwrite...
> ....
>
> And, if I misunderstood your problem, my apologies. If you could provide
> an example with sample data and expected output, that might be helpful.
>
>
> Mark
>
> On Mon, Feb 11, 2013 at 7:34 PM, John Omernik <[EMAIL PROTECTED]> wrote:
>
>> I am trying to do a union, group by, and multi insert all at once. I know
>> this is convoluted, but what I am trying to do is avoid having to scan
>> through the original table more than once... if I can get all my data from
>> two columns that I want to pull together, in one round of mappers, I win...
>>  Basically, col2 and col3 are the same type of data, one is src, one is dst; I
>> want a single record for every unique value of both col2 and col3.
>>
>> Any thoughts?
>>
>>
>>
>> from (
>>     select
>>        col1, col2, col3, col4
>>     from table
>> ) a
>> INSERT overwrite TABLE table_summary
>> select col1, unioned_col, count(1), count(distinct col4) from
>> (select col1, col2 as unioned_col, col4
>> UNION ALL
>> select col1, col3 as unioned_col, col4
>> ) b
>> group by col1, unioned_col
>>
>
>