|
|
John Omernik 2013-02-12, 03:34
I am trying to do a union, group by, and multi insert all at once. I know this convoluted but I what I am trying to do is avoid having to scan through the original table more than once... if I can get all my data from two columns that I want to pull together, in one round of mappers, I win... Basically, col2 and col3 are they type of data, one is src, one is dst, I want a single record for every uniq value of both col2 and col3.
Any thoughts?
from ( select col1, col2, col3, col4 from table ) a INSERT overwrite TABLE table_summary select col1, unioned_col, count(1), count(distinct col4) from (select col1, col2 as unioned_col, col4 UNION ALL select col1, col3 as unioned_col, col4 ) b group by col1, unioned_col
+
John Omernik 2013-02-12, 03:34
-
Re: Union in Multi Insert
Mark Grover 2013-02-12, 05:08
John, Please correct me if I didn't understand the problem correctly.
I think in this scenario, it's best to think about the query in terms of MapReduce. In this case, you would want for each record sent as input to your mapper, two records to be emitted, one with col2's value and one with col3's value. Then, if you did whatever count/distinct magic you wanted to do in the reduce phase, you would have read the table only once.
With the above in mind, if I were doing this, I would consider using Hive's transform functionality to use a custom mapper.
Also, FWIW, this part seems unnecessary: from ( select col1, col2, col3, col4 from table ) a
I think you might just be able to do (this is from top of my head, no guarantees):
from table insert overwrite... ....
And, if I misunderstood your problem, my apologies. If you could provide an example with sample data and expected output, that might be helpful. Mark
On Mon, Feb 11, 2013 at 7:34 PM, John Omernik <[EMAIL PROTECTED]> wrote:
> I am trying to do a union, group by, and multi insert all at once. I know > this convoluted but I what I am trying to do is avoid having to scan > through the original table more than once... if I can get all my data from > two columns that I want to pull together, in one round of mappers, I win... > Basically, col2 and col3 are they type of data, one is src, one is dst, I > want a single record for every uniq value of both col2 and col3. > > Any thoughts? > > > > from ( > select > col1, col2, col3, col4 > from table > ) a > INSERT overwrite TABLE table_summary > select col1, unioned_col, count(1), count(distinct col4) from > (select col1, col2 as unioned_col, col4 > UNION ALL > select col1, col3 as unioned_col, col4 > ) b > group by col1, unioned_col >
+
Mark Grover 2013-02-12, 05:08
-
Re: Union in Multi Insert
John Omernik 2013-02-12, 14:59
You are correct on the what I am hoping to do, basically emit two records for every row. What was interesting was when I just did the union in the from, it didn't see to do a double table scan. I ended up doing:
INSERT OVERWRITE TABLE table_summary select col1, unioned_col, count(distinct col4) from ( select col1, col2 as unioned_col, col4 from table UNION ALL select col1, col3 as unioned_col, col4 from table ) a group by col1, unioned_col
When I ran that, it did not run two sets of maps (as I expected it would, one for each part of the union) is there something that Hive is doing under the covers to optimize that for me?
I didn't think to use the transform, but you are right, that would be a very simple transform in any language.
Thoughts on my resultant query? perhaps I am just not getting all the data I should be?
On Mon, Feb 11, 2013 at 11:08 PM, Mark Grover <[EMAIL PROTECTED]>wrote:
> John, > Please correct me if I didn't understand the problem correctly. > > I think in this scenario, it's best to think about the query in terms of > MapReduce. In this case, you would want for each record sent as input to > your mapper, two records to be emitted, one with col2's value and one with > col3's value. Then, if you did whatever count/distinct magic you wanted to > do in the reduce phase, you would have read the table only once. > > With the above in mind, if I were doing this, I would consider using > Hive's transform functionality to use a custom mapper. > > Also, FWIW, this part seems unnecessary: > from ( > select > col1, col2, col3, col4 > from table > ) a > > I think you might just be able to do (this is from top of my head, no > guarantees): > > from table > insert overwrite... > .... > > And, if I misunderstood your problem, my apologies. If you could provide > an example with sample data and expected output, that might be helpful. > > > Mark > > On Mon, Feb 11, 2013 at 7:34 PM, John Omernik <[EMAIL PROTECTED]> wrote: > >> I am trying to do a union, group by, and multi insert all at once. I know >> this convoluted but I what I am trying to do is avoid having to scan >> through the original table more than once... if I can get all my data from >> two columns that I want to pull together, in one round of mappers, I win... >> Basically, col2 and col3 are they type of data, one is src, one is dst, I >> want a single record for every uniq value of both col2 and col3. >> >> Any thoughts? >> >> >> >> from ( >> select >> col1, col2, col3, col4 >> from table >> ) a >> INSERT overwrite TABLE table_summary >> select col1, unioned_col, count(1), count(distinct col4) from >> (select col1, col2 as unioned_col, col4 >> UNION ALL >> select col1, col3 as unioned_col, col4 >> ) b >> group by col1, unioned_col >> > >
+
John Omernik 2013-02-12, 14:59
|
|