|
|
-
Re: count of distinct FROM multiple columns
Jan Dolinár 2012-06-22, 12:52
Hi
A quick solution that comes first to my mind is to join the columns you want to combine into an array and then use the explode UDTF:
SELECT col1, COUNT(distinct combined) FROM dummy LATERAL VIEW explode(array(col2, col3)) t AS combined GROUP BY col1;
Although I believe there might be simpler and/or better solutions.
Jan On Fri, Jun 22, 2012 at 2:32 PM, MIS <[EMAIL PROTECTED]> wrote:
> Hi All, > > I have a table in Hive as below: > > dummy { > col1 STRING, > col2 INT, > col3 INT > } > > And in that there is some sample data as : > > *col1 col2 col3 * > ABC 4 5 > XYZ 1 2 > ABC 1 3 > ABC 5 1 > XYZ 3 1 > > What should be my query so as to get the below result: > > *ABC 4 > XYZ 3* > > Basically I'm trying to get a count of distinct elements from *col2 and > col3 combined* and group them against col1. > > I tried with a query as: > > select col1, count(distinct col2, col3) from dummy group by col1 > > But didn't get the expected output. > > Can anybody point me in the correct direction and suggest a correct query. > > Thanks. >
+
Jan Dolinár 2012-06-22, 12:52
-
Re: count of distinct FROM multiple columns
Mark Grover 2012-06-22, 17:29
A solution that comes to my mind is to use a union. Something like (untested):
select first, count(distinct second) from (select col1 as first, col2 as second from dummy union all select col1 as first, col3 as second from dummy )t group by first;
Mark ----- Original Message ----- From: "Jan Dolinár" <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Sent: Friday, June 22, 2012 8:52:43 AM Subject: Re: count of distinct FROM multiple columns Hi A quick solution that comes first to my mind is to join the columns you want to combine into an array and then use the explode UDTF:
SELECT col1, COUNT(distinct combined) FROM dummy LATERAL VIEW explode(array(col2, col3)) t AS combined GROUP BY col1;
Although I believe there might be simpler and/or better solutions. Jan
On Fri, Jun 22, 2012 at 2:32 PM, MIS < [EMAIL PROTECTED] > wrote: Hi All,
I have a table in Hive as below:
dummy { col1 STRING, col2 INT, col3 INT }
And in that there is some sample data as :
col1 col2 col3 ABC 4 5 XYZ 1 2 ABC 1 3 ABC 5 1 XYZ 3 1
What should be my query so as to get the below result:
ABC 4 XYZ 3
Basically I'm trying to get a count of distinct elements from col2 and col3 combined and group them against col1.
I tried with a query as:
select col1, count(distinct col2, col3) from dummy group by col1
But didn't get the expected output.
Can anybody point me in the correct direction and suggest a correct query.
Thanks.
+
Mark Grover 2012-06-22, 17:29
-
Re: count of distinct FROM multiple columns
Edward Capriolo 2012-06-22, 17:42
I think your syntax is wrong. Hive should support multi-column distinct and at that point counting should work.
You did: select col1, count(distinct col2, col3) from dummy group by col1
I think the correct syntax is: select col1, count(distinct (col2, col3)) from dummy group by col1
On Fri, Jun 22, 2012 at 1:29 PM, Mark Grover <[EMAIL PROTECTED]> wrote: > A solution that comes to my mind is to use a union. Something like (untested): > > select > first, > count(distinct second) > from > (select > col1 as first, > col2 as second > from > dummy > union all > select > col1 as first, > col3 as second > from > dummy > )t > group by > first; > > Mark > ----- Original Message ----- > From: "Jan Dolinár" <[EMAIL PROTECTED]> > To: [EMAIL PROTECTED] > Sent: Friday, June 22, 2012 8:52:43 AM > Subject: Re: count of distinct FROM multiple columns > > > Hi > > > A quick solution that comes first to my mind is to join the columns you want to combine into an array and then use the explode UDTF: > > SELECT col1, COUNT(distinct combined) FROM dummy LATERAL VIEW explode(array(col2, col3)) t AS combined GROUP BY col1; > > Although I believe there might be simpler and/or better solutions. > > > Jan > > > > On Fri, Jun 22, 2012 at 2:32 PM, MIS < [EMAIL PROTECTED] > wrote: > > > Hi All, > > I have a table in Hive as below: > > dummy { > col1 STRING, > col2 INT, > col3 INT > } > > And in that there is some sample data as : > > col1 col2 col3 > ABC 4 5 > XYZ 1 2 > ABC 1 3 > ABC 5 1 > XYZ 3 1 > > What should be my query so as to get the below result: > > ABC 4 > XYZ 3 > > Basically I'm trying to get a count of distinct elements from col2 and col3 combined and group them against col1. > > I tried with a query as: > > select col1, count(distinct col2, col3) from dummy group by col1 > > But didn't get the expected output. > > Can anybody point me in the correct direction and suggest a correct query. > > Thanks. >
+
Edward Capriolo 2012-06-22, 17:42
-
Re: count of distinct FROM multiple columns
MIS 2012-06-23, 07:34
Thanks for your replies guys. Of the three solutions, the one from Jan worked. The other two solutions failed because of parse errors.
@Edward, I don't think the syntax you suggested exits. Yes Hive supports Multi Column distinct, But it is actually distinct *ON*multiple columns and i think not *FROM* multiple columns.
If anybody wants to have a go at it, below are the table creation steps for them: {Just changed the data types of Column2 and Column3 to be String instead of int}
*CREATE TABLE dummy(column1 STRING, column2 STRING, column3 STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES("input.regex" = "([^ ]*) ([^ ]*) ([^ ]*)", "output.format.string" = "%1$s %2$s %3$s") STORED AS TEXTFILE;*
have the above mentioned data in the dummy.txt file and then load it with:
*LOAD DATA LOCAL INPATH '/tmp/dummy.txt.gz' OVERWRITE INTO TABLE dummy*
Since the actual query will run against roughly 7 million rows, want to know the best possible query for the same.
Thanks.
On Fri, Jun 22, 2012 at 11:12 PM, Edward Capriolo <[EMAIL PROTECTED]>wrote:
> I think your syntax is wrong. Hive should support multi-column > distinct and at that point counting should work. > > You did: > select col1, count(distinct col2, col3) from dummy group by col1 > > I think the correct syntax is: > select col1, count(distinct (col2, col3)) from dummy group by col1 > > On Fri, Jun 22, 2012 at 1:29 PM, Mark Grover <[EMAIL PROTECTED]> wrote: > > A solution that comes to my mind is to use a union. Something like > (untested): > > > > select > > first, > > count(distinct second) > > from > > (select > > col1 as first, > > col2 as second > > from > > dummy > > union all > > select > > col1 as first, > > col3 as second > > from > > dummy > > )t > > group by > > first; > > > > Mark > > ----- Original Message ----- > > From: "Jan Dolinár" <[EMAIL PROTECTED]> > > To: [EMAIL PROTECTED] > > Sent: Friday, June 22, 2012 8:52:43 AM > > Subject: Re: count of distinct FROM multiple columns > > > > > > Hi > > > > > > A quick solution that comes first to my mind is to join the columns you > want to combine into an array and then use the explode UDTF: > > > > SELECT col1, COUNT(distinct combined) FROM dummy LATERAL VIEW > explode(array(col2, col3)) t AS combined GROUP BY col1; > > > > Although I believe there might be simpler and/or better solutions. > > > > > > Jan > > > > > > > > On Fri, Jun 22, 2012 at 2:32 PM, MIS < [EMAIL PROTECTED] > wrote: > > > > > > Hi All, > > > > I have a table in Hive as below: > > > > dummy { > > col1 STRING, > > col2 INT, > > col3 INT > > } > > > > And in that there is some sample data as : > > > > col1 col2 col3 > > ABC 4 5 > > XYZ 1 2 > > ABC 1 3 > > ABC 5 1 > > XYZ 3 1 > > > > What should be my query so as to get the below result: > > > > ABC 4 > > XYZ 3 > > > > Basically I'm trying to get a count of distinct elements from col2 and > col3 combined and group them against col1. > > > > I tried with a query as: > > > > select col1, count(distinct col2, col3) from dummy group by col1 > > > > But didn't get the expected output. > > > > Can anybody point me in the correct direction and suggest a correct > query. > > > > Thanks. > > >
|
|