Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Dealing with duplicate rows in Hive


Copy link to this message
-
Re: Dealing with duplicate rows in Hive
Yes doing group by or distinct on 50 columns is ugly.
One option (ugly as well) is first select only these 9 columns and then do
a select * with join of the first .

something like
(select distinct cols from table) a join (select * from table b) on (a.col
= b.col)
I am really not sure this works but looks like a hack to make it work.

I will try to this in sometime and see if it works.
On Wed, Oct 2, 2013 at 1:28 PM, Philo Wang <[EMAIL PROTECTED]> wrote:

> Thanks for the suggestion! Unfortunately, if you use group by in a query
> all columns in the select statement must also appear in the group by. I can
> always select distinct on all 50 columns (or group by all 50 columns), but
> that sounds very extreme and I feel that there has to be a better solution
> out there. Something like a "select distinct *" seems like it would work
> here but according to https://issues.apache.org/jira/browse/HIVE-3199 that
> is not yet supported in Hive.
>
>
> On Wed, Oct 2, 2013 at 12:13 AM, Nitin Pawar <[EMAIL PROTECTED]>wrote:
>
>> may be you want to try group by
>>
>> in hive select distinct col1, col2, col3 works but if you want to select
>> all 50 columns its tricky.
>>
>> Other option would be group by all those 9 keys and it should take care
>> that you have combination of those 9 columns is unique.
>>
>>
>> On Wed, Oct 2, 2013 at 12:34 PM, Philo Wang <[EMAIL PROTECTED]> wrote:
>>
>>> Yes, that is correct.
>>>
>>>
>>> On Tue, Oct 1, 2013 at 11:21 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote:
>>>
>>>> So you have 50 columns and out of them you want to use 9 columns for
>>>> finding unique rows?
>>>>
>>>> am i correct in assuming that you want to make a key of combination of
>>>> these 9 columns so that you have just one row for a single combination of
>>>> these 9 columns ?
>>>>
>>>>
>>>> On Wed, Oct 2, 2013 at 6:07 AM, Philo Wang <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am using Hive 8.1.8 in EMR.
>>>>>
>>>>> We have an extremely large table (~50 columns) where the uniqueness
>>>>> key is a combination of 9 different columns. I want to filter out any
>>>>> duplicate rows based on these 9 columns while retaining the ability to
>>>>> select other columns on an ad hoc basis. I don’t expect rows with the same
>>>>> uniqueness key to have different data, so I guess this can be generalized
>>>>> to just filtering out duplicate rows.
>>>>>
>>>>> My initial instinct was to do a “select distinct *” on the table and
>>>>> save the results into another table, but it appears that Hive does not
>>>>> support “distinct *”. Furthermore, Hive will apply distinct to every column
>>>>> in the select statement, so something like “select distinct(a), b” does not
>>>>> work either.
>>>>>
>>>>> The only option I could think of from here was to explicitly state all
>>>>> columns of the table inside the distinct statement, but this seems
>>>>> unnecessarily messy (again, the table contains more than 50 columns).
>>>>>
>>>>> Has anyone ran into a similar issue? Any insight would be appreciated.
>>>>>
>>>>> Thanks,
>>>>> Philo
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>
--
Nitin Pawar
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB