Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Dealing with duplicate rows in Hive


Copy link to this message
-
Re: Dealing with duplicate rows in Hive
Yes doing group by or distinct on 50 columns is ugly.
One option (ugly as well) is first select only these 9 columns and then do
a select * with join of the first .

something like
(select distinct cols from table) a join (select * from table b) on (a.col
= b.col)
I am really not sure this works but looks like a hack to make it work.

I will try to this in sometime and see if it works.
On Wed, Oct 2, 2013 at 1:28 PM, Philo Wang <[EMAIL PROTECTED]> wrote:

> Thanks for the suggestion! Unfortunately, if you use group by in a query
> all columns in the select statement must also appear in the group by. I can
> always select distinct on all 50 columns (or group by all 50 columns), but
> that sounds very extreme and I feel that there has to be a better solution
> out there. Something like a "select distinct *" seems like it would work
> here but according to https://issues.apache.org/jira/browse/HIVE-3199 that
> is not yet supported in Hive.
>
>
> On Wed, Oct 2, 2013 at 12:13 AM, Nitin Pawar <[EMAIL PROTECTED]>wrote:
>
>> may be you want to try group by
>>
>> in hive select distinct col1, col2, col3 works but if you want to select
>> all 50 columns its tricky.
>>
>> Other option would be group by all those 9 keys and it should take care
>> that you have combination of those 9 columns is unique.
>>
>>
>> On Wed, Oct 2, 2013 at 12:34 PM, Philo Wang <[EMAIL PROTECTED]> wrote:
>>
>>> Yes, that is correct.
>>>
>>>
>>> On Tue, Oct 1, 2013 at 11:21 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote:
>>>
>>>> So you have 50 columns and out of them you want to use 9 columns for
>>>> finding unique rows?
>>>>
>>>> am i correct in assuming that you want to make a key of combination of
>>>> these 9 columns so that you have just one row for a single combination of
>>>> these 9 columns ?
>>>>
>>>>
>>>> On Wed, Oct 2, 2013 at 6:07 AM, Philo Wang <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am using Hive 8.1.8 in EMR.
>>>>>
>>>>> We have an extremely large table (~50 columns) where the uniqueness
>>>>> key is a combination of 9 different columns. I want to filter out any
>>>>> duplicate rows based on these 9 columns while retaining the ability to
>>>>> select other columns on an ad hoc basis. I don’t expect rows with the same
>>>>> uniqueness key to have different data, so I guess this can be generalized
>>>>> to just filtering out duplicate rows.
>>>>>
>>>>> My initial instinct was to do a “select distinct *” on the table and
>>>>> save the results into another table, but it appears that Hive does not
>>>>> support “distinct *”. Furthermore, Hive will apply distinct to every column
>>>>> in the select statement, so something like “select distinct(a), b” does not
>>>>> work either.
>>>>>
>>>>> The only option I could think of from here was to explicitly state all
>>>>> columns of the table inside the distinct statement, but this seems
>>>>> unnecessarily messy (again, the table contains more than 50 columns).
>>>>>
>>>>> Has anyone ran into a similar issue? Any insight would be appreciated.
>>>>>
>>>>> Thanks,
>>>>> Philo
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>
--
Nitin Pawar