-Re: Dealing with duplicate rows in Hive
Nitin Pawar 2013-10-02, 07:13
may be you want to try group by
in hive select distinct col1, col2, col3 works but if you want to select
all 50 columns its tricky.
Other option would be group by all those 9 keys and it should take care
that you have combination of those 9 columns is unique.
On Wed, Oct 2, 2013 at 12:34 PM, Philo Wang <[EMAIL PROTECTED]> wrote:
> Yes, that is correct.
> On Tue, Oct 1, 2013 at 11:21 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote:
>> So you have 50 columns and out of them you want to use 9 columns for
>> finding unique rows?
>> am i correct in assuming that you want to make a key of combination of
>> these 9 columns so that you have just one row for a single combination of
>> these 9 columns ?
>> On Wed, Oct 2, 2013 at 6:07 AM, Philo Wang <[EMAIL PROTECTED]> wrote:
>>> I am using Hive 8.1.8 in EMR.
>>> We have an extremely large table (~50 columns) where the uniqueness key
>>> is a combination of 9 different columns. I want to filter out any duplicate
>>> rows based on these 9 columns while retaining the ability to select other
>>> columns on an ad hoc basis. I don’t expect rows with the same uniqueness
>>> key to have different data, so I guess this can be generalized to just
>>> filtering out duplicate rows.
>>> My initial instinct was to do a “select distinct *” on the table and
>>> save the results into another table, but it appears that Hive does not
>>> support “distinct *”. Furthermore, Hive will apply distinct to every column
>>> in the select statement, so something like “select distinct(a), b” does not
>>> work either.
>>> The only option I could think of from here was to explicitly state all
>>> columns of the table inside the distinct statement, but this seems
>>> unnecessarily messy (again, the table contains more than 50 columns).
>>> Has anyone ran into a similar issue? Any insight would be appreciated.
>> Nitin Pawar