-Re: Dealing with duplicate rows in Hive
Nitin Pawar 2013-10-02, 06:21
So you have 50 columns and out of them you want to use 9 columns for
finding unique rows?
am i correct in assuming that you want to make a key of combination of
these 9 columns so that you have just one row for a single combination of
these 9 columns ?
On Wed, Oct 2, 2013 at 6:07 AM, Philo Wang <[EMAIL PROTECTED]> wrote:
> I am using Hive 8.1.8 in EMR.
> We have an extremely large table (~50 columns) where the uniqueness key is
> a combination of 9 different columns. I want to filter out any duplicate
> rows based on these 9 columns while retaining the ability to select other
> columns on an ad hoc basis. I don’t expect rows with the same uniqueness
> key to have different data, so I guess this can be generalized to just
> filtering out duplicate rows.
> My initial instinct was to do a “select distinct *” on the table and save
> the results into another table, but it appears that Hive does not support
> “distinct *”. Furthermore, Hive will apply distinct to every column in the
> select statement, so something like “select distinct(a), b” does not work
> The only option I could think of from here was to explicitly state all
> columns of the table inside the distinct statement, but this seems
> unnecessarily messy (again, the table contains more than 50 columns).
> Has anyone ran into a similar issue? Any insight would be appreciated.