Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Dealing with duplicate rows in Hive


Copy link to this message
-
Re: Dealing with duplicate rows in Hive
So you have 50 columns and out of them you want to use 9 columns for
finding unique rows?

am i correct in assuming that you want to make a key of combination of
these 9 columns so that you have just one row for a single combination of
these 9 columns ?
On Wed, Oct 2, 2013 at 6:07 AM, Philo Wang <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I am using Hive 8.1.8 in EMR.
>
> We have an extremely large table (~50 columns) where the uniqueness key is
> a combination of 9 different columns. I want to filter out any duplicate
> rows based on these 9 columns while retaining the ability to select other
> columns on an ad hoc basis. I don’t expect rows with the same uniqueness
> key to have different data, so I guess this can be generalized to just
> filtering out duplicate rows.
>
> My initial instinct was to do a “select distinct *” on the table and save
> the results into another table, but it appears that Hive does not support
> “distinct *”. Furthermore, Hive will apply distinct to every column in the
> select statement, so something like “select distinct(a), b” does not work
> either.
>
> The only option I could think of from here was to explicitly state all
> columns of the table inside the distinct statement, but this seems
> unnecessarily messy (again, the table contains more than 50 columns).
>
> Has anyone ran into a similar issue? Any insight would be appreciated.
>
> Thanks,
> Philo
>
>
--
Nitin Pawar