Hive >> mail # user >> Dealing with duplicate rows in Hive


Re: Dealing with duplicate rows in Hive
So you have 50 columns and out of them you want to use 9 columns for
finding unique rows?

Am I correct in assuming that you want to build a key from the combination
of these 9 columns, so that you keep just one row for each combination of
these 9 columns?
On Wed, Oct 2, 2013 at 6:07 AM, Philo Wang <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I am using Hive 0.8.1.8 in EMR.
>
> We have an extremely large table (~50 columns) where the uniqueness key is
> a combination of 9 different columns. I want to filter out any duplicate
> rows based on these 9 columns while retaining the ability to select other
> columns on an ad hoc basis. I don’t expect rows with the same uniqueness
> key to have different data, so I guess this can be generalized to just
> filtering out duplicate rows.
>
> My initial instinct was to do a “select distinct *” on the table and save
> the results into another table, but it appears that Hive does not support
> “distinct *”. Furthermore, Hive will apply distinct to every column in the
> select statement, so something like “select distinct(a), b” does not work
> either.
>
> The only option I could think of from here was to explicitly state all
> columns of the table inside the distinct statement, but this seems
> unnecessarily messy (again, the table contains more than 50 columns).
>
> Has anyone run into a similar issue? Any insight would be appreciated.
>
> Thanks,
> Philo
>
>
--
Nitin Pawar
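
One possible way to get a single row per key combination without "distinct *" is to GROUP BY the key columns and wrap every other column in an aggregate such as MAX(). This relies on the assumption stated in the question that rows sharing a key carry identical data, so the aggregate only collapses duplicates. The sketch below is not from the thread: the helper build_dedup_query, the table name events, and the columns k1, k2, v1, v2 are hypothetical, and Python's sqlite3 module is used only to make the generated query runnable locally. It has not been checked against Hive on EMR.

```python
import sqlite3


def build_dedup_query(table, key_cols, other_cols):
    """Build a GROUP BY query that keeps one row per key combination.

    Hypothetical helper: generating the statement avoids typing out
    ~50 column names by hand. Non-key columns are wrapped in MAX(),
    which is safe only if duplicates carry identical data.
    """
    select_parts = list(key_cols) + [f"MAX({c}) AS {c}" for c in other_cols]
    return (
        f"SELECT {', '.join(select_parts)} "
        f"FROM {table} "
        f"GROUP BY {', '.join(key_cols)}"
    )


def demo():
    """Run the generated query on a tiny in-memory table (SQLite, not Hive)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (k1 TEXT, k2 INTEGER, v1 TEXT, v2 INTEGER)"
    )
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?, ?)",
        [
            ("a", 1, "x", 10),
            ("a", 1, "x", 10),  # duplicate of the row above
            ("a", 2, "y", 20),
            ("b", 1, "z", 30),
            ("b", 1, "z", 30),  # duplicate of the row above
        ],
    )
    query = build_dedup_query("events", ["k1", "k2"], ["v1", "v2"])
    # ORDER BY is added here only to make the demo output deterministic.
    rows = conn.execute(query + " ORDER BY k1, k2").fetchall()
    conn.close()
    return query, rows


if __name__ == "__main__":
    query, rows = demo()
    print(query)
    for row in rows:
        print(row)
```

In the demo, the five input rows collapse to three, one per (k1, k2) pair. For the real table, the two column lists would be the 9 key columns and the remaining columns, and the generated statement could feed an INSERT OVERWRITE or CREATE TABLE ... AS SELECT, whichever fits the setup.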