Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive, mail # user - Identifying and Marking records as duplicates


Copy link to this message
-
Re: Identifying and Marking records as duplicates
Bertrand Dechoux 2012-08-17, 15:54
If I understand correctly :
A cross join on fld1 thru fld4 with order by timestamp and take the first
of each group.
would allow you to have all the unique and original copies.

But indeed if you don't have something like HBase, you can not mark them
really but you can select them and write them elsewhere.
Why do you specifically want to mark them?

Bertrand

On Fri, Aug 17, 2012 at 5:46 PM, Bob Gause <[EMAIL PROTECTED]> wrote:

> We use com.facebook.hive.udf.UDFNumberRows to do a ranking by time in some
> of our queries. You could do that, and then do another select where the row
> number/rank is 1 to get all the "unique" rows.
>
> There are probably a bunch of other ways to do this, but this is the one
> that first came to mind for me….
>
> Enjoy!
> Bob
>
> Robert Gause
> Senior Systems Engineer
> ZyQuest, Inc.
> [EMAIL PROTECTED]
>
> On Aug 17, 2012, at 9:49 AM, Himanish Kushary wrote:
>
> > Hi,
> >
> > We have a huge table which may have duplicate records.A record is
> considered duplicate based on 4 fields ( fld1 thru fld4) . We need to
> identify the duplicate records and possibly mark the duplicates(except the
> first record based on created time for a record).
> >
> > Is this something that could be done by hive or we need to write custom
> M/R for this.Could a inner join or a select with group by be used to find
> the duplicates ? How do I mark the duplicate records as there is no update.
> >
> > Whats the best way to do this using Hive ? Looking forward to hear the
> suggestions.
> >
> > Thanks
>
>
--
Bertrand Dechoux