-Re: Identifying and Marking records as duplicates
Navis류승우 2012-08-20, 07:04
We've solved similar cases by using UDAF + UDTF, something like..
select assign(unique(a, b, c, d)) as (a, b, c, d), e, f from A group by a,
b, c, d;
unique(UDAF) returns struct and assign(UDTF) reassigns names for the
struct, where you can append a column for marking.
HIVE-2608 may make this shorter.
2012/8/18 Himanish Kushary <[EMAIL PROTECTED]>
> We want to separate them out into different tables for auditing purposes,
> so i was thinking of marking them and then
> separating them out based on the "duplicate" flag.
> Is there a better way of splitting them into 2 tables ? Also how would I
> pick up the first from the group and put into a different table
> once I have run the cross join on fld1 thru fld4 with order by timestamp?
> - Himanish
> On Fri, Aug 17, 2012 at 11:54 AM, Bertrand Dechoux <[EMAIL PROTECTED]>wrote:
>> If I understand correctly :
>> A cross join on fld1 thru fld4 with order by timestamp and take the first
>> of each group.
>> would allow you to have all the unique and original copies.
>> But indeed if you don't have something like HBase, you can not mark them
>> really but you can select them and write them elsewhere.
>> Why do you specifically want to mark them?
>> On Fri, Aug 17, 2012 at 5:46 PM, Bob Gause <[EMAIL PROTECTED]> wrote:
>>> We use com.facebook.hive.udf.UDFNumberRows to do a ranking by time in
>>> some of our queries. You could do that, and then do another select where
>>> the row number/rank is 1 to get all the "unique" rows.
>>> There are probably a bunch of other ways to do this, but this is the one
>>> that first came to mind for me….
>>> Robert Gause
>>> Senior Systems Engineer
>>> ZyQuest, Inc.
>>> [EMAIL PROTECTED]
>>> On Aug 17, 2012, at 9:49 AM, Himanish Kushary wrote:
>>> > Hi,
>>> > We have a huge table which may have duplicate records.A record is
>>> considered duplicate based on 4 fields ( fld1 thru fld4) . We need to
>>> identify the duplicate records and possibly mark the duplicates(except the
>>> first record based on created time for a record).
>>> > Is this something that could be done by hive or we need to write
>>> custom M/R for this.Could a inner join or a select with group by be used to
>>> find the duplicates ? How do I mark the duplicate records as there is no
>>> > Whats the best way to do this using Hive ? Looking forward to hear the
>>> > Thanks
>> Bertrand Dechoux
> Thanks & Regards