Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive, mail # user - Identifying and Marking records as duplicates


+
Himanish Kushary 2012-08-17, 14:49
+
Bob Gause 2012-08-17, 15:46
+
Bertrand Dechoux 2012-08-17, 15:54
+
Himanish Kushary 2012-08-17, 16:05
Copy link to this message
-
Re: Identifying and Marking records as duplicates
Navis류승우 2012-08-20, 07:04
We've solved similar cases by using UDAF + UDTF, something like..

select assign(unique(a, b, c, d)) as (a, b, c, d), e, f from A group by a,
b, c, d;

unique(UDAF) returns struct and assign(UDTF) reassigns names for the
struct, where you can append a column for marking.

HIVE-2608 may make this shorter.

2012/8/18 Himanish Kushary <[EMAIL PROTECTED]>

> We want to separate them out into different tables for auditing purposes,
> so i was thinking of marking them and then
> separating them out based on the "duplicate" flag.
>
> Is there a better way of splitting them into 2 tables ? Also how would I
> pick up the first from the group and put into a different table
> once I have run the cross join on fld1 thru fld4 with order by timestamp?
>
> - Himanish
>
> On Fri, Aug 17, 2012 at 11:54 AM, Bertrand Dechoux <[EMAIL PROTECTED]>wrote:
>
>> If I understand correctly :
>> A cross join on fld1 thru fld4 with order by timestamp and take the first
>> of each group.
>> would allow you to have all the unique and original copies.
>>
>> But indeed if you don't have something like HBase, you can not mark them
>> really but you can select them and write them elsewhere.
>> Why do you specifically want to mark them?
>>
>> Bertrand
>>
>>
>> On Fri, Aug 17, 2012 at 5:46 PM, Bob Gause <[EMAIL PROTECTED]> wrote:
>>
>>> We use com.facebook.hive.udf.UDFNumberRows to do a ranking by time in
>>> some of our queries. You could do that, and then do another select where
>>> the row number/rank is 1 to get all the "unique" rows.
>>>
>>> There are probably a bunch of other ways to do this, but this is the one
>>> that first came to mind for me….
>>>
>>> Enjoy!
>>> Bob
>>>
>>> Robert Gause
>>> Senior Systems Engineer
>>> ZyQuest, Inc.
>>> [EMAIL PROTECTED]
>>>
>>> On Aug 17, 2012, at 9:49 AM, Himanish Kushary wrote:
>>>
>>> > Hi,
>>> >
>>> > We have a huge table which may have duplicate records.A record is
>>> considered duplicate based on 4 fields ( fld1 thru fld4) . We need to
>>> identify the duplicate records and possibly mark the duplicates(except the
>>> first record based on created time for a record).
>>> >
>>> > Is this something that could be done by hive or we need to write
>>> custom M/R for this.Could a inner join or a select with group by be used to
>>> find the duplicates ? How do I mark the duplicate records as there is no
>>> update.
>>> >
>>> > Whats the best way to do this using Hive ? Looking forward to hear the
>>> suggestions.
>>> >
>>> > Thanks
>>>
>>>
>>
>>
>> --
>> Bertrand Dechoux
>>
>
>
>
> --
> Thanks & Regards
> Himanish
>
+
Himanish Kushary 2012-08-21, 17:29