Hive >> mail # user >> Identifying and Marking records as duplicates


Himanish Kushary 2012-08-17, 14:49
Bob Gause 2012-08-17, 15:46
Bertrand Dechoux 2012-08-17, 15:54
Himanish Kushary 2012-08-17, 16:05
Re: Identifying and Marking records as duplicates
We've solved similar cases by using a UDAF + UDTF, something like:

select assign(unique(a, b, c, d)) as (a, b, c, d), e, f from A group by a,
b, c, d;

unique (a UDAF) returns a struct and assign (a UDTF) reassigns names for the
struct fields, where you can append a column for marking.

HIVE-2608 may make this shorter.
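For readers following along, the marking step the thread is after can also be sketched with standard windowing, which landed in Hive 0.11 and so postdates this thread. This is only an illustrative sketch: the table name `src`, columns `fld1`..`fld4`, and `created_time` are hypothetical stand-ins for the poster's schema.

```sql
-- Sketch: flag every record except the earliest one per (fld1..fld4) group.
-- Assumes Hive 0.11+ windowing support; all names are hypothetical.
SELECT fld1, fld2, fld3, fld4, created_time,
       CASE WHEN rn = 1 THEN 0 ELSE 1 END AS is_duplicate
FROM (
  SELECT fld1, fld2, fld3, fld4, created_time,
         ROW_NUMBER() OVER (PARTITION BY fld1, fld2, fld3, fld4
                            ORDER BY created_time) AS rn
  FROM src
) t;
```

Filtering the subquery on `rn = 1` instead would return only the "original" rows, which matches the rank-then-filter idea mentioned later in the thread.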

2012/8/18 Himanish Kushary <[EMAIL PROTECTED]>

> We want to separate them out into different tables for auditing purposes,
> so I was thinking of marking them and then
> separating them out based on the "duplicate" flag.
>
> Is there a better way of splitting them into two tables? Also, how would I
> pick the first from the group and put it into a different table
> once I have run the cross join on fld1 thru fld4 with order by timestamp?
>
> - Himanish
>
> On Fri, Aug 17, 2012 at 11:54 AM, Bertrand Dechoux <[EMAIL PROTECTED]>wrote:
>
>> If I understand correctly: a cross join on fld1 thru fld4 with order by
>> timestamp, taking the first of each group, would allow you to keep all
>> the unique and original copies.
>>
>> But indeed, if you don't have something like HBase, you cannot really
>> mark them, but you can select them and write them elsewhere.
>> Why do you specifically want to mark them?
>>
>> Bertrand
>>
>>
>> On Fri, Aug 17, 2012 at 5:46 PM, Bob Gause <[EMAIL PROTECTED]> wrote:
>>
>>> We use com.facebook.hive.udf.UDFNumberRows to do a ranking by time in
>>> some of our queries. You could do that, and then do another select where
>>> the row number/rank is 1 to get all the "unique" rows.
>>>
>>> There are probably a bunch of other ways to do this, but this is the one
>>> that first came to mind for me.
>>>
>>> Enjoy!
>>> Bob
>>>
>>> Robert Gause
>>> Senior Systems Engineer
>>> ZyQuest, Inc.
>>> [EMAIL PROTECTED]
>>>
>>> On Aug 17, 2012, at 9:49 AM, Himanish Kushary wrote:
>>>
>>> > Hi,
>>> >
>>> > We have a huge table which may have duplicate records. A record is
>>> considered a duplicate based on 4 fields (fld1 thru fld4). We need to
>>> identify the duplicate records and possibly mark the duplicates (except
>>> the first record, based on created time, for each group).
>>> >
>>> > Is this something that could be done with Hive, or do we need to write
>>> custom M/R for this? Could an inner join or a select with group by be used
>>> to find the duplicates? How do I mark the duplicate records, since there
>>> is no update?
>>> >
>>> > What's the best way to do this using Hive? Looking forward to hearing
>>> the suggestions.
>>> >
>>> > Thanks
>>>
>>>
>>
>>
>> --
>> Bertrand Dechoux
>>
>
>
>
> --
> Thanks & Regards
> Himanish
>
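The split into two audit tables that Himanish describes could be sketched with Hive's multi-table insert combined with a row ranking. As above, this assumes Hive 0.11+ windowing (which postdates this thread), and the table names `src`, `originals`, and `duplicates` along with the `fld1`..`fld4` and `created_time` columns are hypothetical.

```sql
-- Sketch: route the first record per group to one table and the rest
-- to another in a single pass, using Hive's multi-table insert.
-- Assumes Hive 0.11+; all table and column names are hypothetical.
FROM (
  SELECT fld1, fld2, fld3, fld4, created_time,
         ROW_NUMBER() OVER (PARTITION BY fld1, fld2, fld3, fld4
                            ORDER BY created_time) AS rn
  FROM src
) t
INSERT OVERWRITE TABLE originals
  SELECT fld1, fld2, fld3, fld4, created_time WHERE rn = 1
INSERT OVERWRITE TABLE duplicates
  SELECT fld1, fld2, fld3, fld4, created_time WHERE rn > 1;
```

Because both inserts read from the same subquery, the source is scanned once, which matters for the "huge table" in the original question.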
Himanish Kushary 2012-08-21, 17:29