HBase, mail # dev - Re: Bulkload discards duplicates


RE: Bulkload discards duplicates
Laxman 2012-03-12, 16:50
Thanks for the quick response, Stack.

I tested again with the proposed patch.
> > Changing this back to List and then sort explicitly will solve the issue.

The same problem still persists, which makes this issue a bit more complicated.

Moving further discussion to JIRA.

--
Regards,
Laxman
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of
> Stack
> Sent: Monday, March 12, 2012 8:50 PM
> To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: Re: Bulkload discards duplicates
>
> On Mon, Mar 12, 2012 at 8:17 AM, Laxman <[EMAIL PROTECTED]> wrote:
> > In our test, we noticed that bulkload is discarding duplicates.
> > On further analysis, I noticed that duplicates are discarded only when
> > they exist in the same input file and in the same split.
> > I think this is a bug and not intentional behavior.
> >
> > Usage of TreeSet in the below code snippet is causing the issue.
> >
> > PutSortReducer.reduce()
> > =====================
> >      TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
> >      long curSize = 0;
> >      // stop at the end or the RAM threshold
> >      while (iter.hasNext() && curSize < threshold) {
> >        Put p = iter.next();
> >        for (List<KeyValue> kvs : p.getFamilyMap().values()) {
> >          for (KeyValue kv : kvs) {
> >            map.add(kv);
> >            curSize += kv.getLength();
> >          }
> >        }
> >
> > Changing this back to a List and then sorting explicitly will solve the issue.
> >
> > Filed a new JIRA for this
> > https://issues.apache.org/jira/browse/HBASE-5564
>
> Thank you for finding the issue and making a JIRA.
> St.Ack
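
For readers following the thread, here is a minimal sketch of the mechanism suspected in the original report: a TreeSet built with a comparator treats any two elements that compare as equal as the same element, so the second one is silently dropped, while a List sorted with the same comparator keeps both. The Cell class and KEY_ONLY comparator are hypothetical stand-ins, not HBase's KeyValue or KeyValue.COMPARATOR; the sketch assumes a comparator that looks only at key fields and ignores the value. As noted in the reply above, switching to a List did not resolve the problem in testing, so this illustrates the TreeSet behavior only, not a confirmed root cause.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

class DuplicateDemo {
    // Hypothetical stand-in for a cell; NOT the real HBase KeyValue class.
    static final class Cell {
        final String row;
        final String qualifier;
        final long timestamp;
        final String value;

        Cell(String row, String qualifier, long timestamp, String value) {
            this.row = row;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Assumption: the comparator orders by key fields only and ignores the value.
    static final Comparator<Cell> KEY_ONLY = Comparator
            .comparing((Cell c) -> c.row)
            .thenComparing(c -> c.qualifier)
            .thenComparingLong(c -> c.timestamp);

    // TreeSet keeps one element per comparator-equal group; later ones are dropped.
    static List<Cell> viaTreeSet(List<Cell> cells) {
        TreeSet<Cell> set = new TreeSet<>(KEY_ONLY);
        set.addAll(cells);
        return new ArrayList<>(set);
    }

    // Sorting a List is stable and keeps every element, including duplicates.
    static List<Cell> viaSortedList(List<Cell> cells) {
        List<Cell> sorted = new ArrayList<>(cells);
        sorted.sort(KEY_ONLY);
        return sorted;
    }

    // Two cells share the same key (row1/q/1) but carry different values.
    static List<Cell> sample() {
        return Arrays.asList(
                new Cell("row1", "q", 1L, "a"),
                new Cell("row1", "q", 1L, "b"),
                new Cell("row2", "q", 1L, "c"));
    }

    public static void main(String[] args) {
        List<Cell> cells = sample();
        System.out.println("TreeSet size:     " + viaTreeSet(cells).size());    // 2
        System.out.println("Sorted list size: " + viaSortedList(cells).size()); // 3
    }
}
```

With the sample input, the TreeSet path keeps only the first of the two cells with the same key, while the sorted list keeps all three.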