Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
HBase >> mail # user >> MR missing lines

Jean-Marc Spaggiari 2012-12-16, 12:52
Kevin Odell 2012-12-16, 14:05
Asaf Mesika 2012-12-16, 16:28
Jean-Marc Spaggiari 2012-12-17, 00:20
Jean-Marc Spaggiari 2012-12-17, 12:15
Jean-Marc Spaggiari 2012-12-18, 13:37
Copy link to this message
RE: MR missing lines
Jean:  just one thought after seeing the description and the code.. Not related to the missing as such

You want to delete the row fully right?
>My table is only one CF with one C with one version
And your code is like
>  Delete delete_entry_proposed = new Delete(key);
>  delete_entry_proposed.deleteColumn(KVs.get(0).getFamily(), KVs.get(0).getQualifier());

deleteColumn() is useful when you want to delete specific column's specific version in a row.  In your case this may be really not needed. Just Delete delete_entry_proposed = new Delete(key);  may be enough so that the delete type is ROW delete.

You can see the javadoc of the deleteColumn() API in which it clearly says it is an expensive op. At the server side there will be a need to do a Get call..
In your case these are really unwanted over head .. I think...

From: Jean-Marc Spaggiari [[EMAIL PROTECTED]]
Sent: Tuesday, December 18, 2012 7:07 PM
Subject: Re: MR missing lines

I faced the issue again today...

RowCounter gave me 104313 lines
Here is the output of the job counters:
12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_ADDED=81594
12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_SIMILAR=434
12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_NO_CHANGES=14250
12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_DUPLICATE=428
12/12/17 22:32:52 INFO mapred.JobClient:     NON_DELETED_ROWS=0
12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_EXISTING=7605
12/12/17 22:32:52 INFO mapred.JobClient:     ROWS_PARSED=104311

There is a 2 lines difference between ROWS_PARSED and he counter.
ENTRY_EXISTING are the 5 states an entry can have. Total of all those
counters is equal to the ROWS_PARSED value, so it's alligned. Code is
handling all the possibilities.

The ROWS_PARSED counter is incremented right at the beginning like
that (I removed the comments and javadoc for lisibility):
                 * The comments ...
                public void map(ImmutableBytesWritable row__, Result values, Context
context) throws IOException

                        List<KeyValue> KVs = values.list();

                                // Get the current row.
                                byte[] key = values.getRow();

                                // First thing we do, we mark this line to be deleted.
                                Delete delete_entry_proposed = new Delete(key);
The deletes_entry_proposed is a list of rows to delete. After each
call to the delete method, the number of remaining lines into this
list is added to NON_DELETED_ROWS which is 0 at the end, so all lines
should be deleted correctly.

I re-ran the rowcounter after the job, and I still have ROWS=5971
lines into the table. I check all my "feeding process" and they are
all closed.

My table is only one CF with one C with one version.

I can guess that the remaining 5971 lines into the table is an error
on my side, but I'm not able to find where since all the counters are
matching. I will add one counter which will add all the entries in the
delete list before calling the delete method. This should match the
number of rows.

Again, I will re-feed the table today with fresh data and re-run the job...


2012/12/17, Jean-Marc Spaggiari <[EMAIL PROTECTED]>:
> The job run the morning, and of course, this time, all the rows got
> processed ;)
> So I will give it few other tries and will keep you posted if I'm able
> to reproduce that again.
> Thanks,
Jean-Marc Spaggiari 2012-12-20, 00:39
Anoop Sam John 2012-12-20, 04:24
Harsh J 2012-12-16, 14:41