Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase, mail # user - MR missing lines


Copy link to this message
-
Re: MR missing lines
Jean-Marc Spaggiari 2012-12-18, 13:37
I faced the issue again today...

RowCounter gave me 104313 lines
Here is the output of the job counters:
12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_ADDED=81594
12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_SIMILAR=434
12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_NO_CHANGES=14250
12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_DUPLICATE=428
12/12/17 22:32:52 INFO mapred.JobClient:     NON_DELETED_ROWS=0
12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_EXISTING=7605
12/12/17 22:32:52 INFO mapred.JobClient:     ROWS_PARSED=104311

There is a 2 lines difference between ROWS_PARSED and he counter.
ENTRY_ADDED, ENTRY_SIMILAR, ENTRY_NO_CHANGES, ENTRY_DUPLICATE and
ENTRY_EXISTING are the 5 states an entry can have. Total of all those
counters is equal to the ROWS_PARSED value, so it's alligned. Code is
handling all the possibilities.

The ROWS_PARSED counter is incremented right at the beginning like
that (I removed the comments and javadoc for lisibility):
/**
* The comments ...
*/
@Override
public void map(ImmutableBytesWritable row__, Result values, Context
context) throws IOException
{

context.getCounter(Counters.ROWS_PARSED).increment(1);
List<KeyValue> KVs = values.list();
try
{

// Get the current row.
byte[] key = values.getRow();

// First thing we do, we mark this line to be deleted.
Delete delete_entry_proposed = new Delete(key);
delete_entry_proposed.deleteColumn(KVs.get(0).getFamily(),
KVs.get(0).getQualifier());
deletes_entry_proposed.add(delete_entry_proposed);
The deletes_entry_proposed is a list of rows to delete. After each
call to the delete method, the number of remaining lines into this
list is added to NON_DELETED_ROWS which is 0 at the end, so all lines
should be deleted correctly.

I re-ran the rowcounter after the job, and I still have ROWS=5971
lines into the table. I check all my "feeding process" and they are
all closed.

My table is only one CF with one C with one version.

I can guess that the remaining 5971 lines into the table is an error
on my side, but I'm not able to find where since all the counters are
matching. I will add one counter which will add all the entries in the
delete list before calling the delete method. This should match the
number of rows.

Again, I will re-feed the table today with fresh data and re-run the job...

JM

2012/12/17, Jean-Marc Spaggiari <[EMAIL PROTECTED]>:
> The job run the morning, and of course, this time, all the rows got
> processed ;)
>
> So I will give it few other tries and will keep you posted if I'm able
> to reproduce that again.
>
> Thanks,
>
> JM
>
> 2012/12/16, Jean-Marc Spaggiari <[EMAIL PROTECTED]>:
>> Thanks for the suggestions.
>>
>> I already have logs to display all the exepctions and there is
>> nothing. I can't display the work done, there is to much :(
>>
>> I have counters "counting" the rows processed and they match what is
>> done, minus what is not processed. I have just added few other
>> counters. One right at the beginning, and one to count what are the
>> records remaining on the delete list, as suggested.
>>
>> I will run the job again tomorrow, see the result and keep you posted.
>>
>> JM
>>
>>
>> 2012/12/16, Asaf Mesika <[EMAIL PROTECTED]>:
>>> Did you check the returned array of the delete method to make sure all
>>> records sent for delete have been deleted?
>>>
>>> Sent from my iPhone
>>>
>>> On 16 בדצמ 2012, at 14:52, Jean-Marc Spaggiari <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a table where I'm running MR each time is exceding 100 000 rows.
>>>>
>>>> When the target is reached, all the feeding process are stopped.
>>>>
>>>> Yesterday it reached 123608 rows. So I stopped the feeding process,
>>>> and ran the MR.
>>>>
>>>> For each line, the MR is creating a delete. The delete is placed on a
>>>> list, and when the list reached 10 elements, it's sent to the table.
>>>> In the clean method, the list is sent to the table if there is any