HBase >> mail # user >> MR missing lines


Re: MR missing lines
I faced the issue again today...

RowCounter gave me 104313 lines
Here is the output of the job counters:
12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_ADDED=81594
12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_SIMILAR=434
12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_NO_CHANGES=14250
12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_DUPLICATE=428
12/12/17 22:32:52 INFO mapred.JobClient:     NON_DELETED_ROWS=0
12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_EXISTING=7605
12/12/17 22:32:52 INFO mapred.JobClient:     ROWS_PARSED=104311

There is a 2-row difference between ROWS_PARSED and the RowCounter result.
ENTRY_ADDED, ENTRY_SIMILAR, ENTRY_NO_CHANGES, ENTRY_DUPLICATE and
ENTRY_EXISTING are the 5 states an entry can have. The total of those
counters is equal to the ROWS_PARSED value, so they are aligned. The code
is handling all the possibilities.
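The arithmetic above can be checked directly. A minimal sketch, using the counter values from the job output (the class and variable names are my own):

```java
// Verify that the five per-state counters sum to ROWS_PARSED, and measure
// the gap against the RowCounter result (values from the job output above).
public class CounterCheck {
    public static void main(String[] args) {
        long entryAdded = 81594, entrySimilar = 434, entryNoChanges = 14250,
             entryDuplicate = 428, entryExisting = 7605;
        long rowsParsed = 104311, rowCounterRows = 104313;

        long sum = entryAdded + entrySimilar + entryNoChanges
                 + entryDuplicate + entryExisting;
        System.out.println(sum == rowsParsed);           // true: states add up
        System.out.println(rowCounterRows - rowsParsed); // 2 rows never reached map()
    }
}
```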

The ROWS_PARSED counter is incremented right at the beginning, like
this (I removed the comments and javadoc for readability):
@Override
public void map(ImmutableBytesWritable row__, Result values, Context context)
    throws IOException
{
    context.getCounter(Counters.ROWS_PARSED).increment(1);
    List<KeyValue> KVs = values.list();
    try
    {
        // Get the current row.
        byte[] key = values.getRow();

        // First thing we do, we mark this line to be deleted.
        Delete delete_entry_proposed = new Delete(key);
        delete_entry_proposed.deleteColumn(KVs.get(0).getFamily(),
                                           KVs.get(0).getQualifier());
        deletes_entry_proposed.add(delete_entry_proposed);
The deletes_entry_proposed variable is a list of rows to delete. After
each call to the delete method, the number of entries remaining in this
list is added to NON_DELETED_ROWS, which is 0 at the end, so all the
rows should have been deleted correctly.
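To see that accounting concretely, here is a minimal, self-contained simulation of the flush step. All names are mine, and a plain list of strings stands in for the HTable and its Deletes; as I understand the 0.94-era client, delete(List<Delete>) removes the successfully applied Deletes from the supplied list, so only the failures remain afterwards:

```java
import java.util.ArrayList;
import java.util.List;

public class FlushSketch {
    static long nonDeletedRows = 0;

    // Stand-in for table.delete(deletes): the real client (assuming the
    // 0.94-era semantics described above) leaves only the Deletes it could
    // not apply in the list. Here everything "succeeds", so the list empties.
    static void tableDelete(List<String> deletes) {
        deletes.clear();
    }

    // Flush the batch and account for whatever was not deleted.
    static void flush(List<String> deletesEntryProposed) {
        tableDelete(deletesEntryProposed);
        nonDeletedRows += deletesEntryProposed.size();
        deletesEntryProposed.clear();
    }

    public static void main(String[] args) {
        List<String> batch = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            batch.add("row-" + i);
        }
        flush(batch);
        System.out.println(nonDeletedRows); // 0 when every delete succeeded
    }
}
```

So a NON_DELETED_ROWS of 0 only proves that everything handed to delete() went through; it says nothing about rows that were never queued in the first place.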

I re-ran RowCounter after the job, and I still have ROWS=5971
rows in the table. I checked all my "feeding" processes and they are
all closed.

My table has only one CF with one C, with one version.

I can guess that the 5971 rows remaining in the table are an error
on my side, but I'm not able to find where, since all the counters
match. I will add one counter which counts all the entries added to the
delete list before calling the delete method. This should match the
number of rows.

Again, I will re-feed the table today with fresh data and re-run the job...

JM

2012/12/17, Jean-Marc Spaggiari <[EMAIL PROTECTED]>:
> The job ran this morning, and of course, this time, all the rows got
> processed ;)
>
> So I will give it a few more tries and will keep you posted if I'm able
> to reproduce that again.
>
> Thanks,
>
> JM
>
> 2012/12/16, Jean-Marc Spaggiari <[EMAIL PROTECTED]>:
>> Thanks for the suggestions.
>>
>> I already have logs to display all the exceptions and there is
>> nothing. I can't display the work done, there is too much :(
>>
>> I have counters "counting" the rows processed and they match what is
>> done, minus what is not processed. I have just added a few other
>> counters: one right at the beginning, and one to count the
>> records remaining on the delete list, as suggested.
>>
>> I will run the job again tomorrow, see the result and keep you posted.
>>
>> JM
>>
>>
>> 2012/12/16, Asaf Mesika <[EMAIL PROTECTED]>:
>>> Did you check the returned array of the delete method to make sure all
>>> records sent for delete have been deleted?
>>>
>>> Sent from my iPhone
>>>
>>> On 16 Dec 2012, at 14:52, Jean-Marc Spaggiari <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a table where I'm running an MR job each time it exceeds 100 000 rows.
>>>>
>>>> When the target is reached, all the feeding processes are stopped.
>>>>
>>>> Yesterday it reached 123608 rows. So I stopped the feeding process,
>>>> and ran the MR.
>>>>
>>>> For each line, the MR job creates a delete. The delete is placed on a
>>>> list, and when the list reaches 10 elements, it's sent to the table.
>>>> In the cleanup method, the list is sent to the table if there is any