HBase >> mail # user >> MR missing lines


Re: MR missing lines
Hi Anoop,

Thanks for the hint! Even if it's not fixing my issue, at least my
tests are going to be faster.

I will take a look at the documentation to understand what
deleteColumn was doing.

JM

2012/12/19, Anoop Sam John <[EMAIL PROTECTED]>:
> Jean: just one thought after seeing the description and the code. Not
> related to the missing lines as such.
>
> You want to delete the row fully right?
>>My table is only one CF with one C with one version
> And your code is like
>>  Delete delete_entry_proposed = new Delete(key);
>>  delete_entry_proposed.deleteColumn(KVs.get(0).getFamily(),
>> KVs.get(0).getQualifier());
>
> deleteColumn() is useful when you want to delete a specific version of a
> specific column in a row. In your case it is probably not needed. Just
> Delete delete_entry_proposed = new Delete(key); may be enough, so that the
> delete type is a ROW delete.
>
> You can see in the javadoc of the deleteColumn() API that it clearly says
> it is an expensive op. On the server side a Get call is needed.
> In your case this is really unwanted overhead, I think...
>
> -Anoop-
> ________________________________________
> From: Jean-Marc Spaggiari [[EMAIL PROTECTED]]
> Sent: Tuesday, December 18, 2012 7:07 PM
> To: [EMAIL PROTECTED]
> Subject: Re: MR missing lines
>
> I faced the issue again today...
>
> RowCounter gave me 104313 lines
> Here is the output of the job counters:
> 12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_ADDED=81594
> 12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_SIMILAR=434
> 12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_NO_CHANGES=14250
> 12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_DUPLICATE=428
> 12/12/17 22:32:52 INFO mapred.JobClient:     NON_DELETED_ROWS=0
> 12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_EXISTING=7605
> 12/12/17 22:32:52 INFO mapred.JobClient:     ROWS_PARSED=104311
>
> There is a 2-line difference between ROWS_PARSED and the row counter.
> ENTRY_ADDED, ENTRY_SIMILAR, ENTRY_NO_CHANGES, ENTRY_DUPLICATE and
> ENTRY_EXISTING are the 5 states an entry can have. The total of all those
> counters is equal to the ROWS_PARSED value, so it's aligned. The code is
> handling all the possibilities.
>
> The ROWS_PARSED counter is incremented right at the beginning, like
> this (I removed the comments and javadoc for readability):
>     /**
>      * The comments ...
>      */
>     @Override
>     public void map(ImmutableBytesWritable row__, Result values, Context context) throws IOException
>     {
>         context.getCounter(Counters.ROWS_PARSED).increment(1);
>         List<KeyValue> KVs = values.list();
>         try
>         {
>             // Get the current row.
>             byte[] key = values.getRow();
>
>             // First thing we do, we mark this line to be deleted.
>             Delete delete_entry_proposed = new Delete(key);
>             delete_entry_proposed.deleteColumn(KVs.get(0).getFamily(), KVs.get(0).getQualifier());
>             deletes_entry_proposed.add(delete_entry_proposed);
>
>
> The deletes_entry_proposed is a list of rows to delete. After each
> call to the delete method, the number of lines remaining in this
> list is added to NON_DELETED_ROWS, which is 0 at the end, so all lines
> should be deleted correctly.
>
> I re-ran the rowcounter after the job, and I still have ROWS=5971
> lines in the table. I checked all my "feeding processes" and they are
> all closed.
>
> My table is only one CF with one C with one version.
>
> I can guess that the remaining 5971 lines in the table are an error
> on my side, but I'm not able to find where since all the counters are
> matching. I will add one counter which will count all the entries in the
> delete list before calling the delete method. This should match the
> number of rows.
>
> Again, I will re-feed the table today with fresh data and re-run the job...
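
For anyone following the thread, the counter arithmetic in the quoted message can be re-checked with a few lines of plain Java. This is only a sketch: the class name CounterCheck and its method names are made up for illustration, and the numbers are copied from the job output and RowCounter figure quoted above. It does not touch HBase at all.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the original job: it only re-adds
// the counter values quoted in the message above.
public class CounterCheck {

    // The five per-entry state counters from the job output.
    static Map<String, Long> entryStates() {
        Map<String, Long> states = new LinkedHashMap<>();
        states.put("ENTRY_ADDED", 81594L);
        states.put("ENTRY_SIMILAR", 434L);
        states.put("ENTRY_NO_CHANGES", 14250L);
        states.put("ENTRY_DUPLICATE", 428L);
        states.put("ENTRY_EXISTING", 7605L);
        return states;
    }

    // Sum of the five state counters.
    static long sumOfStates() {
        long total = 0;
        for (long value : entryStates().values()) {
            total += value;
        }
        return total;
    }

    // Rows reported by RowCounter before the job minus rows the mapper saw.
    static long rowsNotParsed(long rowCounterBefore, long rowsParsed) {
        return rowCounterBefore - rowsParsed;
    }

    public static void main(String[] args) {
        long rowsParsed = 104311L;       // ROWS_PARSED from the job output
        long rowCounterBefore = 104313L; // RowCounter result before the job

        System.out.println("sum of states       = " + sumOfStates());
        System.out.println("matches ROWS_PARSED = " + (sumOfStates() == rowsParsed));
        System.out.println("rows not parsed     = " + rowsNotParsed(rowCounterBefore, rowsParsed));
    }
}
```

The five states do add up to ROWS_PARSED, so the open questions are the 2 rows the mapper never saw and the 5971 rows still reported after the job.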