RE: MR missing lines
Hi All
Be careful when choosing between Delete#deleteColumn() and Delete#deleteColumns().
The deleteColumn() API deletes just one version of a column in a given row, while deleteColumns() deletes all versions of the column.
In Jean's case, which API is used will not matter functionally, as he has only one version per column and only one column in every row.

But deleteColumn() carries an overhead. When it is called without a timestamp (latestTimeStamp is used by default), a Get operation happens within the HRegion to find the timestamp of the most recent version of this column. deleteColumn(cf, qualifier) deletes only the most recent version of cf:qualifier, while deleteColumns(cf, qualifier) deletes the whole column from the row (all versions).
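
To make the difference concrete, here is a small self-contained sketch. It is a toy in-memory model of a single cf:qualifier in one row, not the HBase client API: the class VersionedColumnSketch and all of its method names are hypothetical and exist only for illustration. It assumes the behaviour described above, i.e. that deleting "the latest version" first needs a lookup of the newest timestamp, while deleting all versions does not.

```java
import java.util.TreeMap;

// Toy model of one column (cf:qualifier) in one row.
// NOT the HBase API; every name here is a hypothetical stand-in.
class VersionedColumnSketch {
    // timestamp -> value; TreeMap keeps timestamps sorted, newest is lastKey()
    private final TreeMap<Long, String> versions = new TreeMap<>();
    // how many times the newest timestamp had to be looked up
    private int latestLookups = 0;

    void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    // Stand-in for deleteColumn(cf, qualifier) without a timestamp:
    // the newest timestamp must be read first, then that one version is removed.
    void deleteLatestVersion() {
        latestLookups++; // models the extra read before the delete
        if (!versions.isEmpty()) {
            versions.remove(versions.lastKey());
        }
    }

    // Stand-in for deleteColumns(cf, qualifier): drops every version, no lookup.
    void deleteAllVersions() {
        versions.clear();
    }

    int versionCount() {
        return versions.size();
    }

    String latestValue() {
        return versions.isEmpty() ? null : versions.lastEntry().getValue();
    }

    int latestLookups() {
        return latestLookups;
    }

    public static void main(String[] args) {
        VersionedColumnSketch col = new VersionedColumnSketch();
        col.put(1L, "a");
        col.put(2L, "b");
        col.put(3L, "c");

        col.deleteLatestVersion(); // only the version at timestamp 3 goes away
        System.out.println("versions left: " + col.versionCount());
        System.out.println("now visible:   " + col.latestValue());
        System.out.println("lookups done:  " + col.latestLookups());

        col.deleteAllVersions();   // the whole column is gone
        System.out.println("versions left: " + col.versionCount());
    }
}
```

With a single version per column, as in Jean's table, both calls leave the column empty. The only difference in this model is the extra lookup counted by latestLookups.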

From: Jean-Marc Spaggiari [[EMAIL PROTECTED]]
Sent: Thursday, December 20, 2012 6:09 AM
Subject: Re: MR missing lines

Hi Anoop,

Thanks for the hint! Even if it's not fixing my issue, at least my
tests are going to be faster.

I will take a look at the documentation to understand what
deleteColumn was doing.


2012/12/19, Anoop Sam John <[EMAIL PROTECTED]>:
> Jean: just one thought after seeing the description and the code. Not
> related to the missing lines as such.
> You want to delete the row fully, right?
>>My table is only one CF with one C with one version
> And your code is like
>>  Delete delete_entry_proposed = new Delete(key);
>>  delete_entry_proposed.deleteColumn(KVs.get(0).getFamily(),
>> KVs.get(0).getQualifier());
> deleteColumn() is useful when you want to delete a specific version of a
> specific column in a row. In your case this may not really be needed. Just Delete
> delete_entry_proposed = new Delete(key); may be enough, so that the delete
> type is a ROW delete.
> You can see the javadoc of the deleteColumn() API, which clearly says
> it is an expensive op. On the server side there will be a need to do a Get
> call.
> In your case this is really unwanted overhead, I think.
> -Anoop-
> ________________________________________
> From: Jean-Marc Spaggiari [[EMAIL PROTECTED]]
> Sent: Tuesday, December 18, 2012 7:07 PM
> Subject: Re: MR missing lines
> I faced the issue again today...
> RowCounter gave me 104313 lines
> Here is the output of the job counters:
> 12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_ADDED=81594
> 12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_SIMILAR=434
> 12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_NO_CHANGES=14250
> 12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_DUPLICATE=428
> 12/12/17 22:32:52 INFO mapred.JobClient:     NON_DELETED_ROWS=0
> 12/12/17 22:32:52 INFO mapred.JobClient:     ENTRY_EXISTING=7605
> 12/12/17 22:32:52 INFO mapred.JobClient:     ROWS_PARSED=104311
> There is a 2-line difference between ROWS_PARSED and the RowCounter result.
> ENTRY_ADDED, ENTRY_SIMILAR, ENTRY_NO_CHANGES, ENTRY_DUPLICATE and
> ENTRY_EXISTING are the 5 states an entry can have. The total of all those
> counters is equal to the ROWS_PARSED value, so it's aligned. The code is
> handling all the possibilities.
> The ROWS_PARSED counter is incremented right at the beginning, like
> this (I removed the comments and javadoc for readability):
>                 /**
>                  * The comments ...
>                  */
>                 @Override
>                 public void map(ImmutableBytesWritable row__, Result values, Context context) throws IOException
>                 {
>                         context.getCounter(Counters.ROWS_PARSED).increment(1);
>                         List<KeyValue> KVs = values.list();
>                         try
>                         {
>                                 // Get the current row.
>                                 byte[] key = values.getRow();
>                                 // First thing we do, we mark this line to be deleted.