Eliminating rows with many KVs using a custom filter.
Hello,

I implemented and deployed a custom HBase filter. All it does is omit rows
that contain more than <max> KeyValue pairs. The central part is the
implementation of Filter's filterKeyValue():

// "excludeRow" and "numKVs" are reset in reset() method.
@Override
public ReturnCode filterKeyValue(KeyValue kv) {
if (++numKVs > maxKVs) {
excludeRow = true;
return ReturnCode.NEXT_ROW;
}
return ReturnCode.INCLUDE;
}
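For reference, here is a minimal sketch of what the rest of such a filter
could look like. The class name LargeRowFilter, the FilterBase base class,
and the Writable plumbing are illustrative assumptions based on the
0.94-era filter API, not necessarily our exact code:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.filter.FilterBase;

// Illustrative sketch: drops any row that has more than maxKVs KeyValues.
public class LargeRowFilter extends FilterBase {

    private int maxKVs;
    private int numKVs = 0;
    private boolean excludeRow = false;

    public LargeRowFilter() {
        // empty constructor required for deserialization on the region servers
    }

    public LargeRowFilter(int maxKVs) {
        this.maxKVs = maxKVs;
    }

    @Override
    public void reset() {
        // called at the start of every new row
        numKVs = 0;
        excludeRow = false;
    }

    @Override
    public ReturnCode filterKeyValue(KeyValue kv) {
        if (++numKVs > maxKVs) {
            excludeRow = true;
            return ReturnCode.NEXT_ROW;
        }
        return ReturnCode.INCLUDE;
    }

    @Override
    public boolean filterRow() {
        // last-chance veto: also drops the KVs already INCLUDEd for this row
        return excludeRow;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(maxKVs);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        maxKVs = in.readInt();
    }
}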

I was wondering whether, from a performance point of view, it would be
faster to instead override filterRow(List<KeyValue> kvs) and do something like:

@Override
public void filterRow(List<KeyValue> kvs) {
    if (kvs.size() > maxKVs) {
        excludeRow = true;
    }
}

The disadvantage I see with this approach is that it would have to load the
entire list of KVs for each row first, just to establish whether or not to
drop the row. That alone is potentially enough to bring down our cluster -
see below. My implementation, on the other hand, has the overhead of being
called once per KeyValue.

I use this filter to eliminate abnormally large rows from the scan - rows
contain about 10 KeyValues on average with low variance but a few outlier
rows contain 1 million+ KeyValue pairs. Doing a simple scan/get of those
large rows brings down our region servers (using batch is not an option).
Hence, the need to eliminate these rows as efficiently as possible from the
processing pipeline.
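
For completeness, a minimal sketch of how such a filter would be attached to
a scan; the table name and the threshold of 100 are placeholders, and it
assumes the hypothetical LargeRowFilter class above is already deployed on
the region servers' classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class LargeRowFilterScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");      // placeholder table name
        try {
            Scan scan = new Scan();
            scan.setFilter(new LargeRowFilter(100));     // skip rows with more than 100 KVs
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result row : scanner) {
                    // process only the (small) rows that passed the filter
                }
            } finally {
                scanner.close();
            }
        } finally {
            table.close();
        }
    }
}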

Thank you,

/David
PS: My options to compare both filter variants on big data are limited
since we have only one HBase cluster - the production one ;-)