RE: Eliminating rows with many KVs using a custom filter.
Hi David

The first approach should be better.  If you know which columns you will
always be retrieving, you can also use scan.addColumn(), which is even
better because the scan then reads only those columns.  Maybe you have
tried this already.
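
To picture why the first approach tends to be cheaper, here is a rough plain-Java sketch. It is not HBase code and has no HBase dependency: the class, the method names and the dummy cell iterator are all made up for illustration, and it assumes that returning NEXT_ROW means the remaining cells of that row are never touched.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Plain-Java model of the two strategies. Hypothetical names, not the HBase API. */
class RowSizeFilterSketch {

    /** What happened to one row: was it dropped, and how many cells were read. */
    static final class Outcome {
        final boolean excluded;
        final long cellsRead;

        Outcome(boolean excluded, long cellsRead) {
            this.excluded = excluded;
            this.cellsRead = cellsRead;
        }
    }

    /** Approach 1: count cells as they stream by and stop once the limit is exceeded. */
    static Outcome countPerCell(Iterator<String> cells, int maxKVs) {
        long numKVs = 0;
        while (cells.hasNext()) {
            cells.next();
            if (++numKVs > maxKVs) {
                // Models ReturnCode.NEXT_ROW: the rest of the row is skipped.
                return new Outcome(true, numKVs);
            }
        }
        return new Outcome(false, numKVs);
    }

    /** Approach 2: collect the whole row first, then look at its size. */
    static Outcome countWholeRow(Iterator<String> cells, int maxKVs) {
        List<String> kvs = new ArrayList<>();
        while (cells.hasNext()) {
            kvs.add(cells.next());
        }
        return new Outcome(kvs.size() > maxKVs, kvs.size());
    }

    /** Lazily produces n dummy cells so that nothing is held in memory up front. */
    static Iterator<String> cells(final int n) {
        return new Iterator<String>() {
            private int i = 0;

            @Override
            public boolean hasNext() {
                return i < n;
            }

            @Override
            public String next() {
                if (i >= n) {
                    throw new NoSuchElementException();
                }
                return "kv" + i++;
            }
        };
    }
}
```

With a limit of 100, countPerCell(cells(1000000), 100) gives up after 101 cells, whereas countWholeRow has to pull all 1000000 cells into a list before it can decide. For the normal rows of about 10 cells both variants read the same amount.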

Regards
Ram

> -----Original Message-----
> From: David Koch [mailto:[EMAIL PROTECTED]]
> Sent: Friday, August 17, 2012 6:28 PM
> To: [EMAIL PROTECTED]
> Subject: Eliminating rows with many KVs using a custom filter.
>
> Hello,
>
> I implemented and deployed a custom HBase filter. All it does is omit rows
> which contain more than <max> KeyValue pairs. The central part is
> implementing Filter.filterKeyValue():
>
> // "excludeRow" and "numKVs" are reset in reset() method.
> @Override
> public ReturnCode filterKeyValue(KeyValue kv) {
>     if (++numKVs > maxKVs) {
>         excludeRow = true;
>         return ReturnCode.NEXT_ROW;
>     }
>     return ReturnCode.INCLUDE;
> }
>
> I was wondering whether, from a performance point of view, it would be
> faster to instead override filterRow(List<KeyValue> kvs) and have
> something like:
>
> @Override
> public void filterRow(List<KeyValue> kvs) {
>     if (kvs.size() > maxKVs) {
>         excludeRow = true;
>     }
> }
>
> The disadvantage I see with this method is that it would have to load the
> entire list of kvs for each row first to establish whether or not to drop
> the row. This is potentially enough to bring down our cluster - see below.
> My implementation, on the other hand, has the overhead of the loop.
>
> I use this filter to eliminate abnormally large rows from the scan. Rows
> contain about 10 KeyValues on average with low variance, but a few outlier
> rows contain 1 million+ KeyValue pairs. Doing a simple scan/get of those
> large rows brings down our region servers (using batch is not an option).
> Hence the need to eliminate these rows as efficiently as possible from the
> processing pipeline.
>
> Thank you,
>
> /David
>
>
> PS: My options to compare both filter variants on big data are limited
> since we have only one HBase cluster - the production one ;-)