Eliminating rows with many KVs using a custom filter
David Koch 2012-08-17, 12:58
Hello,
I implemented and deployed a custom HBase filter. All it does is omit rows which contain more than <max> KeyValue pairs. The central part is implementing Filter.filterKeyValue():

    // "excludeRow" and "numKVs" are reset in the reset() method.
    @Override
    public ReturnCode filterKeyValue(KeyValue kv) {
        if (++numKVs > maxKVs) {
            excludeRow = true;
            return ReturnCode.NEXT_ROW;
        }
        return ReturnCode.INCLUDE;
    }

I was wondering whether, from a performance point of view, it would be faster to override filterRow(List<KeyValue> kvs) instead and have something like:

    @Override
    public void filterRow(List<KeyValue> kvs) {
        if (kvs.size() > maxKVs) {
            excludeRow = true;
        }
    }

The disadvantage I see with this method is that it would have to load the entire list of kvs for each row first to establish whether or not to drop the row. This is potentially enough to bring down our cluster (see below). My implementation, on the other hand, has the overhead of the loop.

I use this filter to eliminate abnormally large rows from the scan. Rows contain about 10 KeyValues on average with low variance, but a few outlier rows contain 1 million+ KeyValue pairs. Doing a simple scan/get of those large rows brings down our region servers (using batch is not an option). Hence the need to eliminate these rows from the processing pipeline as efficiently as possible.

Thank you,

/David

PS: My options to compare both filter variants on big data are limited since we have only one HBase cluster: the production one ;-)
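To make the trade-off between the two variants concrete, here is a minimal, self-contained sketch of the counting approach. It does not use the HBase classes: `RowSizeFilterSketch`, `MaxKvFilter`, `filterCell()` and `cellsReadForRow()` are hypothetical stand-ins, and the driver loop assumes that a scanner stops feeding cells of the current row to the filter as soon as NEXT_ROW is returned. Under that assumption, the number of cells inspected for an outlier row is bounded by maxKVs + 1 rather than by the row size.

```java
import java.util.Iterator;
import java.util.stream.IntStream;

/** Simplified model of the per-row filter call sequence; NOT the HBase API. */
final class RowSizeFilterSketch {

    enum ReturnCode { INCLUDE, NEXT_ROW }

    /** Counts cells per row and abandons the row once the limit is exceeded. */
    static final class MaxKvFilter {
        private final int maxKVs;
        private int numKVs;
        private boolean excludeRow;

        MaxKvFilter(int maxKVs) { this.maxKVs = maxKVs; }

        /** Called before each new row, mirroring reset() in the post. */
        void reset() { numKVs = 0; excludeRow = false; }

        /** Called once per cell, mirroring filterKeyValue() in the post. */
        ReturnCode filterCell() {
            if (++numKVs > maxKVs) {
                excludeRow = true;
                return ReturnCode.NEXT_ROW;
            }
            return ReturnCode.INCLUDE;
        }

        /** True if the row just processed should be dropped. */
        boolean filterRow() { return excludeRow; }
    }

    /** Feeds one row's cells to the filter and returns how many were read. */
    static int cellsReadForRow(MaxKvFilter filter, Iterator<Integer> cells) {
        filter.reset();
        int read = 0;
        while (cells.hasNext()) {
            cells.next();
            read++;
            if (filter.filterCell() == ReturnCode.NEXT_ROW) {
                break; // assumed scanner behaviour: skip the rest of the row
            }
        }
        return read;
    }

    public static void main(String[] args) {
        MaxKvFilter filter = new MaxKvFilter(100);

        // A typical row with 10 cells is read in full and kept.
        int normal = cellsReadForRow(filter, IntStream.range(0, 10).iterator());
        System.out.println(normal + " cells read, dropped=" + filter.filterRow());

        // An outlier row with 1,000,000 cells is abandoned after 101 cells.
        int outlier = cellsReadForRow(filter, IntStream.range(0, 1_000_000).iterator());
        System.out.println(outlier + " cells read, dropped=" + filter.filterRow());
    }
}
```

In this model the list-based variant would have to materialise all 1,000,000 cells before it could call size(), which is the cost the post is worried about. Whether a real region server behaves the same way depends on the HBase version and scanner internals, so this is only an illustration of the reasoning, not a benchmark.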
Ramkrishna.S.Vasudevan 2012-08-17, 13:17