Accumulo, mail # user - Combiner behaviour


Re: Combiner behaviour
Josh Elser 2014-03-24, 22:42
Russ,

Check out https://github.com/joshelser/accumulo-column-summing

Using the SummingCombiner with a call to
ScannerBase#fetchColumn(Text,Text) will be a pretty decent solution for
modest data sets. The (better articulated than previously) reason why
the SummingCombiner is sub-par is that it only sums within a single row
and not across rows, so the client still gets one result back per row
and has to add them up itself. This is why a custom iterator that sums
across rows is desirable.
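
To make the within-row versus across-row distinction concrete, here is a
minimal sketch in plain Java. It does not use the Accumulo API at all; the
class and method names (RowVersusTableSum, sumWithinRows, sumAcrossRows) and
the sample data are hypothetical and only model the two behaviours.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model with no Accumulo dependency; all names are illustrative.
class RowVersusTableSum {
    // Combiner-style: values that share a row collapse into one value per row.
    static Map<String, Long> sumWithinRows(String[] rows, long[] values) {
        Map<String, Long> perRow = new TreeMap<>();
        for (int i = 0; i < rows.length; i++) {
            perRow.merge(rows[i], values[i], Long::sum);
        }
        return perRow;
    }

    // Cross-row style: every value is folded into a single running total.
    static long sumAcrossRows(long[] values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        String[] rows = {"row1", "row1", "row2", "row3"};
        long[] values = {1, 2, 3, 4};

        // The per-row approach still hands back one entry per row,
        // so the caller has to finish the sum itself.
        Map<String, Long> perRow = sumWithinRows(rows, values);
        long clientTotal = 0;
        for (long v : perRow.values()) {
            clientTotal += v;
        }
        System.out.println("per-row results returned: " + perRow.size());
        System.out.println("total after client-side add: " + clientTotal);

        // The cross-row approach hands back a single number.
        System.out.println("cross-row total: " + sumAcrossRows(values));
    }
}
```

Both paths arrive at the same total; the difference is how many entries have
to come back before the caller can compute it.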

For some results, you can try running the microbenchmark from the test
class in the above repository. It creates a table with 1M rows, 7 columns
per row, and sums over a single column. We can lower the split threshold
on our table to split it out into more Tablets, which should give more
realistic performance (you pay the penalty for the RPC calls that you
would at "scale"). The reduction in the number of keys returned (and thus
the amount of data over the wire) should be the primary reason this
approach is desirable.
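
As a rough illustration of why the number of returned keys drops so much,
the sketch below models a table cut into Tablets by split points, with one
partial sum coming back per Tablet. It is plain Java, not Accumulo code: the
evenly spaced splits, the value of 1 in every row, and the names
TabletPartialSums, evenSplits and partialSums are all assumptions made for
the example. Real split points depend on the split threshold and the data.

```java
import java.util.Arrays;

// Plain-Java model of per-Tablet partial sums; all names are illustrative.
class TabletPartialSums {
    // Evenly spaced split points over row ids 0..numRows-1 (an assumption).
    // n split points divide the rows into n + 1 Tablets.
    static int[] evenSplits(int numRows, int numSplits) {
        int[] splits = new int[numSplits];
        for (int i = 1; i <= numSplits; i++) {
            splits[i - 1] = (int) ((long) i * numRows / (numSplits + 1));
        }
        return splits;
    }

    // One partial sum per Tablet, the way a cross-row iterator could return it.
    static long[] partialSums(long[] valuePerRow, int[] sortedSplits) {
        long[] partials = new long[sortedSplits.length + 1];
        int tablet = 0;
        for (int row = 0; row < valuePerRow.length; row++) {
            // Move to the next Tablet once the row id reaches its split point.
            while (tablet < sortedSplits.length && row >= sortedSplits[tablet]) {
                tablet++;
            }
            partials[tablet] += valuePerRow[row];
        }
        return partials;
    }

    public static void main(String[] args) {
        int numRows = 1_000_000;
        long[] values = new long[numRows];
        Arrays.fill(values, 1L); // hypothetical data: every row holds 1

        long[] partials = partialSums(values, evenSplits(numRows, 65));
        System.out.println("results with per-row sums: " + numRows);
        System.out.println("results with per-Tablet sums: " + partials.length);
        System.out.println("final total: " + Arrays.stream(partials).sum());
    }
}
```

With 65 split points the model returns 66 partial sums instead of 1,000,000
per-row values, which is the same shape as the result counts in the
benchmark output.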

Hope this makes things clearer!

Number of splits for table: 65
Number of results to sum: 66
Time for iterator: 4482 ms
Number of results to sum: 1000000
Time for combiner: 4314 ms

Number of results to sum: 66
Time for iterator: 3651 ms
Number of results to sum: 1000000
Time for combiner: 3754 ms

Number of results to sum: 66
Time for iterator: 3685 ms
Number of results to sum: 1000000
Time for combiner: 3839 ms

Number of results to sum: 66
Time for iterator: 3643 ms
Number of results to sum: 1000000
Time for combiner: 4066 ms

Number of results to sum: 66
Time for iterator: 3880 ms
Number of results to sum: 1000000
Time for combiner: 4084 ms
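
For a quick summary of the five runs above, this small sketch averages the
reported times. The numbers are copied from the output; the class name
TimingSummary and its mean helper are just for illustration.

```java
import java.util.Locale;

// Averages the timings reported in the benchmark output above.
class TimingSummary {
    static double mean(long[] millis) {
        long sum = 0;
        for (long ms : millis) {
            sum += ms;
        }
        return (double) sum / millis.length;
    }

    public static void main(String[] args) {
        long[] iteratorMs = {4482, 3651, 3685, 3643, 3880};
        long[] combinerMs = {4314, 3754, 3839, 4066, 4084};

        System.out.println(String.format(Locale.ROOT,
            "iterator mean: %.1f ms", mean(iteratorMs))); // 3868.2 ms
        System.out.println(String.format(Locale.ROOT,
            "combiner mean: %.1f ms", mean(combinerMs))); // 4011.4 ms
    }
}
```

The two means are within a few percent of each other in this run, so the
difference between the approaches shows up in the number of results returned
(66 versus 1000000) rather than in wall-clock time.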

On 3/20/14, 9:49 PM, Josh Elser wrote: