How many tablets are in the table you're running MR over (see the
monitor)? Might adding some more splits to your table (`addsplits` in
the Accumulo shell) get you better parallelism?
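If the table turns out to have fewer tablets than you have map slots, that alone could cap the number of concurrent scans, since the input format typically produces about one map task per tablet. As a rough sketch of picking split points, the snippet below generates evenly spaced two-character hex prefixes and prints them as an `addsplits` command. It assumes your row IDs begin with roughly uniformly distributed hex characters, which may not match your data at all. The class `SplitPoints`, its methods, and the table name `mytable` are all hypothetical, made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SplitPoints {
    // Returns numTablets - 1 evenly spaced split points over the
    // two-hex-digit prefix space 00..ff.
    // ASSUMPTION: row IDs start with uniformly distributed hex characters.
    static List<String> hexSplits(int numTablets) {
        if (numTablets < 2 || numTablets > 256) {
            throw new IllegalArgumentException("numTablets must be in 2..256");
        }
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numTablets; i++) {
            int boundary = i * 256 / numTablets;
            splits.add(String.format("%02x", boundary));
        }
        return splits;
    }

    // Formats the split points as a command line for the Accumulo shell.
    static String addSplitsCommand(String table, List<String> splits) {
        return "addsplits -t " + table + " " + String.join(" ", splits);
    }

    public static void main(String[] args) {
        // "mytable" is a placeholder; 64 matches the map slot count on the
        // 8 node test cluster.
        System.out.println(addSplitsCommand("mytable", hexSplits(64)));
    }
}
```

Paste the printed line into the Accumulo shell, then check the tablet count for the table in the monitor again. If your row IDs are not hex-like, pick split points from a sample of your real keys instead.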
What does the data in your table look like? Lots of small rows? A few
very large rows?
On 4/2/13 10:56 AM, Marc Reichman wrote:
> I am running an Accumulo-based MR job using the AccumuloRowInputFormat
> on 1.4.1. Config is more-or-less default, using the native-standalone
> 3GB template, but with the TServer memory put up to 2GB in
> accumulo-env.sh from its default. accumulo-site.xml has
> tserver.memory.maps.max at 1G, tserver.cache.data.size at 50M, and
> tserver.cache.index.size at 512M.
> My tables are created with maxversions for all three types (scan,
> minc, majc) at 1 and compress type as gz.
> I am finding, on an 8 node test cluster with 64 map task slots, that
> when a job is running, the 'Running Scans' count in the monitor is
> roughly 0-4 on average for each tablet server. When viewed at the
> table view, this puts the running scans anywhere from 4-24 on average.
> I would expect/hope the scans to be somewhere close to the map task
> count. To me, this means one of the following.
> 1. There is a configuration setting inhibiting the number of scans
> from accumulating (excuse the pun) to about the same number as my map
> task count.
> 2. My map task job is cpu-intensive enough to introduce delays between
> scans and everything is fine
> 3. Some combination of 1/2.
> On an alternate cluster, 40 nodes with 320 task slots, we haven't seen
> anywhere near full capacity scanning with map tasks which have the
> same performance, and the problem seems much worse.
> I am experimenting with some of the readahead configuration variables
> for the tablet servers in the meantime, but haven't found any smoking
> guns yet.
> Thank you,