Thanks for writing back. I am doing all explicit splits using addSplits in
the Java API since the keyspace is easy to divide evenly. Depending on the
table size for some of these experiments, I've had 128 splits, 256, 512, or
1024 splits. My jobs are executing properly, MR-wise, in the sense that the
expected number of map tasks is created (matching the split counts above).
My concern is that the jobs may not be as busy as they could be,
dataflow-wise, and I think the "Running Scans" counts per table and per
tablet server are good indicators of that.
My data has a 32-byte key (an md5 value) and one column family with
3 columns which contain "bigger" data, anywhere from 50-100k up to an
occasional 10M-15M value.
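
A minimal sketch of how evenly spaced split points over an md5 keyspace can be computed, assuming the row keys are stored as 32-character lowercase hex strings. The class and method names here are hypothetical and only illustrate the idea. The resulting strings would be wrapped in Text objects and handed to TableOperations.addSplits(tableName, splits).

```java
import java.math.BigInteger;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper, not part of Accumulo: computes evenly spaced
// split points across the 128-bit md5 space.
public class EvenMd5Splits {

    // Returns (tablets - 1) split points as 32-char lowercase hex strings.
    // Assumption: row keys are hex-encoded md5 values, so lexicographic
    // order of the fixed-width hex strings matches numeric order.
    public static SortedSet<String> evenHexSplits(int tablets) {
        if (tablets < 2) {
            throw new IllegalArgumentException("need at least 2 tablets");
        }
        BigInteger space = BigInteger.ONE.shiftLeft(128); // size of md5 space
        BigInteger n = BigInteger.valueOf(tablets);
        SortedSet<String> splits = new TreeSet<String>();
        for (int i = 1; i < tablets; i++) {
            // i-th boundary = floor(i * 2^128 / tablets)
            BigInteger point = space.multiply(BigInteger.valueOf(i)).divide(n);
            String hex = point.toString(16);
            // Left-pad with zeros to the full 32 hex digits.
            StringBuilder padded = new StringBuilder();
            for (int p = hex.length(); p < 32; p++) {
                padded.append('0');
            }
            splits.add(padded.append(hex).toString());
        }
        return splits;
    }

    public static void main(String[] args) {
        // Print the three boundaries that divide the keyspace into 4 tablets.
        for (String s : evenHexSplits(4)) {
            System.out.println(s);
        }
    }
}
```

Note that N split points produce N + 1 tablets, so the map task count follows the tablet count rather than the split count exactly.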
On Tue, Apr 2, 2013 at 10:06 AM, Josh Elser <[EMAIL PROTECTED]> wrote:
> Hi Marc,
> How many tablets are in the table you're running MR over (see the
> monitor)? Might adding some more splits to your table (`addsplits` in the
> Accumulo shell) get you better parallelism?
> What does your data look like in your table? Lots of small rows? Few very
> large rows?
> On 4/2/13 10:56 AM, Marc Reichman wrote:
>> I am running an Accumulo-based MR job using the AccumuloRowInputFormat on
>> 1.4.1. Config is more-or-less default, using the native-standalone 3GB
>> template, but with the TServer memory put up to 2GB in accumulo-env.sh from
>> its default. accumulo-site.xml has tserver.memory.maps.max at 1G,
>> tserver.cache.data.size at 50M, and tserver.cache.index.size at 512M.
>> My tables are created with maxversions for all three types (scan, minc,
>> majc) at 1 and compress type as gz.
>> I am finding, on an 8 node test cluster with 64 map task slots, that when
>> a job is running, the 'Running Scans' count in the monitor is roughly 0-4
>> on average for each tablet server. When viewed at the table view, this puts
>> the running scans anywhere from 4-24 on average. I would expect/hope the
>> scans to be somewhere close to the map task count. To me, this means one of
>> the following.
>> 1. There is a configuration setting inhibiting the number of scans from
>> accumulating (excuse the pun) to about the same number as my map tasks
>> 2. My map task job is CPU-intensive enough to introduce delays between
>> scans and everything is fine
>> 3. Some combination of 1/2.
>> On an alternate cluster, 40 nodes with 320 task slots, we haven't seen
>> anywhere near full capacity scanning with map tasks which have the same
>> performance, and the problem seems much worse.
>> I am experimenting with some of the readahead configuration variables for
>> the tablet servers in the meantime, but haven't found any smoking guns yet.
>> Thank you,
>> http://saucyandbossy.wordpress.com