Accumulo, mail # user - Straggler problem in Accumulo BatchScans


Re: Straggler problem in Accumulo BatchScans
Eric Newton 2013-08-21, 23:53
You can write your own balancer, and use it just for your table.

-Eric
On Wed, Aug 21, 2013 at 7:47 PM, Slater, David M.
<[EMAIL PROTECTED]> wrote:

> Hi James,
>
> I already had the data sharded into 7 partitions, and that works well to
> distribute the data into 7 tablets. (I have 2 GB tablet sizes, with about
> 1.2 TB of data, so there are numerous tablets per server). The difficulty
> is that Accumulo seems to decide for itself where each tablet goes. When I
> only had 10 GB of data, it nicely divided into 7 tablets, one on each node.
> However, with dozens of tablets per tablet server, it assigns tablets to
> tablet servers orthogonally to my presplits.
>
> Is there a way to force Accumulo to keep specific ranges on specific
> nodes? If not, I suppose that I could have 4x or more shards per tablet
> server to ensure that it was more uniformly placed.
>
> D
>
> *From:* James Hughes [mailto:[EMAIL PROTECTED]]
> *Sent:* Wednesday, August 21, 2013 7:29 PM
> *To:* [EMAIL PROTECTED]
> *Subject:* Re: Straggler problem in Accumulo BatchScans
>
> David,
>
> Each tablet is hosted by one tablet server, and there's no way around
> that.  (This is actually quite reasonable; otherwise, we would receive
> duplicate results from multiple tablet servers.)
>
> One strategy to deal with imbalanced data is to add a random partition
> prefix to your row Ids.  This does complicate building queries, but in
> general, you'll be able to leverage all of your nodes.  I did some testing
> with the number of such random shard ids, and it seems like having 1-2x as
> many shards as tablet servers worked pretty well.  (I'd suggest 2x in case
> you ever grow your cloud.)
>
> In particular, if you can reingest your data, prepend a random shard prefix
> ("01~" through "14~") to the beginning of each row Id, and see if that helps.
> After that, you can "help" Accumulo decide where it should split tablets with
> addSplits 01 02 <etc> 14 from the Accumulo shell (or programmatically with
> TableOperations.addSplits).  After that, you can make sure that your 14+
> splits are distributed across the 7 nodes in a reasonable way.
>
> I hope that helps,
>
> Jim
>
>
> http://accumulo.apache.org/1.4/apidocs/org/apache/accumulo/core/client/admin/TableOperations.html#addSplits%28java.lang.String,%20java.util.SortedSet%29
>
> On Wed, Aug 21, 2013 at 7:09 PM, Slater, David M. <[EMAIL PROTECTED]>
> wrote:
>
> Hey, I have a 7-node network running Accumulo 1.4.1 and Hadoop 1.0.4.
>
> When I run large BatchScanner operations, the number of tablets scanned
> per node is not uniform, leading to the overloaded nodes taking much longer
> to finish than the others. For queries that require all of the scans to
> finish before returning, this is a major latency issue. What are some
> practical means of load-balancing this to reduce delay?
>
> Is it possible for tablets to be hosted on multiple tablet servers, up to
> the replication factor of the underlying HDFS? Are there reasons this might
> be an undesirable design?
>
> Thanks in advance,
> David
>
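For anyone reading this thread later, here is a rough sketch of the shard-prefix idea Jim describes above. It is only an illustration under a few assumptions: the class and method names (ShardKeys, shardPrefix, shardedRow, splitPoints) are made up for this example and are not part of Accumulo, and the prefix is derived from a hash of the row Id rather than a truly random number so that a lookup can recompute it. The code uses plain strings; in a real client you would convert each split point to an org.apache.hadoop.io.Text and pass the set to TableOperations.addSplits (linked in Jim's message).

```java
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper, not part of the Accumulo API.
final class ShardKeys {

    // Two-digit shard id in the range 01..numShards.
    // Assumption: a hash of the row Id stands in for the "random" prefix,
    // so the same row Id always maps to the same shard.
    static String shardPrefix(String rowId, int numShards) {
        int shard = Math.floorMod(rowId.hashCode(), numShards) + 1;
        return String.format(Locale.ROOT, "%02d", shard);
    }

    // Row Id as it would be written at ingest time, e.g. "07~originalRow".
    static String shardedRow(String rowId, int numShards) {
        return shardPrefix(rowId, numShards) + "~" + rowId;
    }

    // Split points "01", "02", ..., matching the addSplits example in the
    // thread. Zero padding keeps lexicographic order equal to numeric order.
    static SortedSet<String> splitPoints(int numShards) {
        SortedSet<String> splits = new TreeSet<>();
        for (int i = 1; i <= numShards; i++) {
            splits.add(String.format(Locale.ROOT, "%02d", i));
        }
        return splits;
    }
}
```

With 7 tablet servers and the 2x rule of thumb from the thread, you would call splitPoints(14) once when creating the table and shardedRow(rowId, 14) for every row at ingest. A scan over a range of original row Ids then has to be issued once per prefix, which is the query complication Jim mentions.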