Accumulo >> mail # user >> Straggler problem in Accumulo BatchScans


Slater, David M. 2013-08-21, 23:09
James Hughes 2013-08-21, 23:29
Slater, David M. 2013-08-21, 23:47
Re: Straggler problem in Accumulo BatchScans
You can write your own balancer, and use it just for your table.

-Eric
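
One way to picture what a table-specific balancer could aim for is sketched below. This is not Accumulo's balancer plug-in API: the class and method names are hypothetical, tablets and servers are represented as plain strings, and a real balancer would also have to handle migrations and servers joining or leaving. The sketch only shows the placement rule, which hands out tablets in row order round-robin so that neighbouring tablets land on different servers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Accumulo.
final class RoundRobinAssignment {

    // Assigns tablets, given in sorted row order, to servers in turn.
    // Any run of k consecutive tablets then touches each server at most
    // ceil(k / servers.size()) times.
    static Map<String, List<String>> assign(List<String> tabletsInRowOrder,
                                            List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("need at least one server");
        }
        Map<String, List<String>> plan = new LinkedHashMap<>();
        for (String server : servers) {
            plan.put(server, new ArrayList<>());
        }
        for (int i = 0; i < tabletsInRowOrder.size(); i++) {
            String server = servers.get(i % servers.size());
            plan.get(server).add(tabletsInRowOrder.get(i));
        }
        return plan;
    }
}
```

With 14 tablets and 7 servers, each server receives exactly two tablets, and a scan over any 7 adjacent tablets reaches every server once.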
On Wed, Aug 21, 2013 at 7:47 PM, Slater, David M.
<[EMAIL PROTECTED]> wrote:

> Hi James,
>
> I already had the data sharded into 7 partitions, and that works well to
> distribute the data into 7 tablets. (I have 2 GB tablet sizes, with about
> 1.2 TB of data, so there are numerous tablets per server). The difficulty
> is that Accumulo seems to decide for itself where each tablet goes. When I
> only had 10 GB of data, it nicely divided into 7 tablets, one on each node.
> However, with dozens of tablets per tablet server, it assigns tablets to
> tablet servers orthogonally to my presplits.
>
> Is there a way to force Accumulo to keep specific ranges on specific
> nodes? If not, I suppose that I could have 4x or more shards per tablet
> server to ensure that it was more uniformly placed.
>
> D
>
> *From:* James Hughes [mailto:[EMAIL PROTECTED]]
> *Sent:* Wednesday, August 21, 2013 7:29 PM
> *To:* [EMAIL PROTECTED]
> *Subject:* Re: Straggler problem in Accumulo BatchScans
>
> David,
>
> Each tablet is hosted by one tablet server, and there's no way around
> that.  (This is actually quite reasonable; otherwise, we would receive
> duplicate results from multiple tablet servers.)
>
> One strategy to deal with imbalanced data is to add a random partition
> prefix to your row IDs.  This does complicate building queries, but in
> general, you'll be able to leverage all of your nodes.  I did some testing
> with the number of such random shard IDs, and it seems like having 1-2x as
> many shards as tablet servers worked pretty well.  (I'd suggest 2x in case
> you ever grow your cloud.)
>
> In particular, if you can reingest your data, prepend a random prefix from
> "01~" to "14~" to each row ID, and see if that helps.  After that, you can
> "help" Accumulo decide where it should split tablets with addsplits 01 02
> <etc> 14 from the Accumulo shell (or programmatically with the addSplits
> method linked below).  Finally, you can make sure that your 14+ splits are
> distributed across the 7 nodes in a reasonable way.
>
> I hope that helps,
>
> Jim
>
>
> http://accumulo.apache.org/1.4/apidocs/org/apache/accumulo/core/client/admin/TableOperations.html#addSplits%28java.lang.String,%20java.util.SortedSet%29
>
> On Wed, Aug 21, 2013 at 7:09 PM, Slater, David M. <[EMAIL PROTECTED]>
> wrote:
>
> Hey, I have a 7-node network running Accumulo 1.4.1 and Hadoop 1.0.4.
>
> When I run large BatchScanner operations, the number of tablets scanned
> per node is not uniform, leading to the overloaded nodes taking much longer
> to finish than the others. For queries that require all of the scans to
> finish before returning, this is a major latency issue. What are some
> practical means of load-balancing this to reduce delay?
>
> Is it possible for tablets to be hosted on multiple tablet servers, up to
> the replication factor of the underlying HDFS? Are there reasons this might
> be an undesirable design?
>
> Thanks in advance,
> David
>
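
The shard-prefix scheme Jim describes in the quoted message can be sketched roughly as follows. Some assumptions to flag: the class and method names are hypothetical, and the prefix is derived from a hash of the row ID rather than chosen at random, so that a point lookup can recompute which shard a row landed in. A truly random prefix, as Jim suggests, spreads data the same way but requires querying every shard. The code has no Accumulo dependency; the real TableOperations.addSplits call takes a SortedSet of Hadoop Text values, so the strings produced here would need converting first.

```java
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper; names are illustrative and not part of Accumulo.
final class ShardKeys {

    // Prepends a two-digit shard prefix ("01~" .. "NN~") to a row ID.
    // Assumption: the shard is chosen from the row ID's hash, not at random.
    static String shardedRowId(String rowId, int numShards) {
        checkShardCount(numShards);
        int shard = Math.floorMod(rowId.hashCode(), numShards) + 1;
        return String.format(Locale.ROOT, "%02d~%s", shard, rowId);
    }

    // Split points "01", "02", ..., "NN", matching the addSplits example.
    static SortedSet<String> splitPoints(int numShards) {
        checkShardCount(numShards);
        SortedSet<String> splits = new TreeSet<>();
        for (int i = 1; i <= numShards; i++) {
            splits.add(String.format(Locale.ROOT, "%02d", i));
        }
        return splits;
    }

    // Two-digit, zero-padded prefixes only sort correctly for 1..99 shards.
    private static void checkShardCount(int numShards) {
        if (numShards < 1 || numShards > 99) {
            throw new IllegalArgumentException("numShards must be in 1..99");
        }
    }
}
```

For the 7-node cluster in this thread, ShardKeys.splitPoints(14) yields the fourteen split points "01" through "14", and every row written through shardedRowId(rowId, 14) falls into one of the fourteen resulting shard ranges.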
Dave Marion 2013-08-21, 23:28
Slater, David M. 2013-08-21, 23:54
Eric Newton 2013-08-22, 00:03
Slater, David M. 2013-08-22, 00:12
dlmarion@... 2013-08-22, 00:15
James Hughes 2013-08-22, 14:13