Accumulo >> mail # user >> Straggler problem in Accumulo BatchScans


Slater, David M. 2013-08-21, 23:09
James Hughes 2013-08-21, 23:29
Slater, David M. 2013-08-21, 23:47
Eric Newton 2013-08-21, 23:53
Dave Marion 2013-08-21, 23:28
Slater, David M. 2013-08-21, 23:54
Eric Newton 2013-08-22, 00:03
Slater, David M. 2013-08-22, 00:12
dlmarion@... 2013-08-22, 00:15
Re: Straggler problem in Accumulo BatchScans
David,

Have you tried the TableLoadBalancer?  I'd try it before rolling your
own...  I think it should spread the tablets of your one table across
the tablet servers in a balanced way, rather than balancing all of
the tablets for all tables across the nodes.
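
To illustrate the distinction above, here is a minimal Python sketch. It is a toy model, not Accumulo code, and it makes no claim about how TableLoadBalancer is implemented. The function names, the table names ("netflow", "other"), and the tablet counts are all hypothetical; only the 7 servers come from the thread. It compares a placement that merely equalizes total tablets per server with one that spreads each table's tablets separately.

```python
from collections import Counter

def fill_servers_in_order(tables, servers):
    """Hypothetical worst case: every server ends up with the same total
    number of tablets, but each table is packed onto only a few servers."""
    total = sum(tables.values())
    per_server = -(-total // len(servers))  # ceiling division
    assignment = {}
    slot = 0
    for table, tablet_count in tables.items():
        for i in range(tablet_count):
            assignment[(table, i)] = servers[slot // per_server]
            slot += 1
    return assignment

def per_table_round_robin(tables, servers):
    """Spread each table's tablets across all servers, one table at a time."""
    assignment = {}
    for table, tablet_count in tables.items():
        for i in range(tablet_count):
            assignment[(table, i)] = servers[i % len(servers)]
    return assignment

def tablets_per_server(assignment, table, servers):
    """Count how many tablets of one table each server hosts."""
    counts = Counter(s for (t, _), s in assignment.items() if t == table)
    return [counts.get(s, 0) for s in servers]

servers = [f"ts{i}" for i in range(7)]      # 7 nodes, as in the thread
tables = {"netflow": 28, "other": 28}       # assumed tablet counts

packed = fill_servers_in_order(tables, servers)
spread = per_table_round_robin(tables, servers)
print(tablets_per_server(packed, "netflow", servers))
print(tablets_per_server(spread, "netflow", servers))
```

In the packed placement every server still holds 8 tablets in total, yet a scan over the "netflow" table would hit only four of the seven servers. That is the kind of skew a per-table balancer is meant to avoid.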

Other than that, I'd consider your key design and query plans.  If you are
routinely working with 0.5-5% of your data, I imagine things will be a
little slow in general...

Good luck!

Jim
On Wed, Aug 21, 2013 at 8:15 PM, <[EMAIL PROTECTED]> wrote:

> You can set it in the shell on the table. Just override the default tablet
> balancer for the table. I think the master also has to be using the
> TableLoadBalancer, if that is not already the default.
>
> ------------------------------
> *From: *"David M. Slater" <[EMAIL PROTECTED]>
> *To: *[EMAIL PROTECTED]
> *Sent: *Wednesday, August 21, 2013 8:12:46 PM
>
> *Subject: *RE: Straggler problem in Accumulo BatchScans
>
> Thanks Eric,
>
>
>
> Just to make sure I’m going in the right direction, this would involve
> extending the TabletBalancer class, correct? How do I add it to the table
> after that (and remove the old one)? I don’t see it under the Connector’s
> TableOperations().
>
>
>
> Is using a load-balancer what you would recommend if I wanted to make sure
> that two different tables stored related information (e.g. data and
> indexes) on the same tablets?
>
>
>
> Thanks,
> David
>
>
>
> *From:* Eric Newton [mailto:[EMAIL PROTECTED]]
> *Sent:* Wednesday, August 21, 2013 8:03 PM
> *To:* [EMAIL PROTECTED]
> *Subject:* Re: Straggler problem in Accumulo BatchScans
>
>
>
> A new balancer is a plug-in class that instructs the Master process where
> to place tablets.
>
>
>
> If you know you need your tablets spread out over servers based on time
> (row id), you can do that.  It's pretty common, in fact.
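
As a rough illustration of spreading tablets by row id, the following Python sketch models tablets as positions in row-id (time) order and compares two placements. It is a toy model under stated assumptions, not a TabletBalancer implementation: the function names are hypothetical, and the total of 700 tablets is an assumed figure (the thread only gives 7 nodes and scans of about 7-28 tablets).

```python
from collections import Counter

def stripe(num_tablets, num_servers):
    """Tablet i (in row-id order) goes to server i % num_servers."""
    return [i % num_servers for i in range(num_tablets)]

def blocks(num_tablets, num_servers):
    """Consecutive tablets are grouped onto the same server."""
    per_server = -(-num_tablets // num_servers)  # ceiling division
    return [i // per_server for i in range(num_tablets)]

def busiest_server_load(placement, first_tablet, tablets_scanned):
    """Largest number of scanned tablets hosted by any single server."""
    scanned = placement[first_tablet:first_tablet + tablets_scanned]
    return max(Counter(scanned).values())

NUM_TABLETS = 700   # assumed; not stated in the thread
NUM_SERVERS = 7     # from the thread

striped = stripe(NUM_TABLETS, NUM_SERVERS)
blocked = blocks(NUM_TABLETS, NUM_SERVERS)

# A time-range scan covering 28 consecutive tablets.
print(busiest_server_load(striped, 100, 28))
print(busiest_server_load(blocked, 100, 28))
```

When every scan must wait for its slowest server, the busiest server's tablet count is a reasonable proxy for latency. Striping keeps that count near the number of scanned tablets divided by the number of servers, while block placement can put the whole range on one node.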
>
>
>
> -Eric
>
>
>
> On Wed, Aug 21, 2013 at 7:54 PM, Slater, David M. <[EMAIL PROTECTED]>
> wrote:
>
> Hi Dave,
>
>
>
> The table is currently organizing netflow data with its rowID of
> timestamp_netflowRecordID, some columns corresponding to various netflow
> quantities, and one column representing the entire netflow in binary form.
>
>
>
> The table is about 1.2 TB, and I am scanning 5-40 GB per scan, which scans
> about 7-28 tablets.
>
>
>
> What do you mean by a custom load balancer? Do you mean balancing the data
> on ingest, or balancing the query load? What would you recommend for
> balancing the query load if I can only retrieve the data from a particular
> tablet?
>
>
>
> I’ve played with index/data caches, though I haven’t used readahead
> threads or max open files. Is that referring to rfiles?
>
>
>
> I’m noticing that most of the queries are CPU bound, and that read i/o is
> not being hit very hard. Is that a typical behavior for scans?
>
>
>
> Thanks,
> David
>
>
>
> *From:* Dave Marion [mailto:[EMAIL PROTECTED]]
>
> *Sent:* Wednesday, August 21, 2013 7:29 PM
> *To:* [EMAIL PROTECTED]
>
> *Subject:* RE: Straggler problem in Accumulo BatchScans
>
>
>
> How is the table organized?
>
> What percent of the table are you scanning in these large operations?
>
> Have you considered writing a custom load balancer?
>
>
>
> I don’t think that a tablet can be hosted on multiple servers. But you
> might be able to play around with the index/data caches, readahead threads
> (concurrent queries), and max open files to achieve better performance.
>
>
>
> *From:* Slater, David M. [mailto:[EMAIL PROTECTED]]
>
> *Sent:* Wednesday, August 21, 2013 7:09 PM
> *To:* [EMAIL PROTECTED]
> *Subject:* Straggler problem in Accumulo BatchScans
>
>
>
> Hey, I have a 7-node cluster running Accumulo 1.4.1 and Hadoop 1.0.4.
>
>
>
> When I run large BatchScanner operations, the number of tablets scanned
> per node is not uniform, leading to the overloaded nodes taking much longer
> to finish than the others. For queries that require all of the scans to
> finish before returning, this is a major latency issue. What are some