RE: Straggler problem in Accumulo BatchScans
Hi Dave,

The table currently stores netflow data. Its rowID is timestamp_netflowRecordID, with some columns corresponding to various netflow quantities and one column holding the entire netflow record in binary form.

The table is about 1.2 TB, and I am scanning 5-40 GB per scan, which touches about 7-28 tablets.
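
For reference, the figures above can be turned into the "percent of the table" that was asked about. This is only a back-of-the-envelope sketch: it assumes 1 TB = 1000 GB, and the function name `scan_fraction_percent` is made up for illustration.

```python
# Assumption: 1.2 TB is treated as 1200 GB (decimal units).
TABLE_GB = 1200.0


def scan_fraction_percent(scan_gb, table_gb=TABLE_GB):
    """Return the share of the table touched by one scan, in percent."""
    return 100.0 * scan_gb / table_gb


low = scan_fraction_percent(5)    # smallest scan mentioned above
high = scan_fraction_percent(40)  # largest scan mentioned above
print(f"{low:.2f}% to {high:.2f}% of the table per scan")
```

So each scan covers well under 5% of the table, which is consistent with only a few dozen tablets being involved.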

What do you mean by a custom load balancer? Do you mean balancing the data on ingest, or balancing the query load? What would you recommend for balancing the query load if I can only retrieve the data from a particular tablet?

I've played with the index/data caches, though I haven't used readahead threads or max open files. Does max open files refer to rfiles?

I'm noticing that most of the queries are CPU bound, and that read I/O is not being hit very hard. Is that typical behavior for scans?

Thanks,
David

From: Dave Marion [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, August 21, 2013 7:29 PM
To: [EMAIL PROTECTED]
Subject: RE: Straggler problem in Accumulo BatchScans

How is the table organized?
What percent of the table are you scanning in these large operations?
Have you considered writing a custom load balancer?

I don't think that a tablet can be hosted on multiple servers. But you might be able to play around with the index/data caches, readahead threads (concurrent queries), and max open files to achieve better performance.

From: Slater, David M. [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, August 21, 2013 7:09 PM
To: [EMAIL PROTECTED]
Subject: Straggler problem in Accumulo BatchScans

Hey, I have a 7-node network running Accumulo 1.4.1 and Hadoop 1.0.4.

When I run large BatchScanner operations, the number of tablets scanned per node is not uniform, leading to the overloaded nodes taking much longer to finish than the others. For queries that require all of the scans to finish before returning, this is a major latency issue. What are some practical means of load-balancing this to reduce delay?
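
To make the straggler effect concrete, here is a minimal sketch of why uneven tablet placement hurts when a query must wait for every scan. It is a toy model, not Accumulo code: the server names, the placements, and the assumption that each server works through its tablets one after another at a fixed cost per tablet are all hypothetical.

```python
from collections import Counter


def scan_makespan(tablet_hosts, secs_per_tablet=1.0):
    """Return (tablets per server, time until the slowest server finishes).

    Assumes each server scans its tablets serially at a fixed cost each,
    which is a simplification made only for illustration.
    """
    per_server = Counter(tablet_hosts)
    slowest = max(per_server.values()) * secs_per_tablet
    return dict(per_server), slowest


# Hypothetical placements of 28 tablets across 7 servers.
even = [f"ts{i % 7}" for i in range(28)]                        # 4 tablets each
skewed = ["ts0"] * 10 + [f"ts{1 + i % 6}" for i in range(18)]   # ts0 has 10, the rest 3 each

_, even_time = scan_makespan(even)
_, skewed_time = scan_makespan(skewed)
print(f"even: {even_time}s, skewed: {skewed_time}s")
```

In this model both placements do the same total work, but the query finishes only when the most heavily loaded server does, so the skewed placement takes 2.5 times as long.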

Is it possible for tablets to be hosted on multiple tablet servers, up to the replication factor of the underlying HDFS? Are there reasons this might be an undesirable design?

Thanks in advance,
David