HBase user mailing list: HBase parallel scanner performance


Narendra yadala  2012-04-19, 10:43
Michel Segel  2012-04-19, 11:10
Narendra yadala  2012-04-19, 12:03
Bijieshan  2012-04-19, 14:16
Narendra yadala  2012-04-19, 15:24
Bijieshan  2012-04-19, 16:28
Re: HBase parallel scanner performance
Narendra,

Are you trying to solve a real problem, or is this a class project?

Your solution doesn't scale. It's a non-starter. 130 seconds per iteration times 1 million iterations is how long? 130 million seconds, which is roughly 36,000 hours, or over 4 years to complete.
(The numbers are rough, but you get the idea...)

That's assuming that your table is static and doesn't change.

I didn't even ask whether you were attempting any sort of server-side filtering, which would reduce the amount of data you send back to the client, because it's a moot point.

Finer tuning is also moot.

So you insert a row in one table. You then do n^2 operations to pull the data back out.
The better solution is to insert the data into 2 tables, where you then have to do 2n operations to get the same results. That's per thread, by the way. So if you were running 10 threads, you would have 10n^2 operations versus 20n operations to get the same result set.

For a million-row table, that's roughly 1*10^13 operations vs. 2*10^7.

I don't believe I mentioned anything about HBase's internals and this solution works for any NoSQL database.
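In HBase terms specifically, a rough sketch of what the two-table layout could look like with the plain Java client (the table, column-family and column names "tweets", "url_index" and "d" are invented here for illustration):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class UrlIndexSketch {

  // Write path: every tweet insert now costs two puts instead of one.
  static void insertTweet(HTable tweets, HTable urlIndex,
                          String tweetId, String url, String body) throws IOException {
    Put tweet = new Put(Bytes.toBytes(tweetId));
    tweet.add(Bytes.toBytes("d"), Bytes.toBytes("url"), Bytes.toBytes(url));
    tweet.add(Bytes.toBytes("d"), Bytes.toBytes("body"), Bytes.toBytes(body));
    tweets.put(tweet);

    Put index = new Put(Bytes.toBytes(url));                // index row key = url
    index.add(Bytes.toBytes("d"), Bytes.toBytes(tweetId),   // one qualifier per tweet id
              HConstants.EMPTY_BYTE_ARRAY);
    urlIndex.put(index);
  }

  // Read path: for each row of the outer scan, a single Get on the index table
  // (the 2n case) instead of a second full scan of the tweets table (the n^2 case).
  static Result tweetsSharingUrl(HTable urlIndex, String url) throws IOException {
    return urlIndex.get(new Get(Bytes.toBytes(url)));
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable tweets = new HTable(conf, "tweets");
    HTable urlIndex = new HTable(conf, "url_index");
    insertTweet(tweets, urlIndex, "tweet-1", "http://example.com/x", "hello");
    System.out.println(tweetsSharingUrl(urlIndex, "http://example.com/x"));
  }
}

Every insert costs two puts, but every per-row lookup becomes one Get instead of another full scan, which is where the n^2 versus 2n difference comes from.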
Sent from a remote device. Please excuse any typos...

Mike Segel

On Apr 19, 2012, at 7:03 AM, Narendra yadala <[EMAIL PROTECTED]> wrote:

> Hi Michel
>
> Yes, that is exactly what I do in step 2. I am aware of the reason for the
> scanner timeout exceptions: they occur when the time between two
> consecutive next() calls on a scanner object exceeds the lease timeout. I
> increased the scanner timeout to 10 min on the region server and still I
> keep seeing the timeouts, so I reduced my scanner caching to 128.
>
> A full table scan takes 130 seconds and there are 2.2 million rows in the
> table as of now. Each row is around 2 KB in size. I measured the time for
> the full table scan by issuing the `count` command from the HBase shell.
>
> I kind of understood the fix that you are suggesting, but do I need to
> change the table structure to fix this problem? All I do is an n^2
> operation and even that fails with 10 different types of exceptions. It is
> mildly annoying that I need to know all the low-level storage details of
> HBase to do such a simple operation. And this is happening with just 14
> parallel scanners. I am wondering what would happen when there are
> thousands of parallel scanners.
>
> Please let me know if there is any configuration param change which would
> fix this issue.
>
> Thanks a lot
> Narendra
>
> On Thu, Apr 19, 2012 at 4:40 PM, Michel Segel <[EMAIL PROTECTED]> wrote:
>
>> So in your step 2 you have the following:
>> FOREACH row IN TABLE alpha:
>>    SELECT something
>>    FROM TABLE alpha
>>    WHERE alpha.url = row.url
>>
>> Right?
>> And you are wondering why you are getting timeouts?
>> ...
>> ...
>> And how long does it take to do a full table scan? ;-)
>> (there's more, but that's the first thing you should see...)
>>
>> Try creating a second table where you invert the URL and key pair such
>> that for each URL, you have a set of your alpha table's keys?
>>
>> Then you have the following...
>> FOREACH row IN TABLE alpha:
>>  FETCH key-set FROM beta
>>  WHERE beta.rowkey = alpha.url
>>
>> Note I use FETCH to signify that you should get a single row in response.
>>
>> Does this make sense?
>> (Your second table is actually an index of the URL column in your first
>> table.)
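In Java client terms, that FETCH is just a Get on the index table, or a batched multi-get if you want to cut round trips further; a rough sketch with invented names ("beta" keyed by url):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// One single-row lookup on the index table per row of the outer scan,
// instead of an inner full scan; batching the Gets reduces RPC round trips.
static Result[] fetchKeySets(HTable beta, List<String> urls) throws IOException {
  List<Get> gets = new ArrayList<Get>();
  for (String url : urls) {
    gets.add(new Get(Bytes.toBytes(url)));   // beta row key = url
  }
  return beta.get(gets);                     // each Result holds the alpha keys for one url
}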
>>
>> HTH
>>
>> Sent from a remote device. Please excuse any typos...
>>
>> Mike Segel
>>
>> On Apr 19, 2012, at 5:43 AM, Narendra yadala <[EMAIL PROTECTED]>
>> wrote:
>>
>>> I have an issue with my HBase cluster. We have a 4 node HBase/Hadoop
>>> (4*32 GB RAM and 4*6 TB disk space) cluster. We are using Cloudera
>>> distribution for maintaining our cluster. I have a single tweets table
>>> in which we store the tweets, one tweet per row (it has millions of rows
>>> currently).
>>>
>>> Now I try to run a Java batch (not a map reduce) which does the following:
>>>
>>>  1. Open a scanner over the tweet table and read the tweets one after
>>>  another. I set scanner caching to 128 rows as higher scanner caching is
Narendra yadala  2012-04-19, 15:04
Michael Segel  2012-04-19, 16:26
Narendra yadala  2012-04-19, 18:26
Michael Segel  2012-04-19, 18:33
S Ahmed  2012-05-19, 15:58