HBase, mail # user - Lease does not exist exceptions


Re: Lease does not exist exceptions
Eran Kutner 2011-10-19, 19:51
Hi J-D,
Thanks for the detailed explanation.
So if I understand correctly, the lease we're talking about is a scanner
lease and the timeout is between two scanner calls, correct? I think that
makes sense, because I now realize that the jobs that fail (some jobs continued
to fail even after reducing the number of map tasks as Stack suggested) use
filters to fetch relatively few rows out of a very large table, so they
could be spending a lot of time on the region server scanning rows until they
reach my setCaching value, which was 1000. Setting the caching value to 1
seems to allow these jobs to complete.
I think it has to be the above, since my rows are small, with just a few
columns and processing them is very quick.
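
To put rough numbers on that hypothesis, here is a back-of-the-envelope sketch in plain Java (no HBase dependency). The class name, the filter match fraction and the per-row scan cost are all made-up assumptions for illustration; only the 60 second timeout comes from this thread.

```java
// Rough estimate of how long one scanner next() call could keep the region
// server busy when a filter rejects most rows. Hypothetical numbers only.
final class ScannerLeaseEstimate {

    // Lease timeout mentioned in this thread: 60 seconds.
    static final long LEASE_TIMEOUT_MS = 60_000;

    /**
     * @param caching         rows the server collects per next() call (the setCaching value)
     * @param matchFraction   fraction of scanned rows that pass the filter (assumed)
     * @param msPerRowScanned server-side cost to examine one row, in ms (assumed)
     * @return estimated milliseconds spent on the server for one next() call
     */
    static double msPerNextCall(int caching, double matchFraction, double msPerRowScanned) {
        // To return `caching` matching rows, roughly caching / matchFraction rows are examined.
        double rowsScanned = caching / matchFraction;
        return rowsScanned * msPerRowScanned;
    }

    static boolean exceedsLease(int caching, double matchFraction, double msPerRowScanned) {
        return msPerNextCall(caching, matchFraction, msPerRowScanned) > LEASE_TIMEOUT_MS;
    }

    public static void main(String[] args) {
        double matchFraction = 0.0001; // assumed: 1 row in 10,000 passes the filter
        double msPerRow = 0.01;        // assumed: 10 microseconds to examine one row
        for (int caching : new int[] {1, 10, 1000}) {
            double ms = msPerNextCall(caching, matchFraction, msPerRow);
            boolean over = exceedsLease(caching, matchFraction, msPerRow);
            System.out.printf("caching=%d: ~%.0f ms per next() call, exceeds lease: %b%n",
                    caching, ms, over);
        }
    }
}
```

With these assumed figures, a caching value of 1000 lands well above the 60 second mark while a value of 1 stays far below it, which would be consistent with what I'm seeing. The real per-row cost on my cluster is not measured here.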

However, there are still a couple of things I don't understand:
1. What is the difference between setCaching and setBatch?
2. Examining the region server logs more closely than I did yesterday, I see
a lot of ClosedChannelExceptions in addition to the expired leases (but no
UnknownScannerException). Is that expected? You can see an excerpt of the
log from one of the region servers here: http://pastebin.com/NLcZTzsY

-eran

On Tue, Oct 18, 2011 at 23:57, Jean-Daniel Cryans <[EMAIL PROTECTED]>wrote:

> Actually the important setting is:
>
>
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setCaching(int)
>
> This decides how many rows are fetched each time the client exhausts its
> local cache and goes back to the server. Reasons to have setCaching low:
>
>  - Do you have a filter on? If so it could spend some time in the region
> server trying to find all the rows
>  - Are your rows fat? It might put a lot of memory pressure in the region
> server
>  - Are you spending a lot of time on each row, like Stack was saying? This
> could also be a side effect of inserting back into HBase. The issue I hit
> recently was that I was inserting a massive table into a tiny one (in terms
> of # of regions), and I was hitting the 90 seconds sleep because of too
> many store files. Just waiting that long was already exceeding the 60
> seconds lease timeout.
>
> Reasons to have setCaching high:
>
>  - Lots of tiny-ish rows that you process really really fast. Basically if
> your bottleneck is just getting the rows from HBase.
>
> I found that 1000 is a good number for our rows when we process them fast,
> but that 10 is just as good if we need to spend time on each row. YMMV.
>
> With all that said, I don't know if your caching is set to anything other
> than the default of 1, so this whole discussion could be a waste.
>
>
> Anyways, here's what I do see in your case. LeaseException is a rare one,
> usually you get UnknownScannerException (could it be that you have it too?
>  Do you have a log?). Looking at HRS.next, I see that the only way to get
> this is if you race with the ScannerListener. The method does this:
>
> InternalScanner s = this.scanners.get(scannerName);
> ...
> if (s == null) throw new UnknownScannerException("Name: " + scannerName);
> ...
> lease = this.leases.removeLease(scannerName);
>
> And when a scan expires (the lease was just removed from this.leases):
>
> LOG.info("Scanner " + this.scannerName + " lease expired");
> InternalScanner s = scanners.remove(this.scannerName);
>
> Which means that your exception happens when, after you get the
> InternalScanner in next() but before you reach this.leases.removeLease, the
> lease expiration has already started. If you get this all the time, there
> might be a bigger issue, or else I would expect you to see
> UnknownScannerException. It could be due to locking contention; I see that
> there's a synchronized in removeLease in the leases queue, but it seems
> unlikely since what happens in those sync blocks is fast.
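
The two orderings described above can be walked through with a toy model. This is a minimal single-threaded sketch in plain Java, not HBase code: the class, the method names and the use of IllegalStateException with a message prefix are all hypothetical stand-ins, and the two maps only mimic the scanners and leases tables from the quoted snippets.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the interleaving between next() and lease expiry.
// Not HBase code; names and exception handling are illustrative only.
final class LeaseRaceSketch {
    private final Map<String, Object> scanners = new HashMap<>();
    private final Map<String, Object> leases = new HashMap<>();

    void open(String name) {
        scanners.put(name, new Object());
        leases.put(name, new Object());
    }

    // Expiry, step 1: the lease is taken out of the lease table.
    void expiryRemovesLease(String name) { leases.remove(name); }

    // Expiry, step 2: the expiry handler removes the scanner.
    void expiryRemovesScanner(String name) { scanners.remove(name); }

    // next(), step 1: look up the scanner by name.
    Object nextLooksUpScanner(String name) {
        Object s = scanners.get(name);
        if (s == null) throw new IllegalStateException("UnknownScannerException: " + name);
        return s;
    }

    // next(), step 2: remove the lease while the call is being served.
    void nextRemovesLease(String name) {
        if (leases.remove(name) == null) {
            throw new IllegalStateException("LeaseException: lease '" + name + "' does not exist");
        }
    }

    // Replays one fixed ordering and reports which error, if any, comes out.
    static String run(boolean expiryFinishesBeforeLookup) {
        LeaseRaceSketch rs = new LeaseRaceSketch();
        String name = "scanner-1";
        rs.open(name);
        try {
            if (expiryFinishesBeforeLookup) {
                rs.expiryRemovesLease(name);
                rs.expiryRemovesScanner(name);
                rs.nextLooksUpScanner(name);   // scanner already gone
                rs.nextRemovesLease(name);
            } else {
                rs.nextLooksUpScanner(name);   // lookup succeeds
                rs.expiryRemovesLease(name);   // expiry starts in between
                rs.expiryRemovesScanner(name);
                rs.nextRemovesLease(name);     // lease already gone
            }
            return "ok";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(true));
        System.out.println(run(false));
    }
}
```

In this model the common case (expiry completes before the client comes back) surfaces as the unknown-scanner error, and the lease error only appears when expiry starts inside the narrow window between the two steps of next().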
>
> If you do get some UnknownScannerExceptions, they will show how long you
> took before going back to the server, with a message like "65340ms passed
> since the last invocation, timeout is currently set to 60000" (where 65340
> is a number