HBase, mail # user - Poor HBase map-reduce scan performance


Re: Poor HBase map-reduce scan performance
Ted Yu 2013-05-22, 22:57
Sandy:
Looking at patch v6 of HBASE-8420, I think it is different from your
approach below for the case of cache.size() == 0.

Maybe log a JIRA for further discussion?

On Wed, May 22, 2013 at 3:33 PM, Sandy Pratt <[EMAIL PROTECTED]> wrote:

> It seems to be in the ballpark of what I was getting at, but I haven't
> fully digested the code yet, so I can't say for sure.
>
> Here's what I'm getting at.  Looking at
> o.a.h.h.client.ClientScanner.next() in the 0.94.2 source I have loaded, I
> see there are three branches with respect to the cache:
>
> public Result next() throws IOException {
>
>
>   // If the scanner is closed and there's nothing left in the cache,
>   // next is a no-op.
>   if (cache.size() == 0 && this.closed) {
>     return null;
>   }
>
>   if (cache.size() == 0) {
>     // Request more results from RS
>     ...
>   }
>
>   if (cache.size() > 0) {
>     return cache.poll();
>   }
>
>   ...
>   return null;
>
> }
>
>
> I think that middle branch wants to change as follows (pseudo-code):
>
> if the cache size is below a certain threshold then
>   initiate asynchronous action to refill it
>   if there is no result to return until the cache refill completes then
>     block
>   done
> done
>
> Or something along those lines.  I haven't grokked the patch well enough
> yet to tell if that's what it does.  What I think is happening in the
> 0.94.2 code I've got is that it requests nothing until the cache is empty,
> then blocks until it's non-empty.  We want to eagerly and asynchronously
> refill the cache so that we ideally never have to block.
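>
> As a very rough Java sketch of what I mean (this is not the HBASE-8420
> patch; the class name, threshold, and batch size below are made up, and
> error handling is glossed over), a wrapper around ResultScanner could
> refill its own queue from a background thread and only block when a
> refill hasn't finished yet:
>
> import java.io.IOException;
> import java.util.concurrent.BlockingQueue;
> import java.util.concurrent.ExecutionException;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.Future;
> import java.util.concurrent.LinkedBlockingQueue;
>
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.client.ResultScanner;
>
> // Illustrative only: assumes a single consumer thread.
> public class PrefetchingScanner {
>   private final ResultScanner delegate;   // the real, blocking scanner
>   private final BlockingQueue<Result> cache = new LinkedBlockingQueue<Result>();
>   private final ExecutorService pool = Executors.newSingleThreadExecutor();
>   private final int threshold;            // refill when cache drops below this
>   private final int batchSize;            // rows to fetch per refill
>   private Future<?> pending;              // in-flight refill, if any
>   private volatile boolean exhausted;     // delegate has no more rows
>
>   public PrefetchingScanner(ResultScanner delegate, int threshold, int batchSize) {
>     this.delegate = delegate;
>     this.threshold = threshold;
>     this.batchSize = batchSize;
>   }
>
>   public Result next() throws IOException, InterruptedException, ExecutionException {
>     // Eagerly start an asynchronous refill before the cache runs dry.
>     if (!exhausted && cache.size() < threshold
>         && (pending == null || pending.isDone())) {
>       pending = pool.submit(new Runnable() {
>         public void run() {
>           try {
>             Result[] batch = delegate.next(batchSize);
>             if (batch.length == 0) {
>               exhausted = true;
>             }
>             for (Result r : batch) {
>               cache.add(r);
>             }
>           } catch (IOException e) {
>             throw new RuntimeException(e);
>           }
>         }
>       });
>     }
>     // Only block if the cache is empty and a refill is still in flight.
>     Result r = cache.poll();
>     if (r == null && pending != null) {
>       pending.get();   // wait for the refill to finish
>       r = cache.poll();
>     }
>     return r;          // null means the scan is done
>   }
>
>   public void close() {
>     pool.shutdownNow();
>     delegate.close();
>   }
> }
>
> The point is just that the refill RPC overlaps with processing of the
> rows already in the queue, instead of starting only after the queue is
> empty.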
>
>
> Sandy
>
>
> On 5/22/13 1:39 PM, "Ted Yu" <[EMAIL PROTECTED]> wrote:
>
> >Sandy:
> >Do you think the following JIRA would help with what you expect in this
> >regard?
> >
> >HBASE-8420 Port HBASE-6874 Implement prefetching for scanners from 0.89-fb
> >
> >Cheers
> >
> >On Wed, May 22, 2013 at 1:29 PM, Sandy Pratt <[EMAIL PROTECTED]> wrote:
> >
> >> I found this thread on search-hadoop.com just now because I've been
> >> wrestling with the same issue for a while and have as yet been unable to
> >> solve it.  However, I think I have an idea of the problem.  My theory is
> >> based on assumptions about what's going on in HBase and HDFS internally,
> >> so please correct me if I'm wrong.
> >>
> >> Briefly, I think the issue is that sequential reads from HDFS are
> >> pipelined, whereas sequential reads from HBase are not.  Therefore,
> >> sequential reads from HDFS tend to keep the IO subsystem saturated,
> >> while sequential reads from HBase allow it to idle for a relatively
> >> large proportion of time.
> >>
> >> To make this more concrete, suppose that I'm reading N bytes of data
> >> from a file in HDFS.  I issue the calls to open the file and begin to
> >> read (from an InputStream, for example).  As I'm reading byte 1 of the
> >> stream at my client, the datanode is reading byte M where 1 < M <= N
> >> from disk.  Thus, three activities tend to happen concurrently for the
> >> most part (disregarding the beginning and end of the file): 1)
> >> processing at the client; 2) streaming over the network from datanode
> >> to client; and 3) reading data from disk at the datanode.  The
> >> proportion of time these three activities overlap tends towards 100%
> >> as N -> infinity.
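> >>
> >> As a rough sketch of that read pattern (the path and buffer size below
> >> are arbitrary), the HDFS side is just a plain sequential read loop:
> >>
> >> import java.io.IOException;
> >> import org.apache.hadoop.conf.Configuration;
> >> import org.apache.hadoop.fs.FSDataInputStream;
> >> import org.apache.hadoop.fs.FileSystem;
> >> import org.apache.hadoop.fs.Path;
> >>
> >> public class SequentialHdfsRead {
> >>   public static void main(String[] args) throws IOException {
> >>     FileSystem fs = FileSystem.get(new Configuration());
> >>     FSDataInputStream in = fs.open(new Path("/data/example.dat"));
> >>     try {
> >>       byte[] buf = new byte[64 * 1024];
> >>       long total = 0;
> >>       int n;
> >>       // While we process one buffer here, the datanode keeps streaming
> >>       // the next packets over the socket, so client CPU, network, and
> >>       // disk all stay busy at the same time.
> >>       while ((n = in.read(buf)) != -1) {
> >>         total += n;  // stand-in for real per-record processing
> >>       }
> >>       System.out.println("read " + total + " bytes");
> >>     } finally {
> >>       in.close();
> >>       fs.close();
> >>     }
> >>   }
> >> }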
> >>
> >> Now suppose I read a batch of R records from HBase (where R = whatever
> >> scanner caching happens to be).  As I understand it, I issue my call
> >> to ResultScanner.next(), and this causes the RegionServer to block as
> >> if on a page fault while it loads enough HFile blocks from disk to
> >> cover the R records I (implicitly) requested.  After the blocks are
> >> loaded into the block cache on the RS, the RS returns R records to me
> >> over the network.  Then I process the R records locally.  When they
> >> are exhausted, this cycle repeats.  The notable upshot is that while
> >> the RS is faulting HFile blocks into the cache, my client is blocked.
> >> Furthermore, while my client is