Re: Poor HBase map-reduce scan performance
Sandy:
Looking at patch v6 of HBASE-8420, I think it is different from your
approach below for the case of cache.size() == 0.

Maybe log a JIRA for further discussion?

On Wed, May 22, 2013 at 3:33 PM, Sandy Pratt <[EMAIL PROTECTED]> wrote:

> It seems to be in the ballpark of what I was getting at, but I haven't
> fully digested the code yet, so I can't say for sure.
>
> Here's what I'm getting at.  Looking at
> o.a.h.h.client.ClientScanner.next() in the 0.94.2 source I have loaded, I
> see there are three branches with respect to the cache:
>
> public Result next() throws IOException {
>
>   // If the scanner is closed and there's nothing left in the cache,
>   // next is a no-op.
>   if (cache.size() == 0 && this.closed) {
>     return null;
>   }
>
>   if (cache.size() == 0) {
>     // Request more results from RS
>     ...
>   }
>
>   if (cache.size() > 0) {
>     return cache.poll();
>   }
>
>   ...
>   return null;
> }
>
>
> I think that middle branch wants to change as follows (pseudo-code):
>
> if the cache size is below a certain threshold then
>   initiate asynchronous action to refill it
>   if there is no result to return until the cache refill completes then
>     block
>   done
> done
>
> Or something along those lines.  I haven't grokked the patch well enough
> yet to tell if that's what it does.  What I think is happening in the
> 0.94.2 code I've got is that it requests nothing until the cache is empty,
> then blocks until it's non-empty.  We want to eagerly and asynchronously
> refill the cache so that we ideally never have to block.
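>
> A minimal sketch of that idea, written as a client-side wrapper around a
> plain ResultScanner (the PrefetchingScanner name, the half-batch refill
> threshold, and the single-thread executor are illustrative assumptions,
> not taken from the patch):
>
> import java.io.Closeable;
> import java.io.IOException;
> import java.util.ArrayDeque;
> import java.util.Collections;
> import java.util.Queue;
> import java.util.concurrent.Callable;
> import java.util.concurrent.ExecutionException;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.Future;
>
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.client.ResultScanner;
>
> // Sketch only: assumes next() and close() are called from a single thread.
> public class PrefetchingScanner implements Closeable {
>
>   private final ResultScanner inner;
>   private final int batchSize;
>   private final ExecutorService exec = Executors.newSingleThreadExecutor();
>   private final Queue<Result> cache = new ArrayDeque<Result>();
>   private Future<Result[]> inFlight;  // at most one fetch outstanding
>   private boolean exhausted;
>
>   public PrefetchingScanner(ResultScanner inner, int batchSize) {
>     this.inner = inner;
>     this.batchSize = batchSize;
>     this.inFlight = submitFetch();  // start filling before the first next()
>   }
>
>   private Future<Result[]> submitFetch() {
>     return exec.submit(new Callable<Result[]>() {
>       public Result[] call() throws IOException {
>         return inner.next(batchSize);  // the blocking RPC, off the caller's thread
>       }
>     });
>   }
>
>   public Result next() throws IOException {
>     try {
>       // Eager refill: once the cache is half drained, kick off the next
>       // fetch so the RPC overlaps with client-side processing.
>       if (inFlight == null && !exhausted && cache.size() <= batchSize / 2) {
>         inFlight = submitFetch();
>       }
>       if (cache.isEmpty()) {
>         if (inFlight == null) {
>           return null;  // scanner exhausted
>         }
>         Result[] batch = inFlight.get();  // block only if the prefetch isn't done yet
>         inFlight = null;
>         if (batch == null || batch.length == 0) {
>           exhausted = true;
>           return null;
>         }
>         Collections.addAll(cache, batch);
>       }
>       return cache.poll();
>     } catch (InterruptedException e) {
>       Thread.currentThread().interrupt();
>       throw new IOException(e);
>     } catch (ExecutionException e) {
>       throw new IOException(e.getCause());
>     }
>   }
>
>   public void close() {
>     exec.shutdownNow();
>     inner.close();
>   }
> }
>
> With something like this, the RPC for batch k+1 overlaps the processing of
> batch k, which is roughly the pipelining the HDFS case gets for free.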
>
>
> Sandy
>
>
> On 5/22/13 1:39 PM, "Ted Yu" <[EMAIL PROTECTED]> wrote:
>
> >Sandy:
> >Do you think the following JIRA would help with what you expect in this
> >regard?
> >
> >HBASE-8420 Port HBASE-6874 Implement prefetching for scanners from 0.89-fb
> >
> >Cheers
> >
> >On Wed, May 22, 2013 at 1:29 PM, Sandy Pratt <[EMAIL PROTECTED]> wrote:
> >
> >> I found this thread on search-hadoop.com just now because I've been
> >> wrestling with the same issue for a while and have as yet been unable to
> >> solve it.  However, I think I have an idea of the problem.  My theory is
> >> based on assumptions about what's going on in HBase and HDFS internally,
> >> so please correct me if I'm wrong.
> >>
> >> Briefly, I think the issue is that sequential reads from HDFS are
> >> pipelined, whereas sequential reads from HBase are not.  Therefore,
> >> sequential reads from HDFS tend to keep the IO subsystem saturated,
> >>while
> >> sequential reads from HBase allow it to idle for a relatively large
> >> proportion of time.
> >>
> >> To make this more concrete, suppose that I'm reading N bytes of data
> >>from
> >> a file in HDFS.  I issue the calls to open the file and begin to read
> >> (from an InputStream, for example).  As I'm reading byte 1 of the stream
> >> at my client, the datanode is reading byte M where 1 < M <= N from disk.
> >> Thus, three activities tend to happen concurrently for the most part
> >> (disregarding the beginning and end of the file): 1) processing at the
> >> client; 2) streaming over the network from datanode to client; and 3)
> >> reading data from disk at the datanode.  The proportion of time these
> >> three activities overlap tends towards 100% as N -> infinity.
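> >>
> >> For reference, that pipelined HDFS case is just a plain streaming read;
> >> a minimal sketch (the path, buffer size, and process() stub here are
> >> illustrative):
> >>
> >> import java.io.IOException;
> >>
> >> import org.apache.hadoop.conf.Configuration;
> >> import org.apache.hadoop.fs.FSDataInputStream;
> >> import org.apache.hadoop.fs.FileSystem;
> >> import org.apache.hadoop.fs.Path;
> >>
> >> public class HdfsStreamingRead {
> >>
> >>   public static void main(String[] args) throws IOException {
> >>     FileSystem fs = FileSystem.get(new Configuration());
> >>     FSDataInputStream in = fs.open(new Path("/data/example.dat"));
> >>     try {
> >>       byte[] buf = new byte[64 * 1024];
> >>       int n;
> >>       // While process() works on one chunk, the datanode is already
> >>       // reading further ahead in the file: disk, network, and client
> >>       // CPU stay busy at the same time.
> >>       while ((n = in.read(buf)) > 0) {
> >>         process(buf, n);
> >>       }
> >>     } finally {
> >>       in.close();
> >>     }
> >>   }
> >>
> >>   private static void process(byte[] buf, int len) {
> >>     // client-side work per chunk
> >>   }
> >> }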
> >>
> >> Now suppose I read a batch of R records from HBase (where R = whatever
> >> scanner caching happens to be).  As I understand it, I issue my call to
> >> ResultScanner.next(), and this causes the RegionServer to block as if
> >>on a
> >> page fault while it loads enough HFile blocks from disk to cover the R
> >> records I (implicitly) requested.  After the blocks are loaded into the
> >> block cache on the RS, the RS returns R records to me over the network.
> >> Then I process the R records locally.  When they are exhausted, this
> >>cycle
> >> repeats.  The notable upshot is that while the RS is faulting HFile
> >>blocks
> >> into the cache, my client is blocked.  Furthermore, while my client is